Multiple Regression Analysis Using sklearn Module in Python

Regression analysis is used in machine learning for various prediction and classification problems. In this article, we will discuss multiple regression analysis, its assumptions, and its uses. We will also implement multiple regression analysis using the sklearn module in Python.

What is Multiple Regression Analysis?

Multiple regression analysis is a linear regression technique used to create prediction and classification models when we have a dataset in which independent variables and dependent variables are linearly related.

In multiple regression analysis, we try to find the relationship between the independent and dependent variables. For this, we try to find the best fit line so that the predicted dependent variable for a set of independent variables is closest to the actual dependent variable.

Multiple regression finds its applicability in various tasks like demand forecasting, time series analysis, and classification.

Multiple Regression Analysis Definition

Suppose that we are given a dataset containing independent variables X1, X2, X3, ..., XN and a dependent variable Y. We can define multiple regression analysis as the process of finding the coefficients A0 to AN such that A0 + A1*X1i + A2*X2i + A3*X3i + ... + AN*XNi is as close as possible to Yi for each entry i in the dataset. In other words, we need to find the following regression equation.

Y = A0 + A1*X1 + A2*X2 + A3*X3 + ... + AN*XN

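To derive the above equation, we use the least-squares method. The coefficients are chosen so that the sum of the squared errors is as small as possible, where the error for each entry is the difference between the actual and the predicted value of the dependent variable.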
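The least-squares idea can be sketched with NumPy. This is only an illustration on a made-up toy dataset (the values of X and the relation Y = 1 + 2*X1 + 3*X2 are assumptions chosen for the example, not data from this article). It uses numpy.linalg.lstsq to find the coefficients and a small helper, sum_of_squared_errors(), which is a hypothetical name introduced here.

```python
import numpy as np

# Hypothetical toy dataset: each row holds two independent variables (X1, X2).
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0]])
# Dependent variable generated exactly from Y = 1 + 2*X1 + 3*X2.
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so that the constant term A0 is estimated too.
design = np.column_stack([np.ones(len(X)), X])

# Least squares: pick the coefficients that minimize the sum of squared errors.
coefficients, _, _, _ = np.linalg.lstsq(design, y, rcond=None)


def sum_of_squared_errors(coefs):
    """Sum of squared differences between actual and predicted Y."""
    return float(np.sum((y - design @ coefs) ** 2))


print("Coefficients A0, A1, A2:", coefficients)
print("SSE at the least-squares solution:", sum_of_squared_errors(coefficients))
print("SSE for another guess:", sum_of_squared_errors(np.array([0.0, 2.0, 3.0])))
```

Because the toy data follows the equation exactly, the recovered coefficients should be very close to 1, 2, and 3, and any other choice of coefficients gives a larger sum of squared errors.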

Now that we have discussed the definition of multiple regression analysis, let us discuss an example of multiple regression analysis with its implementation in Python.

Multiple Regression Analysis Example

Suppose that we are given a dataset containing two independent variables and a dependent variable as shown in the following table.

Weight    Radius    Height
30        5         100
40        7.8       123
50        9.9       155
60        12.7      178
70        14.6      221
80        16
Dataset for multiple regression analysis

Here, we have been given the weight of a pillar, its radius, and the corresponding height of the pillar. The weight and radius attributes are the independent variables, while height is the dependent variable.

In multiple regression analysis, we need to find the coefficients A0, A1, and A2 such that A0 + A1*weight + A2*radius is as close as possible to the height of the pillar for any given pair of weight and radius.

Upon calculation, you can find the regression equation as follows.

height = 8.10913*weight - 21.3242*radius - 36.81461
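As a quick sanity check, the equation can be evaluated on the five complete rows of the table. The sketch below hard-codes the rounded coefficients from the equation above. The helper name predicted_height() is introduced here only for illustration.

```python
# The five complete rows of the table.
weights = [30, 40, 50, 60, 70]
radii = [5, 7.8, 9.9, 12.7, 14.6]
heights = [100, 123, 155, 178, 221]


def predicted_height(weight, radius):
    """Evaluate the regression equation with its rounded coefficients."""
    return 8.10913 * weight - 21.3242 * radius - 36.81461


# Residual = actual height minus predicted height, one per row.
residuals = []
for w, r, h in zip(weights, radii, heights):
    estimate = predicted_height(w, r)
    residuals.append(h - estimate)
    print(f"weight={w}, radius={r}: predicted {estimate:.2f}, actual {h}")
```

Each predicted height should land within a few units of the actual height, which is what "closest" means for a best-fit equation: the errors are small overall, not zero for every row.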

Now that we have an overview of what multiple regression analysis involves, let us implement it using the sklearn module in Python.

Multiple Regression Using sklearn in Python

To implement multiple regression analysis, we will use the LinearRegression class defined in the sklearn module along with two of its methods, fit() and predict().

  • The LinearRegression() constructor is used to create a linear regression model without any training data. After execution, it returns an untrained LinearRegression object.
  • The fit() method is used to train the linear regression model. It takes the values of the independent variables of the dataset as its first input argument and the values of the dependent variable as its second input argument. After execution, it returns the trained model, which is the same LinearRegression object.
  • The predict() method is used to predict the value of the dependent variable for a given set of independent variables. It takes a list of entries, each containing the values of the independent variables, and returns an array of the corresponding values of the dependent variable calculated from the linear regression model.

Before implementing the multiple regression analysis using the sklearn module in Python, we will convert the dataset into the desired format.

  • As there are multiple independent variables in our dataset, we will create a list of tuples using the independent variables of the dataset for training the machine learning model.
  • If there are N independent variables, namely X1, X2, X3, ..., XN, in the dataset, each independent variable will have its own list. In the lists of independent variables, the elements at the same position in each list correspond to the same entry in the dataset.
  • From the lists of independent variables, we will create a list of tuples where each tuple contains N elements. The tuple at position i in the list should contain the values X1i, X2i, X3i, ..., XNi. Thus, the tuple at position i in the list of tuples will represent the ith entry in the dataset.
  • To create the list of tuples from the lists of each attribute, we will use the zip() function. The zip() function takes the lists X1, X2, X3, ..., XN as its input arguments and returns an iterator of tuples, which we will convert into a list using the list() function.
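The conversion described in the points above can be sketched as follows, using the weight and radius columns of the table as the two attribute lists.

```python
# One list per independent variable; position i in each list is entry i.
weights = [30, 40, 50, 60, 70]
radii = [5, 7.8, 9.9, 12.7, 14.6]

# zip() pairs up the elements position by position and returns an iterator,
# so list() is needed to get an actual list of tuples.
inputs = list(zip(weights, radii))

print(inputs)
# [(30, 5), (40, 7.8), (50, 9.9), (60, 12.7), (70, 14.6)]
```

With more independent variables, you would simply pass more lists to zip(), and each tuple would grow by one element per list.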

Once we get the list of tuples, we will use it as the vector containing independent variables. To implement multiple regression analysis using the sklearn module in Python, we will use the following steps.

  • First, we will create a linear regression model using the LinearRegression() constructor.
  • After that, we will use the fit() method to train the linear regression model. The fit() method takes the list of tuples created from the independent variables as its first input argument and the list containing the dependent variable as its second input argument. After execution, it will return the trained machine learning model.
  • Once we get the trained machine learning model, we can access the coefficients of the independent variables in the trained model.
  • To access the coefficients of the independent variables in the trained linear regression model, you can use the coef_ attribute. It contains the coefficients of all the independent variables in a NumPy array, in the same order as the variables appear in each tuple.
  • To access the constant term in the linear regression equation, you can use the intercept_ attribute of the linear regression model.

Following is the implementation of multiple regression analysis using the sklearn module in Python for the dataset given in the table.

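from sklearn.linear_model import LinearRegression
import numpy

# Independent variables from the table
weights = [30, 40, 50, 60, 70]
radii = [5, 7.8, 9.9, 12.7, 14.6]
# Each tuple is one entry of the dataset: (weight, radius)
inputs = list(zip(weights, radii))
# Dependent variable, reshaped into a column vector
heights = numpy.array([100, 123, 155, 178, 221]).reshape(-1, 1)

# Create the model and train it on the dataset
regression_model = LinearRegression()
regression_model.fit(inputs, heights)
print("The Coefficients are:", regression_model.coef_)
print("The intercept is:", regression_model.intercept_)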

Output:

The Coefficients are: [[  8.10913242 -21.32420091]]
The intercept is: [-36.81461187]
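The nested brackets in the output above come from reshaping heights into a column vector, which makes coef_ a two-dimensional array and intercept_ a one-element array. As an alternative sketch (a variation on the article's code, not a replacement for it), you can pass the dependent variable as a plain one-dimensional list. In that case, coef_ is a flat array that can be unpacked directly, and intercept_ is a single number.

```python
from sklearn.linear_model import LinearRegression

weights = [30, 40, 50, 60, 70]
radii = [5, 7.8, 9.9, 12.7, 14.6]
inputs = list(zip(weights, radii))
# Dependent variable kept as a plain 1-D list (no reshape).
heights = [100, 123, 155, 178, 221]

model = LinearRegression()
model.fit(inputs, heights)

# With a 1-D target, coef_ holds one coefficient per independent variable.
weight_coef, radius_coef = model.coef_
intercept = model.intercept_

print("Coefficient of weight:", round(weight_coef, 5))
print("Coefficient of radius:", round(radius_coef, 5))
print("Intercept:", round(intercept, 5))
```

The fitted values are the same as before. Only the shape of the returned arrays changes.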

To predict the value of the dependent variable for a new set of independent variables, you can use the predict() method.

The predict() method, when invoked on a trained linear regression model, accepts a list of tuples of independent variables as its input argument. After execution, it returns an array of predicted values for the dependent variable. You can observe this in the following example.

from sklearn.linear_model import LinearRegression
import numpy

# Training data: (weight, radius) tuples and the corresponding heights
weights = [30, 40, 50, 60, 70]
radii = [5, 7.8, 9.9, 12.7, 14.6]
inputs = list(zip(weights, radii))
heights = numpy.array([100, 123, 155, 178, 221]).reshape(-1, 1)

# Create the model and train it on the dataset
regression_model = LinearRegression()
regression_model.fit(inputs, heights)

# New entries for which the height is unknown
input_values = [(80, 16), (90, 18.5)]
print("The input values are:", input_values)
output_values = regression_model.predict(input_values)
print("The output values are:", output_values)

Output:

The input values are: [(80, 16), (90, 18.5)]
The output values are: [[270.72876712]
 [298.50958904]]
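To see what predict() does for a linear regression model, you can reproduce its result by hand: multiply each independent variable by its coefficient, add the products, and add the intercept. The following sketch assumes a one-dimensional target so that the arithmetic stays simple. The variable names manual and from_predict are introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

weights = [30, 40, 50, 60, 70]
radii = [5, 7.8, 9.9, 12.7, 14.6]
inputs = list(zip(weights, radii))
heights = [100, 123, 155, 178, 221]

model = LinearRegression().fit(inputs, heights)

new_inputs = np.array([(80, 16), (90, 18.5)])

# Manual evaluation of A0 + A1*weight + A2*radius for every new entry.
manual = model.intercept_ + new_inputs @ model.coef_
# The same computation performed by the model itself.
from_predict = model.predict(new_inputs)

print("Computed by hand:", manual)
print("Computed by predict():", from_predict)
```

Both arrays should contain the same two values as the output shown above, up to floating-point rounding.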

Suggestions to Build a Better Model for Multiple Regression Analysis

Implementing multiple regression analysis using a linear regression model is simple. However, it has its limitations and drawbacks. Let us discuss some of the limitations and possible solutions that will help you build better machine learning applications using regression.

  • Linear regression techniques work with numeric data. If your data contains categorical attributes, you must encode them as numbers during data cleaning, for example by using one-hot encoding.
  • The linear regression algorithm assumes that all the independent variables are linearly related to the dependent variable. If your dataset contains independent attributes that are not linearly related to the dependent variable, the machine learning model will not give accurate results.
  • To verify whether an independent variable is linearly related to the dependent variable, you can use scatter plots.
  • Linear regression also assumes that the independent variables aren't highly correlated with each other. If they are, the coefficients become unstable and hard to interpret, and the model may not give accurate results. During data preprocessing, you must check if any of the independent variables are highly correlated with each other. If there exists a pair of highly correlated variables, you can choose to drop one of the attributes from the dataset.
  • After implementing the regression model, you should verify that the errors (the differences between the actual and predicted values) are approximately normally distributed around the best-fit regression line. Additionally, the variance of the errors should remain constant along the entire regression line.
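Two of the checks above can be sketched in a few lines: measuring the correlation between independent variables and computing the errors of a trained model. The example reuses the pillar dataset from this article. Using numpy.corrcoef() for the correlation is one possible choice, and the 0.9 threshold is an arbitrary assumption for illustration, not a fixed rule.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

weights = np.array([30, 40, 50, 60, 70])
radii = np.array([5, 7.8, 9.9, 12.7, 14.6])
heights = np.array([100, 123, 155, 178, 221])

# Pearson correlation between the two independent variables.
correlation = np.corrcoef(weights, radii)[0, 1]
print("Correlation between weight and radius:", round(correlation, 4))
if abs(correlation) > 0.9:  # threshold chosen only for this example
    print("The independent variables are highly correlated.")

# Errors (residuals) of the trained model: actual minus predicted values.
X = np.column_stack([weights, radii])
model = LinearRegression().fit(X, heights)
residuals = heights - model.predict(X)
print("Residuals:", np.round(residuals, 2))
```

In this small dataset, weight and radius turn out to be very strongly correlated, so it is a good illustration of the situation in which you might drop one of the two attributes. With only five entries, the residuals are too few to judge normality or constant variance. On a real dataset, you could plot them as a histogram or against the predicted values.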

Conclusion

In this article, we have discussed the basics and examples of multiple regression analysis. We also discussed the implementation of multiple regression analysis using the sklearn module in Python. 

I hope you enjoyed reading this article. To learn more about programming, you can read this article on getting started with web applications using PHP. You might also like this article on how to build a chatbot using Python.
