Multiple Regression Analysis Using sklearn Module in Python
Regression analysis is used in machine learning for various prediction and classification problems. In this article, we will discuss multiple regression analysis, its assumptions, and its uses. We will also implement multiple regression analysis using the sklearn module in Python.
What is Multiple Regression Analysis?
Multiple regression analysis is a linear regression technique that uses two or more independent variables to predict a dependent variable. It is used to create prediction and classification models when the independent variables in the dataset are linearly related to the dependent variable.
In multiple regression analysis, we try to find the relationship between the independent and dependent variables. For this, we try to find the best fit line so that the predicted dependent variable for a set of independent variables is closest to the actual dependent variable.
Multiple regression is applied in various tasks such as demand forecasting, time series analysis, and classification.
Multiple Regression Analysis Definition
Suppose that we are given a dataset containing independent variables X1, X2, X3, ..., XN and a dependent variable Y. We can define multiple regression analysis as the process of finding the coefficients A0 to AN such that the value A0 + A1*X1i + A2*X2i + A3*X3i + ... + AN*XNi is closest to Yi for each entry i in the dataset. In other words, we need to find the equation of the following regression line.
Y = A0 + A1*X1 + A2*X2 + A3*X3 + ... + AN*XN
To derive this equation, we use the least-squares method. It chooses the coefficients that minimize the sum of the squared errors, where the error for entry i is the difference between Yi and the value predicted by the equation.
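The least-squares idea can be sketched with NumPy. This is only a minimal illustration, not the implementation used later in this article: the data below is hypothetical and generated from y = 1 + 2*x1 + 3*x2, so a least-squares fit should recover those three coefficients.

```python
import numpy as np

# Hypothetical data (not from the article), generated from y = 1 + 2*x1 + 3*x2.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = 1 + 2 * x1 + 3 * x2

# Design matrix: a column of ones for A0, then one column per independent variable.
design = np.column_stack([np.ones_like(x1), x1, x2])

# lstsq returns the coefficients that minimize the sum of squared errors.
coefficients, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
print("A0, A1, A2:", coefficients)
```

Because the hypothetical data has no noise, the recovered coefficients match the generating equation up to floating-point error. With real data, the fit only minimizes the error instead of removing it.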
Now that we have discussed the definition of multiple regression analysis, let us discuss an example of multiple regression analysis with its implementation in Python.
Multiple Regression Analysis Example
Suppose that we are given a dataset containing two independent variables and a dependent variable as shown in the following table.
weight  radius  height
30      5       100
40      7.8     123
50      9.9     155
60      12.7    178
70      14.6    221
Here, we have been given the weight of a pillar, its radius, and the corresponding height of the pillar. The weight and radius attributes are the independent variables, while height is the dependent variable.
In multiple regression analysis, we need to find the coefficients A0, A1, and A2 such that
A0 + A1*weight + A2*radius is closest to the height of the pillar for each weight and radius pair in the dataset.
Upon calculation, you can find the regression equation as follows.
height = 8.10913*weight - 21.3242*radius - 36.81461.
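As a quick sanity check, the equation can be evaluated on the dataset to see how close it comes to the actual heights. The sketch below uses the data values from the implementation later in this article; the helper name predicted_height is only illustrative.

```python
# Dataset values as used in the article's implementation further below.
weights = [30, 40, 50, 60, 70]
radii = [5, 7.8, 9.9, 12.7, 14.6]
heights = [100, 123, 155, 178, 221]


def predicted_height(weight, radius):
    # Regression equation stated above, with its rounded coefficients.
    return 8.10913 * weight - 21.3242 * radius - 36.81461


# Error (residual) for each entry: actual height minus predicted height.
residuals = [h - predicted_height(w, r) for w, r, h in zip(weights, radii, heights)]

for w, r, h, e in zip(weights, radii, heights, residuals):
    print(f"weight={w}, radius={r}, actual={h}, predicted={h - e:.2f}, error={e:.2f}")

# The least-squares method minimizes this quantity.
print("Sum of squared errors:", sum(e * e for e in residuals))
```

Each predicted height lands within a few units of the actual value, and the errors roughly cancel out, which is what a least-squares fit with a constant term is expected to produce.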
Now that we have an overview of what multiple regression analysis involves, let us implement it using the sklearn module in Python.
Multiple Regression Using sklearn in Python
To implement multiple regression analysis, we will use the LinearRegression class defined in the sklearn.linear_model module, along with its fit() method and predict() method.
Calling LinearRegression() creates a linear regression model without any training data. After execution, it returns an untrained linear regression model.
The fit() method is used to train the linear regression model. It takes the independent variables of the dataset as its first input argument, with one row per entry, and the values of the dependent variable as its second input argument. After execution, it returns the trained linear regression model.
The predict() method is used to predict the value of the dependent variable for a given set of independent variables. It takes a list of rows of independent variables and returns an array of the corresponding dependent variable values calculated from the linear regression model.
Before implementing the multiple regression analysis using the sklearn module in Python, we will convert the dataset into the desired format.
- Since there are multiple independent variables in our dataset, we will create a list of tuples from the independent variables to train the machine learning model.
- Suppose there are N independent variables, namely X1, X2, X3, ..., XN, in the dataset. Each independent variable will have its own list. In these lists, the elements at the same position correspond to the same entry in the dataset.
- From the lists of independent variables, we will create a list of tuples where each tuple contains N elements. The tuple at position i in the list should contain the values X1i, X2i, X3i, ..., XNi. Thus, the tuple at position i in the list of tuples will represent the ith entry in the dataset.
- To create the list of tuples from the lists of each attribute, we will use the zip() function. The zip() function takes the lists X1, X2, X3, ..., XN as its input arguments and returns an iterator of tuples, which we convert into a list using the list() function.
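The conversion described above can be sketched as follows for the two independent variables of the pillar dataset. The variable names match those used in the implementation later in this article.

```python
# One list per independent variable; equal positions belong to the same entry.
weights = [30, 40, 50, 60, 70]
radii = [5, 7.8, 9.9, 12.7, 14.6]

# zip() pairs up the elements position by position; list() materializes the result.
inputs = list(zip(weights, radii))
print(inputs)
# prints [(30, 5), (40, 7.8), (50, 9.9), (60, 12.7), (70, 14.6)]
```

Each tuple is one row of the dataset, which is the layout the fit() and predict() methods expect for the independent variables.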
Once we get the list of tuples, we will use it as the vector containing independent variables. To implement multiple regression analysis using the sklearn module in Python, we will use the following steps.
- First, we will create a linear regression model by calling LinearRegression().
- After that, we will use the fit() method to train the linear regression model. The fit() method takes the list of tuples created from the independent variables as its first input argument and the list containing the dependent variable as its second input argument. After execution, it returns the trained machine learning model.
- Once we get the trained machine learning model, we can access the coefficients of the independent variables in the trained model.
- To access the coefficients of the independent variables in the trained linear regression model, you can use the coef_ attribute. It contains the coefficients of all the independent variables in an array.
- To access the constant term in the linear regression equation, you can use the intercept_ attribute of the linear regression model.
Following is the implementation of multiple regression analysis using the sklearn module in Python for the dataset given in the table.
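from sklearn.linear_model import LinearRegression
import numpy

# Independent variables, one list per attribute
weights=[30,40,50,60,70]
radii=[5,7.8,9.9,12.7,14.6]
# Combine the attribute lists into one tuple per dataset entry
inputs=list(zip(weights,radii))
# Dependent variable as a column vector
heights=numpy.array([100,123,155,178,221]).reshape(-1, 1)
# Create and train the linear regression model
regression_model=LinearRegression()
regression_model.fit(inputs,heights)
print("The Coefficients are:",regression_model.coef_)
print("The intercept is:",regression_model.intercept_)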
The Coefficients are: [[ 8.10913242 -21.32420091]]
The intercept is: [-36.81461187]
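The nested brackets in the output appear because the heights were reshaped into a column vector before training. As a small variation, assuming a plain one-dimensional list is acceptable for your use case, passing the heights without reshaping gives a flat coef_ array and a single number for intercept_, which are easier to plug into the regression equation. The names model, a0, a1, and a2 are only illustrative.

```python
from sklearn.linear_model import LinearRegression

weights = [30, 40, 50, 60, 70]
radii = [5, 7.8, 9.9, 12.7, 14.6]
heights = [100, 123, 155, 178, 221]  # 1-D target, no reshape

# fit() returns the trained model, so creation and training can be chained.
model = LinearRegression().fit(list(zip(weights, radii)), heights)

a1, a2 = model.coef_    # flat array: one coefficient per independent variable
a0 = model.intercept_   # a single number instead of a one-element array
print(f"height = {a1:.5f}*weight + ({a2:.5f})*radius + ({a0:.5f})")
```

The fitted values are the same as before; only the shape of the returned attributes changes.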
To predict the value of the dependent variable for a new set of independent variables, you can use the predict() method. The predict() method, when invoked on a trained linear regression model, accepts a list of tuples of independent variables as its input argument. After execution, it returns an array of predicted values for the dependent variable. You can observe this in the following example.
from sklearn.linear_model import LinearRegression
import numpy

# Training data: one tuple of (weight, radius) per entry, heights as a column vector
weights=[30,40,50,60,70]
radii=[5,7.8,9.9,12.7,14.6]
inputs=list(zip(weights,radii))
heights=numpy.array([100,123,155,178,221]).reshape(-1, 1)
# Create and train the linear regression model
regression_model=LinearRegression()
regression_model.fit(inputs,heights)
# Predict heights for new (weight, radius) pairs
input_values=[(80,16),(90,18.5)]
print("The input values are:",input_values)
output_values=regression_model.predict(input_values)
print("The output values are:",output_values)
The input values are: [(80, 16), (90, 18.5)]
The output values are: [[270.72876712]
 [298.50958904]]
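For a linear regression model, the predicted values are simply the regression equation evaluated at the new inputs. The following sketch reproduces the predictions by hand, using the coefficients and intercept printed earlier in this article. The helper name predict_by_hand is only illustrative.

```python
# Coefficients and intercept as printed by the trained model above.
A1, A2 = 8.10913242, -21.32420091
A0 = -36.81461187


def predict_by_hand(weight, radius):
    # height = A0 + A1*weight + A2*radius
    return A0 + A1 * weight + A2 * radius


input_values = [(80, 16), (90, 18.5)]
manual_outputs = [predict_by_hand(w, r) for w, r in input_values]
print("Manually computed values:", manual_outputs)
```

Up to rounding in the printed coefficients, these values agree with the output of the predict() method shown above.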
Suggestions to Build a Better Model for Multiple Regression Analysis
Implementing multiple regression analysis using a linear regression model is simple. However, it has its limitations and drawbacks. Let us discuss some of the limitations and possible solutions that will help you build better machine learning applications using regression.
- Linear regression techniques work with numeric data. If your data contains categorical attributes, you must encode them as numbers during data cleaning.
- The linear regression algorithm assumes that all the independent variables are linearly related to the dependent variable. If your dataset contains independent attributes that are not linearly related to the dependent variable, the machine learning model will not give accurate results.
- To verify whether an independent variable is linearly related to the dependent variable, you can use scatter plots.
- Linear regression also assumes that the independent variables aren’t highly correlated with each other. If they are, the coefficient estimates become unstable and the model will not give accurate results.
- During data preprocessing, you must check whether any of the independent variables are highly correlated with each other. If there is a pair of highly correlated variables, you can choose to drop one of the attributes from the dataset.
- After implementing the regression model, you should verify that the errors are normally distributed around the best-fit regression line. Additionally, the variance of the errors should remain constant along the entire regression line.
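The correlation check suggested above can be sketched with NumPy's corrcoef() function. This is a minimal illustration on the pillar dataset from this article, not a complete diagnostic procedure. How strong a correlation is "too high" depends on your application, so no cutoff is assumed here.

```python
import numpy as np

weights = np.array([30, 40, 50, 60, 70])
radii = np.array([5, 7.8, 9.9, 12.7, 14.6])
heights = np.array([100, 123, 155, 178, 221])

# Each row is one variable; entry [i, j] is the Pearson correlation
# between variable i and variable j (order: weight, radius, height).
correlations = np.corrcoef([weights, radii, heights])
print(correlations.round(3))

# Correlation between the two independent variables.
weight_radius_corr = correlations[0, 1]
print("weight vs. radius:", round(weight_radius_corr, 3))
```

In this small dataset, weight and radius turn out to be very strongly correlated with each other, which is exactly the situation the suggestions above warn about. On real data, this would be a reason to consider dropping one of the two attributes.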
In this article, we have discussed the basics and examples of multiple regression analysis. We also discussed the implementation of multiple regression analysis using the sklearn module in Python.