KNN Regression Using sklearn Module in Python

K-Nearest Neighbors algorithms are used in classification as well as regression. This article discusses the implementation of the KNN regression algorithm using the sklearn module in Python.

What is KNN Regression?

KNN regression, also known as k-nearest neighbors regression, is a non-parametric machine learning algorithm used for regression tasks.

The K-Nearest Neighbors regression algorithm predicts the value of a target variable for a new observation by finding the k-nearest observations in the training data set and calculating the average of their target variable values. Here, the number k is a hyperparameter that the user must choose. It determines how many neighbors to consider when making a prediction.

The KNN Regression algorithm does not make any assumptions about the underlying distribution of the data and can work well with non-linear and noisy data. However, it can be computationally expensive for large datasets, and the choice of k can have a significant impact on the model’s performance. In general, a smaller value of k will lead to a more complex model with high variance, while a larger value of k will lead to a simpler model with high bias.
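To make the effect of k concrete, the following is a minimal sketch that compares several values of k using 5-fold cross-validation. The data is a synthetic noisy sine curve generated only for illustration (it is an assumption, not a dataset from this article), and the candidate values of k are arbitrary.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data (an assumption for illustration): a noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=100)

# Mean 5-fold cross-validated R^2 score for each candidate value of k
scores = {}
for k in (1, 3, 5, 10, 20, 40):
    model = KNeighborsRegressor(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"k={k:2d}  mean R^2={scores[k]:.3f}")

# Pick the k with the highest mean score
best_k = max(scores, key=scores.get)
print("Best k by cross-validation:", best_k)
```

On data like this, very small and very large values of k would typically score worse than a moderate value, but the best k depends on the dataset, so it is worth checking rather than assuming.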

K-Nearest Neighbors Regression Algorithm

The KNN regression algorithm can be implemented using the following steps.

  1. Choose the value of k: The first step in KNN regression is to choose the number of nearest neighbors, k, to use for the prediction. This value can be determined through cross-validation or other techniques.
  2. Calculate the distance of new data point to existing data points: For each new data point that needs to be predicted, calculate the distance between the new data point and all the other data points in the training set using a distance metric such as Euclidean distance.
  3. Find the k-nearest neighbors: Next, we select the k data points with the shortest distance to the new data point. These data points are the k-nearest neighbors.
  4. Calculate the average of the target variables of the neighbors: Take the average of the target variable values of the k-nearest neighbors. This value is the predicted target variable value for the new data point.
  5. Repeat: Repeat steps 2-4 for all new data points that need to be predicted.
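The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the way sklearn implements the algorithm: the function name knn_regress and the toy data are made up for this example, and k is simply passed in rather than tuned.

```python
import math

def knn_regress(train_points, train_targets, query, k):
    """Predict the target for `query` as the mean target of its k nearest training points."""
    # Calculate the Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), target)
        for point, target in zip(train_points, train_targets)
    ]
    # Keep the k (distance, target) pairs with the smallest distance
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Average the target values of the k nearest neighbors
    return sum(target for _, target in nearest) / k

# Toy data, made up for illustration
points = [(1, 1), (2, 2), (3, 3), (8, 8), (9, 9)]
targets = [10, 20, 30, 80, 90]

# Repeat the prediction for every query point
queries = [(2, 3), (8, 9)]
predictions = [knn_regress(points, targets, q, k=3) for q in queries]
print(predictions)
```

For the query (2, 3), the three nearest points are (2, 2), (3, 3), and (1, 1), so the prediction is the mean of 20, 30, and 10, which is 20.0.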

To learn more about the KNN regression algorithm, its applications, advantages, and disadvantages, you can read this article on KNN regression numerical example. This article discusses a step-by-step numerical example for K-Nearest regression that will help you understand this algorithm in a better way.

The KNeighborsRegressor() Function

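We will use the KNeighborsRegressor() function to implement KNN regression using the sklearn module in Python. Strictly speaking, KNeighborsRegressor is a class and calling it creates a model object, but this article refers to it as a function for simplicity. It has the following syntax.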

sklearn.neighbors.KNeighborsRegressor(n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None)

Here, 

  • The n_neighbors parameter takes the number of neighbors that we use to evaluate the dependent variable of a new value. By default, it is 5.
  • The weights parameter is used to specify the weightage of neighbors. By default, it is set to “uniform”. This means that all the neighbors are given equal weightage. 
    • You can set the weights parameter to ‘distance’ to weight points by the inverse of their distance. In this case, closer neighbors of a query point will have a greater influence than neighbors further away.
    • You can also specify a user-defined function that accepts an array of distances, and returns an array of the same shape containing the weights of the neighbors.
  • The algorithm parameter is used to specify the procedure to compute the nearest neighbors. By default, it is set to ‘auto’. This means that the KNeighborsRegressor() function will attempt to decide the most appropriate algorithm based on the values passed to the fit() method. You can also set it to ‘ball_tree’, ‘kd_tree’, or ‘brute’ to choose the algorithm yourself.
  • The metric parameter is used to specify the distance metric to compute distances between the data points. By default, it is set to “minkowski”. You can also specify other distance metrics such as “euclidean”, “manhattan”, and others. 
    • If you want to use a custom distance metric, you can pass a function to the metric parameter. The function must take two data points as input arguments and return the distance between them.
    • You can also set the metric parameter to “precomputed”. After this, you need to pass a precomputed distance matrix instead of the training dataset to the fit() method while training the KNN regression model.
  • The n_jobs parameter is used to specify the number of parallel jobs to run for neighbor search each time we want to predict a value. By default, it is set to None, which means that a single job is used unless the code runs inside a joblib parallel_backend context. If you want to run the neighbor search in parallel using all the processors in your system, you can set the n_jobs parameter to -1. 
  • The parameter p is used to specify the power parameter for the Minkowski metric. When p is set to 1, this is equivalent to using manhattan_distance (l1). When we set p=2, which is its default value, the Minkowski metric works as the euclidean distance metric.
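To see how the weights parameter changes a prediction, here is a small sketch that fits two models on the same data, one with uniform weights and one with distance weights. The one-dimensional data and the query point are made up for illustration.

```python
from sklearn.neighbors import KNeighborsRegressor

# Toy one-dimensional data, made up for illustration
X = [[0], [1], [3]]
y = [0.0, 10.0, 30.0]

uniform_model = KNeighborsRegressor(n_neighbors=2, weights="uniform").fit(X, y)
distance_model = KNeighborsRegressor(n_neighbors=2, weights="distance").fit(X, y)

# The two nearest neighbors of 0.25 are 0 (distance 0.25) and 1 (distance 0.75)
query = [[0.25]]
uniform_prediction = uniform_model.predict(query)
distance_prediction = distance_model.predict(query)
print("uniform: ", uniform_prediction)   # plain mean of the two neighbor targets
print("distance:", distance_prediction)  # the closer neighbor counts more
```

With uniform weights, the prediction is the plain mean of 0 and 10, which is 5. With distance weights, the neighbors are weighted by 1/0.25 and 1/0.75, so the prediction moves toward the closer neighbor and becomes 2.5.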

After execution, the KNeighborsRegressor() function returns an untrained K-Nearest Neighbors machine learning model. We can train this model and use it for KNN regression. 
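The two less common options for the metric parameter described above, a user-defined function and “precomputed”, can be sketched as follows. The helper name manhattan and the toy arrays are made up for this example. The precomputed matrices are built here with the pairwise_distances() function from sklearn.metrics, but any method that produces the same distance matrices would work.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import KNeighborsRegressor

# Toy data, made up for illustration
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]])
y_train = np.array([0.0, 10.0, 40.0, 50.0])
X_query = np.array([[0.5, 0.0]])

# Option 1: a user-defined metric that takes two points and returns a distance
def manhattan(a, b):
    return np.abs(a - b).sum()

custom_model = KNeighborsRegressor(n_neighbors=2, metric=manhattan)
custom_model.fit(X_train, y_train)
custom_prediction = custom_model.predict(X_query)

# Option 2: precomputed distances
# fit() takes the (n_train, n_train) matrix of training distances,
# and predict() takes the (n_query, n_train) matrix of query-to-training distances
D_train = pairwise_distances(X_train, metric="manhattan")
D_query = pairwise_distances(X_query, X_train, metric="manhattan")
precomputed_model = KNeighborsRegressor(n_neighbors=2, metric="precomputed")
precomputed_model.fit(D_train, y_train)
precomputed_prediction = precomputed_model.predict(D_query)

print(custom_prediction)
print(precomputed_prediction)
```

Both models use the same distances, so they select the same two neighbors, (0, 0) and (1, 1), and predict the mean of their targets, 5.0. A Python function passed as the metric is usually much slower than a built-in metric name, so it is best kept for cases where no built-in metric fits.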

KNN Regression Using the sklearn Module in Python

Now that we have discussed the algorithm and the KNeighborsRegressor() function, let us implement the KNN regression algorithm using the sklearn module in Python. 

To implement the algorithm, we will use the following dataset.

Length  Weight  Cost
10      15      45
11      6       37
12      14      48
7       9       33
9       14      38
8       12      40
6       11      35
15      10      50
14      8       46
7       12      35
10      6       36
13      8       44
9       7       32
5       8       30
5       10      30

Dataset for KNN Regression

The dataset contains the length, weight, and cost of rods of a given metal. Here, length and weight are the independent variables, while cost is the target variable. Using the sklearn module in Python, we will implement KNN regression to find the cost of a rod with length 7 and weight 8.

Steps to Implement KNN Regression Using the sklearn Module in Python

To implement the KNN regression algorithm using the sklearn module in Python, we will use the following steps. 

  • First, we will create an array of data points with their independent variables. Let’s name it data_points. We will also create another array containing dependent variables for the data points at the corresponding position in the data_points array. Let’s name this array target_values. In this example, the data set is pretty simple. In large datasets, you need to perform data preprocessing to make the data suitable for the regression model.
  • Next, we will create an untrained KNN regression model using the KNeighborsRegressor() function. Here we will specify the number of neighbors as 3 and the distance metric as euclidean. The KNeighborsRegressor() function will return an untrained machine-learning model after execution.
  • Now, we will use the fit() method to train the untrained model. The fit() method takes the array of data points i.e. data_points as its first input argument and the array of target values i.e. target_values as its second input argument. After execution, the fit() method returns the trained KNN regression model. For datasets having multiple attributes, you can also pass the dataframe containing independent attributes as its first input argument and a series or array of target values for the data points as the second input argument. 
  • Once we get the trained KNN regression model, we will use the predict() method to predict the target value for a given query data point. The predict() method, when invoked on the trained KNN regression model, takes an array of query data points as its input argument. After execution, it returns an array of predicted target values for each data point in the input dataset. We will pass the array [(7,8)] to predict the target value for the data point (7, 8).

All the above steps are implemented in the following example.

from sklearn.neighbors import KNeighborsRegressor
#create list of data points
data_points=[(10,15),(11,6),(12,14),(7,9),(9,14),(8,12),(6,11),(15,10),(14,8),(7,12),(10,6),(13,8),(9,7),(5,8),(5,10)]
#create list of target values
target_values=[45,37,48, 33,38,40,35,50,46,35,36,44,32,30,30]
#create untrained model
untrained_model=KNeighborsRegressor(n_neighbors=3, metric="euclidean")
#train model using fit method
trained_model=untrained_model.fit(data_points,target_values)
#predict target value for a new data point
predicted_value=trained_model.predict([(7,8)])
print("The data points are:")
print(data_points)
print("The target values are:")
print(target_values)
print("The predicted value for (7,8) is:")
print(predicted_value)

Output:

The data points are:
[(10, 15), (11, 6), (12, 14), (7, 9), (9, 14), (8, 12), (6, 11), (15, 10), (14, 8), (7, 12), (10, 6), (13, 8), (9, 7), (5, 8), (5, 10)]
The target values are:
[45, 37, 48, 33, 38, 40, 35, 50, 46, 35, 36, 44, 32, 30, 30]
The predicted value for (7,8) is:
[31.66666667]

In the above output, you can observe that we have predicted a target value of approximately 31.67 for the data point (7, 8) using the KNN regression algorithm and the sklearn module in Python.
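To check where this prediction comes from, we can ask the trained model which training points it treats as the nearest neighbors of (7, 8). The sketch below uses the kneighbors() method of the model, which returns the distances to the neighbors and their positions in the training data. The variable names are chosen for this example only.

```python
from sklearn.neighbors import KNeighborsRegressor

data_points = [(10, 15), (11, 6), (12, 14), (7, 9), (9, 14), (8, 12), (6, 11), (15, 10),
               (14, 8), (7, 12), (10, 6), (13, 8), (9, 7), (5, 8), (5, 10)]
target_values = [45, 37, 48, 33, 38, 40, 35, 50, 46, 35, 36, 44, 32, 30, 30]

model = KNeighborsRegressor(n_neighbors=3, metric="euclidean")
model.fit(data_points, target_values)

# kneighbors() returns the distances and the training-set indices of the nearest neighbors
distances, indices = model.kneighbors([(7, 8)])
neighbors = [data_points[i] for i in indices[0]]
neighbor_targets = [target_values[i] for i in indices[0]]
neighbor_mean = sum(neighbor_targets) / len(neighbor_targets)

print("Nearest neighbors:", neighbors)
print("Their target values:", neighbor_targets)
print("Mean of neighbor targets:", neighbor_mean)
```

The three nearest neighbors are (7, 9), (5, 8), and (9, 7), with target values 33, 30, and 32. Their mean is 95/3, which matches the predicted value of about 31.67.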

Find the Features in a KNN Regression Model

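You can find the number of features used to train the KNN regression model using the n_features_in_ attribute of the trained machine learning model, as shown below. If the model is trained on input that has string column names, such as a pandas dataframe, you can also find the names of the attributes using the feature_names_in_ attribute of the model. This attribute is not defined when the training data is a plain list or array, as in this example.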

from sklearn.neighbors import KNeighborsRegressor
#create list of data points
data_points=[(10,15),(11,6),(12,14),(7,9),(9,14),(8,12),(6,11),(15,10),(14,8),(7,12),(10,6),(13,8),(9,7),(5,8),(5,10)]
#create list of target values
target_values=[45,37,48, 33,38,40,35,50,46,35,36,44,32,30,30]
#create untrained model
untrained_model=KNeighborsRegressor(n_neighbors=3, metric="euclidean")
#train model using fit method
trained_model=untrained_model.fit(data_points,target_values)
print("The number of features in training data is:")
print(trained_model.n_features_in_)

Output:

The number of features in training data is:
2

Find the Number of Data Points in the Training Dataset

To find the total number of data points used while training the KNN regression model, you can use the n_samples_fit_ attribute of the model as shown below.

from sklearn.neighbors import KNeighborsRegressor
#create list of data points
data_points=[(10,15),(11,6),(12,14),(7,9),(9,14),(8,12),(6,11),(15,10),(14,8),(7,12),(10,6),(13,8),(9,7),(5,8),(5,10)]
#create list of target values
target_values=[45,37,48, 33,38,40,35,50,46,35,36,44,32,30,30]
#create untrained model
untrained_model=KNeighborsRegressor(n_neighbors=3, metric="euclidean")
#train model using fit method
trained_model=untrained_model.fit(data_points,target_values)
print("The number of data points in training data is:")
print(trained_model.n_samples_fit_)

Output:

The number of data points in training data is:
15

Conclusion

In this article, we have discussed the implementation of the K-Nearest Neighbors (KNN) regression algorithm using the sklearn module in Python. To learn more topics in machine learning, you can read this article on KNN classification using the sklearn module in Python. You might also like this article on hierarchical clustering for mixed and categorical data in Python.

I hope you enjoyed reading this article. Stay tuned for more informative articles.

Happy Learning!
