KNN Regression Numerical Example
The K-Nearest Neighbors (KNN) algorithm is one of the easiest ways to implement regression and classification in machine learning. This article discusses the basics of K-Nearest Neighbors regression, its applications, advantages, and disadvantages. It also walks through a numerical example of the KNN regression algorithm.
What is Regression?
Regression in machine learning is a statistical method used to model the relationship between a dependent variable and one or more independent variables. We use regression to develop a mathematical equation that can be used to predict the value of the dependent variable based on the values of the independent variables.
There are different types of regression algorithms, such as linear regression, multiple regression, polynomial regression, and logistic regression. In all these algorithms, the basic idea is to find a function that maps the independent variables to the dependent variable.
Now, let us discuss K-Nearest Neighbors (KNN) regression to understand the concepts.
What is KNN Regression?
KNN (K-Nearest Neighbors) regression is a type of instance-based learning, also called non-parametric regression. Instead of learning an explicit model, it stores the training data and uses it directly to make predictions for new data points. The KNN algorithm makes no assumptions about the training dataset.
It is a simple and straightforward algorithm, and the same nearest-neighbor idea can be used for both regression and classification problems. In KNN regression, we predict the value of the dependent variable for a data point as the mean of the target values of its K nearest neighbors in the training data. The value of K is a user-defined hyperparameter that determines the number of nearest neighbors used to make the prediction.
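Written as a formula, the KNN prediction for a query point $x$ is the mean of the target values of its $K$ nearest training points:

$$\hat{y}(x) = \frac{1}{K} \sum_{i \in N_K(x)} y_i,$$

where $N_K(x)$ is the set of indices of the $K$ training points closest to $x$.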
KNN regression is different from linear or polynomial regression as it does not make any assumptions about the underlying relationship between the features and the target variables. For instance, linear regression and multiple regression work on the assumption that the dependent variables and the independent variables in a dataset are linearly related. The KNN regression algorithm doesn’t make any such assumptions. Instead, it relies on the patterns in the data to make predictions.
The K-Nearest Neighbors algorithm is a flexible approach to regression. It works well for small and medium-sized datasets, although the prediction cost grows with the size of the training data (see the disadvantages below). We can also use the nearest-neighbors approach for both continuous and categorical target variables (the latter is KNN classification). However, it is important to choose the right value of K, as a small value can lead to overfitting and a large value can lead to underfitting.
Now that we have an idea of what the K-Nearest Neighbors regression algorithm looks like, let us go through the algorithm step by step.
KNN Regression Algorithm
The KNN regression algorithm can be broken down into the following steps:
- Choose a value for K: We first choose a value for K. This determines the number of nearest neighbors used to make the prediction.
- Calculate the distance: After choosing K, we calculate the distance between each data point in the training set and the target data point for which a prediction is being made. For this, we can use a variety of distance metrics, including Euclidean distance, Manhattan distance, or Minkowski distance.
- Find the K nearest neighbors: After calculating the distance between the existing data points and the new data point, we identify K nearest neighbors by selecting the K data points nearest to the new data point.
- Calculate the prediction: After finding the neighbors, we calculate the value of the dependent variable for the new data point. For this, we take the average of the target values of the K nearest neighbors.
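Before working through the numbers, here is a minimal from-scratch sketch of these four steps in Python. This is a sketch only, using NumPy; the function and variable names are illustrative:

```python
import numpy as np

def knn_regress(X_train, y_train, x_new, k=3):
    """Predict the target for x_new as the mean target of its k nearest neighbors."""
    # Step 2: Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Step 4: average the target values of the k nearest neighbors
    return y_train[nearest].mean()

# Example usage with a tiny dataset (choosing K = 3 is Step 1)
X_train = np.array([[7, 9], [5, 8], [9, 7], [10, 15]])
y_train = np.array([33, 30, 32, 45])
print(knn_regress(X_train, y_train, np.array([7, 8])))  # 31.666...
```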
Hence, after executing the above steps, we can find the value of the dependent variable for any new data point. Let us now work through a numerical example to understand the KNN regression algorithm in a better way.
KNN Regression Numerical Example
To discuss a numerical example of K-Nearest Neighbors regression, we will use the following dataset.
| Length | Weight | Cost |
|--------|--------|------|
| 10 | 15 | 45 |
| 11 | 6 | 37 |
| 12 | 14 | 48 |
| 7 | 9 | 33 |
| 9 | 14 | 38 |
| 8 | 12 | 40 |
| 6 | 11 | 35 |
| 15 | 10 | 50 |
| 14 | 8 | 46 |
| 7 | 12 | 35 |
| 10 | 6 | 36 |
| 13 | 8 | 44 |
| 9 | 7 | 32 |
| 5 | 8 | 30 |
| 5 | 10 | 30 |
In the above dataset, we have 15 data points. The dataset contains the length and weight of metal rods along with their cost. Now, suppose that we want to calculate the cost for a rod with a length of 7 and a weight of 8. For this, we will use the following steps.
First, we will decide on the value of K. We will take 3 as the number of closest neighbors used to decide the cost of the input data point.
Next, we calculate the distance of the new data point, i.e. (7, 8), to all the existing points in the dataset. Here, we will use the Euclidean distance measure. The distances (rounded to two decimal places) are tabulated below.
| Point | Distance from (7, 8) |
|-------|----------------------|
| (10, 15) | 7.62 |
| (11, 6) | 4.47 |
| (12, 14) | 7.81 |
| (7, 9) | 1.00 |
| (9, 14) | 6.32 |
| (8, 12) | 4.12 |
| (6, 11) | 3.16 |
| (15, 10) | 8.25 |
| (14, 8) | 7.00 |
| (7, 12) | 4.00 |
| (10, 6) | 3.61 |
| (13, 8) | 6.00 |
| (9, 7) | 2.24 |
| (5, 8) | 2.00 |
| (5, 10) | 2.83 |
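As a side note, this table can be reproduced in a single call with SciPy's cdist function, assuming SciPy is available:

```python
from scipy.spatial.distance import cdist

# The 15 (length, weight) points from the dataset above
points = [[10, 15], [11, 6], [12, 14], [7, 9], [9, 14], [8, 12], [6, 11],
          [15, 10], [14, 8], [7, 12], [10, 6], [13, 8], [9, 7], [5, 8], [5, 10]]

# Euclidean distance from every point to (7, 8), computed in one call
print(cdist(points, [[7, 8]]).round(2))
```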
Now, we have found the distance of all the data points in the dataset from the point (7, 8). Next, we have to find the three closest points in the dataset. For this, we will sort the points according to their distances from (7, 8). The result is tabulated below.
| Point | Distance from (7, 8) |
|-------|----------------------|
| (7, 9) | 1.00 |
| (5, 8) | 2.00 |
| (9, 7) | 2.24 |
| (5, 10) | 2.83 |
| (6, 11) | 3.16 |
| (10, 6) | 3.61 |
| (7, 12) | 4.00 |
| (8, 12) | 4.12 |
| (11, 6) | 4.47 |
| (13, 8) | 6.00 |
| (9, 14) | 6.32 |
| (14, 8) | 7.00 |
| (10, 15) | 7.62 |
| (12, 14) | 7.81 |
| (15, 10) | 8.25 |
In the above table, you can observe that the three points closest to (7, 8) are (7, 9), (5, 8), and (9, 7). These points have costs of 33, 30, and 32.
To calculate the cost of a rod with length 7 and weight 8, we take the average of these costs: (33 + 30 + 32) / 3 = 31.67. Hence, the predicted cost for (7, 8) is 31.67.
Thus, we have found the cost of the rod using the given dataset and the KNN regression algorithm in this numerical example.
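If you would like to verify this result programmatically, a short sketch using scikit-learn's KNeighborsRegressor reproduces the same prediction:

```python
from sklearn.neighbors import KNeighborsRegressor

# Length/weight features and cost targets from the dataset above
X = [[10, 15], [11, 6], [12, 14], [7, 9], [9, 14], [8, 12], [6, 11],
     [15, 10], [14, 8], [7, 12], [10, 6], [13, 8], [9, 7], [5, 8], [5, 10]]
y = [45, 37, 48, 33, 38, 40, 35, 50, 46, 35, 36, 44, 32, 30, 30]

model = KNeighborsRegressor(n_neighbors=3)  # K = 3; Euclidean distance by default
model.fit(X, y)
print(model.predict([[7, 8]]))     # [31.666...] == (33 + 30 + 32) / 3
print(model.kneighbors([[7, 8]]))  # distances and indices of the 3 nearest points
```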
Now, we will discuss some of the applications, advantages, and disadvantages of the K-Nearest Neighbors (KNN) regression algorithm.
Applications of K-Nearest Neighbors Regression
KNN regression is a simple yet powerful machine-learning algorithm that has a wide range of applications. Some of the areas where KNN regression is commonly used are as follows.
- Real-time prediction: KNN regression requires no training phase, so new observations can be incorporated immediately. For small to medium-sized datasets, this makes it suitable for applications such as stock price prediction, weather forecasting, and demand forecasting.
- Recommender systems: You can use KNN regression for recommender systems, such as movie and product recommenders. In these applications, the KNN algorithm is used to identify the nearest neighbors in the training set for a given user, and the prediction is made based on the preferences of the nearest neighbors.
- Customer segmentation: K-Nearest Neighbors regression can be a great tool for customer segmentation. Based on features such as recency, frequency, average monetary value, and variety, each customer can be assigned a score. The scores and features of the customers in the dataset can then be used to find the score for new customers.
- Medical diagnosis: KNN regression can be used for medical diagnosis. Based on the metrics in a patient's test reports, it can estimate the likelihood of a disease by matching them against historical data. Similarly, we can use KNN regression to find similar protein structures based on their chemical behavior. This makes the KNN regression algorithm useful for applications such as disease diagnosis and drug discovery.
- Quality control: KNN regression can be used for quality control, as it is able to identify patterns in the data that are relevant for quality control. Based on the features of the goods, it can easily identify the quality by looking for similar goods data in the dataset. This makes it useful for applications such as process monitoring and quality inspection.
Advantages of KNN Regression
The K-nearest neighbors algorithm has various advantages as discussed below.
- Simple to understand and implement: KNN regression is one of the simplest machine learning algorithms. It is easy to understand and implement, making it accessible to both practitioners and researchers.
- No assumptions about data distribution: Unlike many other regression algorithms, KNN regression does not make any assumptions about the distribution of the data. This makes it suitable for a wide range of datasets, including those with complex or non-linear relationships between features and target variables.
- Handles noisy data well: because predictions average the target values of K neighbors, individual outliers and extreme values have less influence on the result, provided K is not too small.
- Versatile: the nearest-neighbors approach can be used for both regression and classification problems and can handle both continuous and categorical target variables.
- Can be used for online learning: KNN regression can be updated incrementally as new data becomes available, since adding new data points to the training set requires no retraining step.
- Can work well with small datasets: Unlike some other algorithms that require a large amount of data to work well, KNN regression can still produce good results with small datasets, as long as the data is representative of the problem space.
Disadvantages of KNN Regression
Apart from its advantages, the KNN regression algorithm also has many disadvantages. Some of them are discussed below.
- Computational cost: To make a prediction, the KNN algorithm needs to compute the distance between the new data point and every data point in the training set. The computational cost therefore keeps increasing as the data size increases, so KNN might not be efficient for use cases with large datasets.
- High memory usage: KNN regression requires a lot of memory to store the entire training dataset, which can be a problem for very large datasets.
- Hyperparameter sensitivity: The performance of KNN regression is highly dependent on the choice of the hyperparameter K, which determines the number of nearest neighbors used to make the prediction. Choosing the wrong value of K can result in overfitting or underfitting (a sketch for choosing K by cross-validation follows this list).
- Sensitivity to irrelevant features: KNN regression is sensitive to irrelevant features, as they can have a large impact on the distance metric used to identify the nearest neighbors. This can result in poor performance if the features are not carefully pre-processed.
- Non-parametric nature: Unlike other regression algorithms, KNN regression does not provide a model that can be used to make predictions for new data points; every prediction recomputes distances over the entire training set. This can make it more difficult to interpret the results and understand the relationships between the features and target variables.
- Not suitable for large datasets with many features: KNN regression can become computationally infeasible for datasets with a large number of features and data points. In these cases, you can use algorithms like multiple regression or polynomial regression.
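As a partial remedy for the hyperparameter sensitivity noted above, the value of K is commonly chosen by cross-validation. Below is a minimal sketch using scikit-learn's GridSearchCV with the rod dataset from the numerical example; the candidate K values are illustrative:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Rod dataset from the numerical example above
X = [[10, 15], [11, 6], [12, 14], [7, 9], [9, 14], [8, 12], [6, 11],
     [15, 10], [14, 8], [7, 12], [10, 6], [13, 8], [9, 7], [5, 8], [5, 10]]
y = [45, 37, 48, 33, 38, 40, 35, 50, 46, 35, 36, 44, 32, 30, 30]

# Try several candidate values of K and keep the best cross-validated one
search = GridSearchCV(KNeighborsRegressor(), {"n_neighbors": [1, 3, 5, 7]}, cv=5)
search.fit(X, y)
print(search.best_params_)
```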
Conclusion
In this article, we have discussed the basics of the K-nearest neighbors regression algorithm. We have also used a numerical example to understand the KNN regression algorithm in a better manner. Finally, we also discussed the applications, advantages, and disadvantages of the K-Nearest neighbors regression algorithm.
To learn more about machine learning techniques, you can read this article on KNN classification using the sklearn module in Python. You might also like this article on K-Means clustering in Python.