KNN Classification From Scratch in Python

The k-Nearest Neighbors classification algorithm is one of the most useful but simplest algorithms to implement. In this article, we will implement KNN classification from scratch in Python.

What is the K-Nearset Neighbors Classification Algorithm?

K-Nearest Neighbors (KNN) is a popular classification algorithm in machine learning that belongs to the family of instance-based learning or lazy learning algorithms. KNN is a simple, non-parametric, and easy-to-understand algorithm that is often used for solving classification problems in machine learning.

In the KNN algorithm, the classification of a new instance is based on the majority class of its K nearest neighbors in the training data. K is a hyperparameter that represents the number of nearest neighbors to consider. The KNN algorithm works as follows:

  1. Calculate the distance between the new instance and all the instances in the training data using a distance metric such as Euclidean distance, Manhattan distance, or Minkowski distance.
  2. Select the K data points from the training dataset that are closest to the new data point based on the calculated distance.
  3. Classify the new data point by assigning it to the majority class among its K nearest neighbors. In the case of a tie, the class is randomly assigned.
  4. Repeat steps 1-3 for all new data points that we need to classify.

KNN is a simple and intuitive algorithm that works well on small to medium-sized datasets with a small number of classes. However, it can be computationally expensive for large datasets and high-dimensional feature spaces. In addition, the KNN algorithm requires the use of a distance metric, which may not be suitable for all types of data.

To learn the algorithm, advantages, and disadvantages of KNN classification, you can read this article on KNN classification numerical example. You can also read this article on the implementation of the KNN classification using the sklearn module in python

Let us now implement the KNN classification algorithm from scratch in python.

How to Implement KNN classification from scratch in Python?

To implement the KNN classification algorithm from scratch in python, we will use the following steps.

  • First, we will load the training dataset into the program and separate the features and class labels.
  • Next, we will calculate the distance between the new data point and all the existing data points in the training data set.
  • After this, we will select the k nearest neighbors of the new data point in the training set.
  • Finally, we will find the class label for the new data point using the majority of the class labels of k nearest neighbors.

Let us implement each step one by one.

Load the Training Dataset

For this example, we will use the following dataset.

PointX_CoordinateY_CoordinateClass Label
A1210C2
A226C1
A31111C3
A469C2
A565C1
A612C1
A7510C2
A849C2
A91012C3
A1075C1
A11911C3
A1246C1
A13310C2
A1538C2
A15611C2
Dataset for KNN classification

This dataset contains 15 data points with their coordinates and class labels. Now, let us predict the class label for a new data point (5, 7) by implementing KNN classification from scratch in python.

We have saved the training dataset in the following CSV file.

To load the dataset into our program, we will use the read_csv() function defined in the pandas module. The read_csv() function takes the name of the CSV file as its input argument and returns a data frame as shown below.

import pandas as pd
training_data=pd.read_csv("KNN_Dataset.csv")
print("The training data is:")
print(training_data)

Output:

The training data is:
   Point  X_Coordinate  Y_Coordinate Class Label
0     A1             2            10          C2
1     A2             2             6          C1
2     A3            11            11          C3
3     A4             6             9          C2
4     A5             6             5          C1
5     A6             1             2          C1
6     A7             5            10          C2
7     A8             4             9          C2
8     A9            10            12          C3
9    A10             7             5          C1
10   A11             9            11          C3
11   A12             4             6          C1
12   A13             3            10          C2
13   A15             3             8          C2
14   A15             6            11          C2

In the above dataset, the name of the points is not a data attribute but an identifier for the points. So, we will convert the Point column of the dataframe to an index. For this, we will use the set_index() method. The set_index() method, when invoked on a dataframe, takes a column name as its input argument. After execution, it converts the column into an index column and returns the modified dataframe. You can observe this in the following example.

training_data=training_data.set_index("Point")
print("The training data is:")
print(training_data)

Output:

The training data is:
       X_Coordinate  Y_Coordinate Class Label
Point                                        
A1                2            10          C2
A2                2             6          C1
A3               11            11          C3
A4                6             9          C2
A5                6             5          C1
A6                1             2          C1
A7                5            10          C2
A8                4             9          C2
A9               10            12          C3
A10               7             5          C1
A11               9            11          C3
A12               4             6          C1
A13               3            10          C2
A15               3             8          C2
A15               6            11          C2

The above dataset is pretty simple and doesn’t require any data cleaning. If your data set contains missing values or any other abnormalities, you should perform data preprocessing first to make sure that the data is suitable for classification tasks.

Now, we will extract the attributes columns and the class labels from the training dataset as shown below.

data_points=training_data[["X_Coordinate","Y_Coordinate"]].applymap(int)
print("The data points are:")
print(data_points)
class_labels=training_data["Class Label"]
print("The class labels are:")
print(class_labels)

Output:

The data points are:
       X_Coordinate  Y_Coordinate
Point                            
A1                2            10
A2                2             6
A3               11            11
A4                6             9
A5                6             5
A6                1             2
A7                5            10
A8                4             9
A9               10            12
A10               7             5
A11               9            11
A12               4             6
A13               3            10
A15               3             8
A15               6            11
The class labels are:
Point
A1     C2
A2     C1
A3     C3
A4     C2
A5     C1
A6     C1
A7     C2
A8     C2
A9     C3
A10    C1
A11    C3
A12    C1
A13    C2
A15    C2
A15    C2
Name: Class Label, dtype: object

Calculate the Distance Between the New Data Point and the Points in the Training Data

Now, we will calculate the distance between the new data point and the existing data points in the training data. For this, we will define a function that takes the new data point and the training dataset as its input argument and executes the following steps.

  • First, we will define an empty list named “distances” to store the distance of the new data point from existing data points.
  • Next, we will iterate through the rows of the training data using the iterrows() function. While iterating, we will extract the coordinates of the data points.
  • Once we get the coordinates of the data points in the training data, we will calculate the distance between the data point in the training data and the new data point. For this, we will use the dist() function defined in the math module. The dist() function takes two data points as its input arguments and returns the euclidean distance between them.
  • We will save the distances in the list “distances” and return this list from the function.

Once we get the list of distances, we will insert it into the training dataset as a column. You can observe this in the following code.

import math
new_data_point=(5,7)
def distance_calculation(data_points,new_data_point):
    distances=list()
    for index,row in data_points.iterrows():
        point=(row["X_Coordinate"],row["Y_Coordinate"])
        distance=math.dist(point, new_data_point)
        distances.append(distance)
    return distances
distance_list=distance_calculation(data_points,new_data_point)
training_data["distances"]=distance_list
print("The dataset with distances from new data point is:")
print(training_data)

Output:

The dataset with distances from new data point is:
       X_Coordinate  Y_Coordinate Class Label  distances
Point                                                   
A1                2            10          C2   4.242641
A2                2             6          C1   3.162278
A3               11            11          C3   7.211103
A4                6             9          C2   2.236068
A5                6             5          C1   2.236068
A6                1             2          C1   6.403124
A7                5            10          C2   3.000000
A8                4             9          C2   2.236068
A9               10            12          C3   7.071068
A10               7             5          C1   2.828427
A11               9            11          C3   5.656854
A12               4             6          C1   1.414214
A13               3            10          C2   3.605551
A15               3             8          C2   2.236068
A15               6            11          C2   4.123106

Select K Nearest Neighbors of The New Data Point

In the next step to implement KNN classification from scratch in python, we will find the k nearest neighbors of the new data point. As we have already calculated the distance between the data points in the training data and the new data point, we will sort the training data according to their distance from the new data point.

To sort the training data, we will use the sort_values() method. The sort_values() method, when invoked on the training dataframe, takes the take the name of distance column as its input argument and returns the sorted dataframe by distance. 

After sorting, we will select the k-nearest data points from the sorted data using the iloc attribute of the dataframe. You can observe this in the following code.

training_data=training_data.sort_values(by="distances")
print("The dataset sorted by distance from new data point is:")
print(training_data)
k=3
nearest_data_points=training_data.iloc[0:k]
print("The nearest data points are:")
print(nearest_data_points)

Output:

The dataset sorted by distance from new data point is:
       X_Coordinate  Y_Coordinate Class Label  distances
Point                                                   
A12               4             6          C1   1.414214
A4                6             9          C2   2.236068
A5                6             5          C1   2.236068
A8                4             9          C2   2.236068
A15               3             8          C2   2.236068
A10               7             5          C1   2.828427
A7                5            10          C2   3.000000
A2                2             6          C1   3.162278
A13               3            10          C2   3.605551
A15               6            11          C2   4.123106
A1                2            10          C2   4.242641
A11               9            11          C3   5.656854
A6                1             2          C1   6.403124
A9               10            12          C3   7.071068
A3               11            11          C3   7.211103
The nearest data points are:
       X_Coordinate  Y_Coordinate Class Label  distances
Point                                                   
A12               4             6          C1   1.414214
A4                6             9          C2   2.236068
A5                6             5          C1   2.236068

Find The Class Label of The New Data Point

In the final step to implement the KNN classification algorithm from scratch in python, we have to find the class label of the new data point. For this, we will select the class labels of the k-nearest data points. Then, we will find the mode of the class labels. For this, we will use the mode() function defined in the statistics module.

The mode() function takes the class labels of the k nearest data points and returns the mode i.e. the most frequent class label. We will assign the class label returned by the mode() function to the new data point. 

You can observe this in the following example.

neighbor_class_labels=nearest_data_points["Class Label"]
print("The k nearest class labels are")
print(neighbor_class_labels)
import statistics
predicted_class_label=statistics.mode(neighbor_class_labels)
print("Predicted class label for new data point is:")
print(predicted_class_label)

Output:

The k nearest class labels are
Point
A12    C1
A4     C2
A5     C1
Name: Class Label, dtype: object
Predicted class label for new data point is:
C1

Hence, point (5, 7) is assigned the class C1 after implementing the KNN classification algorithm from scratch in Python.

Conclusion

In this article, we have discussed the implementation of the KNN classification algorithm from scratch in Python. To learn more about machine learning algorithms, you can read this article on k-means clustering numerical example. You might also like this article on KNN regression using the sklearn module in Python.

I hope you enjoyed reading this article. Stay tuned for more informative articles.

Happy Learning!

Similar Posts