KNN Classification From Scratch in Python
The k-Nearest Neighbors (KNN) classification algorithm is one of the simplest machine learning algorithms to implement, and it is still useful in practice. In this article, we will implement KNN classification from scratch in Python.
What is the K-Nearest Neighbors Classification Algorithm?
K-Nearest Neighbors (KNN) is a popular machine learning algorithm for classification that belongs to the family of instance-based, or lazy, learning algorithms. It is non-parametric: instead of fitting a model during training, it stores the training data and defers all computation until a new instance has to be classified.
In the KNN algorithm, the classification of a new instance is based on the majority class of its K nearest neighbors in the training data. K is a hyperparameter that represents the number of nearest neighbors to consider. The KNN algorithm works as follows:
- Calculate the distance between the new instance and all the instances in the training data using a distance metric such as Euclidean distance, Manhattan distance, or Minkowski distance.
- Select the K data points from the training dataset that are closest to the new data point based on the calculated distance.
- Classify the new data point by assigning it to the majority class among its K nearest neighbors. In the case of a tie, the class is randomly assigned.
- Repeat steps 1-3 for all new data points that we need to classify.
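The four steps above can be sketched in plain Python as follows. This is a minimal illustration and not a reference implementation: the function name knn_classify, the toy points, and their labels are made up for this example, and math.dist() requires Python 3.8 or later.

```python
import math
import random
from collections import Counter

def knn_classify(training_points, training_labels, new_point, k, rng=random):
    """Classify new_point by majority vote among its k nearest training points."""
    # Step 1: distance from the new point to every training point.
    distances = [math.dist(p, new_point) for p in training_points]
    # Step 2: positions of the k smallest distances.
    nearest = sorted(range(len(distances)), key=distances.__getitem__)[:k]
    # Step 3: majority vote; a tie between the top classes is broken at random.
    votes = Counter(training_labels[i] for i in nearest)
    top_count = max(votes.values())
    tied = sorted(label for label, count in votes.items() if count == top_count)
    return rng.choice(tied)

# Toy data, made up for illustration only.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = ["low", "low", "low", "high", "high", "high"]

# Step 4: repeat the procedure for every new point.
predictions = [knn_classify(points, labels, q, k=3) for q in [(2, 2), (7, 9)]]
print(predictions)  # ['low', 'high']
```

Passing rng as a parameter is only a convenience so that a seeded random.Random instance can be supplied when reproducible tie-breaking is needed.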
KNN is a simple and intuitive algorithm that works well on small to medium-sized datasets with a small number of classes. However, it can be computationally expensive for large datasets and high-dimensional feature spaces. In addition, the KNN algorithm requires the use of a distance metric, which may not be suitable for all types of data.
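To make the distance metrics mentioned earlier concrete, here is a small sketch of the Minkowski distance, which reduces to the Manhattan distance for p=1 and to the Euclidean distance for p=2. The function names and the two sample points are chosen for this illustration only.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two points of equal length."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def manhattan_distance(a, b):
    # Minkowski distance with p = 1.
    return minkowski_distance(a, b, 1)

def euclidean_distance(a, b):
    # Minkowski distance with p = 2.
    return minkowski_distance(a, b, 2)

a, b = (2, 10), (5, 7)
print(manhattan_distance(a, b))  # 6.0
print(euclidean_distance(a, b))  # approximately 4.2426
```

Which metric is appropriate depends on the data. For example, features measured on very different scales usually need to be normalized before any of these distances is meaningful.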
To learn the algorithm, advantages, and disadvantages of KNN classification, you can read this article on KNN classification numerical example. You can also read this article on the implementation of the KNN classification using the sklearn module in Python.
Let us now implement the KNN classification algorithm from scratch in Python.
How to Implement KNN Classification From Scratch in Python?
To implement the KNN classification algorithm from scratch in Python, we will use the following steps.
- First, we will load the training dataset into the program and separate the features and class labels.
- Next, we will calculate the distance between the new data point and all the existing data points in the training data set.
- After this, we will select the k nearest neighbors of the new data point in the training set.
- Finally, we will find the class label for the new data point using the majority of the class labels of k nearest neighbors.
Let us implement each step one by one.
Load the Training Dataset
For this example, we will use the following dataset.
Point | X_Coordinate | Y_Coordinate | Class Label |
A1 | 2 | 10 | C2 |
A2 | 2 | 6 | C1 |
A3 | 11 | 11 | C3 |
A4 | 6 | 9 | C2 |
A5 | 6 | 5 | C1 |
A6 | 1 | 2 | C1 |
A7 | 5 | 10 | C2 |
A8 | 4 | 9 | C2 |
A9 | 10 | 12 | C3 |
A10 | 7 | 5 | C1 |
A11 | 9 | 11 | C3 |
A12 | 4 | 6 | C1 |
A13 | 3 | 10 | C2 |
A15 | 3 | 8 | C2 |
A15 | 6 | 11 | C2 |
This dataset contains 15 data points with their coordinates and class labels. Now, let us predict the class label for a new data point (5, 7) by implementing KNN classification from scratch in Python.
We have saved the training dataset in a CSV file named KNN_Dataset.csv.
To load the dataset into our program, we will use the read_csv() function defined in the pandas module. The read_csv() function takes the name of the CSV file as its input argument and returns a dataframe, as shown below.
import pandas as pd
training_data=pd.read_csv("KNN_Dataset.csv")
print("The training data is:")
print(training_data)
Output:
The training data is:
Point X_Coordinate Y_Coordinate Class Label
0 A1 2 10 C2
1 A2 2 6 C1
2 A3 11 11 C3
3 A4 6 9 C2
4 A5 6 5 C1
5 A6 1 2 C1
6 A7 5 10 C2
7 A8 4 9 C2
8 A9 10 12 C3
9 A10 7 5 C1
10 A11 9 11 C3
11 A12 4 6 C1
12 A13 3 10 C2
13 A15 3 8 C2
14 A15 6 11 C2
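If you do not have the CSV file at hand, an equivalent dataframe can be built directly from the table above. This is only a convenience sketch: the rows are copied as they appear in the table, including the point names, and the variable name rows is introduced here for illustration.

```python
import pandas as pd

# Rows copied from the table above: (Point, X_Coordinate, Y_Coordinate, Class Label).
rows = [
    ("A1", 2, 10, "C2"), ("A2", 2, 6, "C1"), ("A3", 11, 11, "C3"),
    ("A4", 6, 9, "C2"), ("A5", 6, 5, "C1"), ("A6", 1, 2, "C1"),
    ("A7", 5, 10, "C2"), ("A8", 4, 9, "C2"), ("A9", 10, 12, "C3"),
    ("A10", 7, 5, "C1"), ("A11", 9, 11, "C3"), ("A12", 4, 6, "C1"),
    ("A13", 3, 10, "C2"), ("A15", 3, 8, "C2"), ("A15", 6, 11, "C2"),
]
training_data = pd.DataFrame(
    rows, columns=["Point", "X_Coordinate", "Y_Coordinate", "Class Label"]
)
print(training_data.shape)  # (15, 4)
```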
In the above dataset, the names of the points are not data attributes but identifiers for the points. So, we will convert the Point column of the dataframe to an index. For this, we will use the set_index() method. The set_index() method, when invoked on a dataframe, takes a column name as its input argument. After execution, it converts the column into an index column and returns the modified dataframe. You can observe this in the following example.
training_data=training_data.set_index("Point")
print("The training data is:")
print(training_data)
Output:
The training data is:
X_Coordinate Y_Coordinate Class Label
Point
A1 2 10 C2
A2 2 6 C1
A3 11 11 C3
A4 6 9 C2
A5 6 5 C1
A6 1 2 C1
A7 5 10 C2
A8 4 9 C2
A9 10 12 C3
A10 7 5 C1
A11 9 11 C3
A12 4 6 C1
A13 3 10 C2
A15 3 8 C2
A15 6 11 C2
The above dataset is pretty simple and doesn’t require any data cleaning. If your data set contains missing values or any other abnormalities, you should perform data preprocessing first to make sure that the data is suitable for classification tasks.
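As a rough sketch of such a check, the snippet below counts missing values per column and drops incomplete rows. The dataframe raw is a hypothetical example with one missing coordinate, and dropping rows is only one of several possible strategies (imputing a value is another).

```python
import pandas as pd

# Hypothetical frame with one missing coordinate, for illustration only.
raw = pd.DataFrame({
    "X_Coordinate": [2, None, 11],
    "Y_Coordinate": [10, 6, 11],
    "Class Label": ["C2", "C1", "C3"],
})

# Count the missing values in each column.
missing_counts = raw.isna().sum()
print(missing_counts)

# Simplest option: drop every row that has a missing value.
cleaned = raw.dropna()
print(len(cleaned))  # 2
```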
Now, we will extract the attribute columns and the class labels from the training dataset as shown below.
# astype(int) converts both columns to integers; applymap() is deprecated in recent pandas versions.
data_points=training_data[["X_Coordinate","Y_Coordinate"]].astype(int)
print("The data points are:")
print(data_points)
class_labels=training_data["Class Label"]
print("The class labels are:")
print(class_labels)
Output:
The data points are:
X_Coordinate Y_Coordinate
Point
A1 2 10
A2 2 6
A3 11 11
A4 6 9
A5 6 5
A6 1 2
A7 5 10
A8 4 9
A9 10 12
A10 7 5
A11 9 11
A12 4 6
A13 3 10
A15 3 8
A15 6 11
The class labels are:
Point
A1 C2
A2 C1
A3 C3
A4 C2
A5 C1
A6 C1
A7 C2
A8 C2
A9 C3
A10 C1
A11 C3
A12 C1
A13 C2
A15 C2
A15 C2
Name: Class Label, dtype: object
Calculate the Distance Between the New Data Point and the Points in the Training Data
Now, we will calculate the distance between the new data point and the existing data points in the training data. For this, we will define a function that takes the training data points and the new data point as its input arguments and executes the following steps.
- First, we will define an empty list named "distances" to store the distance of the new data point from existing data points.
- Next, we will iterate through the rows of the training data using the iterrows() method. While iterating, we will extract the coordinates of the data points.
- Once we get the coordinates of the data points in the training data, we will calculate the distance between the data point in the training data and the new data point. For this, we will use the dist() function defined in the math module. The dist() function takes two data points as its input arguments and returns the Euclidean distance between them.
- We will save the distances in the list "distances" and return this list from the function.
Once we get the list of distances, we will insert it into the training dataset as a column. You can observe this in the following code.
import math
new_data_point=(5,7)
def distance_calculation(data_points,new_data_point):
    # Collect the Euclidean distance from each training point to the new point.
    distances=list()
    for index,row in data_points.iterrows():
        point=(row["X_Coordinate"],row["Y_Coordinate"])
        distance=math.dist(point, new_data_point)
        distances.append(distance)
    return distances
distance_list=distance_calculation(data_points,new_data_point)
training_data["distances"]=distance_list
print("The dataset with distances from new data point is:")
print(training_data)
Output:
The dataset with distances from new data point is:
X_Coordinate Y_Coordinate Class Label distances
Point
A1 2 10 C2 4.242641
A2 2 6 C1 3.162278
A3 11 11 C3 7.211103
A4 6 9 C2 2.236068
A5 6 5 C1 2.236068
A6 1 2 C1 6.403124
A7 5 10 C2 3.000000
A8 4 9 C2 2.236068
A9 10 12 C3 7.071068
A10 7 5 C1 2.828427
A11 9 11 C3 5.656854
A12 4 6 C1 1.414214
A13 3 10 C2 3.605551
A15 3 8 C2 2.236068
A15 6 11 C2 4.123106
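For larger datasets, looping over rows with iterrows() can become slow. As an alternative sketch (not the approach used in this article), the same Euclidean distances can be computed in one step with NumPy broadcasting. The three-row dataframe sample below reuses points A1, A4, and A12 from the training data, and its name is chosen for this illustration.

```python
import numpy as np
import pandas as pd

# Three points taken from the training data, for illustration.
sample = pd.DataFrame(
    {"X_Coordinate": [2, 6, 4], "Y_Coordinate": [10, 9, 6]},
    index=["A1", "A4", "A12"],
)
new_point = np.array([5, 7])

# Subtract the new point from every row (broadcasting), then take row-wise norms.
differences = sample[["X_Coordinate", "Y_Coordinate"]].to_numpy() - new_point
sample["distances"] = np.linalg.norm(differences, axis=1)
print(sample)
```

The resulting values agree with the distances listed for A1, A4, and A12 in the output above.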
Select K Nearest Neighbors of The New Data Point
In the next step to implement KNN classification from scratch in Python, we will find the k nearest neighbors of the new data point. As we have already calculated the distance between the data points in the training data and the new data point, we will sort the training data according to their distance from the new data point.
To sort the training data, we will use the sort_values() method. The sort_values() method, when invoked on the training dataframe, takes the name of the distance column as its input argument and returns the dataframe sorted by distance.
After sorting, we will select the k-nearest data points from the sorted data using the iloc attribute of the dataframe. You can observe this in the following code.
training_data=training_data.sort_values(by="distances")
print("The dataset sorted by distance from new data point is:")
print(training_data)
k=3
nearest_data_points=training_data.iloc[0:k]
print("The nearest data points are:")
print(nearest_data_points)
Output:
The dataset sorted by distance from new data point is:
X_Coordinate Y_Coordinate Class Label distances
Point
A12 4 6 C1 1.414214
A4 6 9 C2 2.236068
A5 6 5 C1 2.236068
A8 4 9 C2 2.236068
A15 3 8 C2 2.236068
A10 7 5 C1 2.828427
A7 5 10 C2 3.000000
A2 2 6 C1 3.162278
A13 3 10 C2 3.605551
A15 6 11 C2 4.123106
A1 2 10 C2 4.242641
A11 9 11 C3 5.656854
A6 1 2 C1 6.403124
A9 10 12 C3 7.071068
A3 11 11 C3 7.211103
The nearest data points are:
X_Coordinate Y_Coordinate Class Label distances
Point
A12 4 6 C1 1.414214
A4 6 9 C2 2.236068
A5 6 5 C1 2.236068
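Note that four points in the sorted output share the distance 2.236068, so which of them land in the top three depends on how ties are ordered. The sketch below, which is an optional variation and not part of the original steps, uses the nsmallest() method with keep="all" to return every row tied with the k-th smallest distance. The dataframe closest copies the five closest rows from the sorted output above.

```python
import pandas as pd

# The five closest points, copied from the sorted output above.
closest = pd.DataFrame(
    {
        "Class Label": ["C1", "C2", "C1", "C2", "C2"],
        "distances": [1.414214, 2.236068, 2.236068, 2.236068, 2.236068],
    },
    index=["A12", "A4", "A5", "A8", "A15"],
)

k = 3
# keep="all" keeps every row tied with the k-th smallest value,
# so the result can contain more than k rows.
with_ties = closest.nsmallest(k, "distances", keep="all")
print(with_ties)
print(with_ties["Class Label"].value_counts())
```

With all tied points included, the neighborhood holds five rows instead of three, and the class counts differ from those of the three rows selected with iloc. It is worth keeping in mind that tie handling can influence the predicted label.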
Find The Class Label of The New Data Point
In the final step to implement the KNN classification algorithm from scratch in Python, we have to find the class label of the new data point. For this, we will select the class labels of the k-nearest data points. Then, we will find the mode of the class labels. For this, we will use the mode() function defined in the statistics module.
The mode() function takes the class labels of the k nearest data points and returns the mode, i.e., the most frequent class label. We will assign the class label returned by the mode() function to the new data point.
You can observe this in the following example.
neighbor_class_labels=nearest_data_points["Class Label"]
print("The k nearest class labels are")
print(neighbor_class_labels)
import statistics
predicted_class_label=statistics.mode(neighbor_class_labels)
print("Predicted class label for new data point is:")
print(predicted_class_label)
Output:
The k nearest class labels are
Point
A12 C1
A4 C2
A5 C1
Name: Class Label, dtype: object
Predicted class label for new data point is:
C1
Hence, point (5, 7) is assigned the class C1 after implementing the KNN classification algorithm from scratch in Python.
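The individual steps can also be wrapped into a single reusable function. The following is a minimal sketch under a few assumptions: the function name knn_predict is introduced here, the dataset is rebuilt inline from the table instead of being read from the CSV file, and tied distances are kept in their original order because Python's sorted() is stable.

```python
import math
import statistics
import pandas as pd

def knn_predict(data_points, class_labels, new_data_point, k):
    """Return the most frequent class label among the k nearest training points."""
    # Distance from every training point to the new point.
    distances = [
        math.dist((x, y), new_data_point)
        for x, y in zip(data_points["X_Coordinate"], data_points["Y_Coordinate"])
    ]
    # Positions of the k smallest distances; sorted() keeps tied points in input order.
    nearest = sorted(range(len(distances)), key=distances.__getitem__)[:k]
    # Positional lookup avoids problems with repeated index labels.
    return statistics.mode(class_labels.iloc[nearest])

# Training data copied from the table in this article.
xs = [2, 2, 11, 6, 6, 1, 5, 4, 10, 7, 9, 4, 3, 3, 6]
ys = [10, 6, 11, 9, 5, 2, 10, 9, 12, 5, 11, 6, 10, 8, 11]
labels = ["C2", "C1", "C3", "C2", "C1", "C1", "C2", "C2",
          "C3", "C1", "C3", "C1", "C2", "C2", "C2"]
data_points = pd.DataFrame({"X_Coordinate": xs, "Y_Coordinate": ys})
class_labels = pd.Series(labels, name="Class Label")

print(knn_predict(data_points, class_labels, (5, 7), k=3))  # C1
```

Calling the function with different values of k is a quick way to see how sensitive the prediction is to this hyperparameter.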
Conclusion
In this article, we have discussed the implementation of the KNN classification algorithm from scratch in Python. To learn more about machine learning algorithms, you can read this article on k-means clustering numerical example. You might also like this article on KNN regression using the sklearn module in Python.
I hope you enjoyed reading this article. Stay tuned for more informative articles.
Happy Learning!