KNN Classification Using sklearn Module in Python

Classification is often used in machine learning to derive solutions to different business problems. In this article, we will discuss the implementation of the KNN classification algorithm using the sklearn module in Python.

What is KNN Classification Algorithm?

KNN (K-Nearest Neighbors) is a popular machine-learning algorithm for classification tasks. The basic idea behind the KNN algorithm is to find the K data points in a training set that are closest to a new data point. Then the algorithm classifies the new data point based on the majority class of its K nearest neighbors.

To learn about the features, advantages, disadvantages, and applications of the K-Nearest Neighbors classification algorithm, you can read this article on KNN classification numerical example.

KNN Classification Algorithm

KNN (K-Nearest Neighbors) is a simple classification algorithm that is based on the principle of instance-based learning. Instead of building an explicit model, it stores the training data and defers all computation until a prediction is needed. The main idea behind KNN is to classify a new data point based on the class labels of its closest neighbors in the training data.

The algorithm consists of the following steps:

  1. First, we choose the number of nearest neighbors, K.
  2. Then, we calculate the distance between the new data point and all the points in the training data.
  3. Next, we select the K training points that are closest to the new data point.
  4. Finally, we determine the class label of the new data point based on the majority class of its K nearest neighbors.
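The four steps above can be sketched in plain Python without any library. This is only an illustration of the idea, not how sklearn implements it. The function name knn_predict and the toy points are made up for this example, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def knn_predict(training_points, training_labels, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training points."""
    # Step 2: distance from the new point to every training point
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(training_points, training_labels)
    ]
    # Step 3: keep the k training points with the smallest distances
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 4: majority vote over the labels of those k neighbors
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated groups
points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
labels = ["A", "A", "A", "B", "B", "B"]

# Step 1: choose k
print(knn_predict(points, labels, (2, 2), k=3))  # A
print(knn_predict(points, labels, (7, 9), k=3))  # B
```

Sorting all distances is the simplest approach. It is fine for small datasets, but library implementations typically use faster neighbor-search structures.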

The distance metric used in KNN can be any standard distance measure, such as Euclidean distance, Manhattan distance, or Minkowski distance.
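The relationship between these metrics can be seen in a few lines of code. The sketch below assumes the usual definition of the Minkowski distance with order p, where p=1 gives the Manhattan distance and p=2 gives the Euclidean distance. The helper name minkowski_distance is hypothetical.

```python
import math

def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (5, 7), (4, 6)

manhattan = minkowski_distance(a, b, p=1)  # |5-4| + |7-6| = 2.0
euclidean = minkowski_distance(a, b, p=2)  # sqrt(1 + 1), about 1.414

print(manhattan)
print(euclidean)
```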

The KNeighborsClassifier() Function

The KNeighborsClassifier() function, defined in the sklearn.neighbors module, is used to perform KNN classification. It has the following syntax.

sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None) 

Here, 

  • The n_neighbors parameter is used to decide the number of neighbors to consider while classifying a new data point. 
  • The weights parameter is used to decide the weightage of neighbors for the given sample in the dataset. By default, it is “uniform” denoting that all neighbors are weighted equally. 
  • You can set the weights parameter to ‘distance’ if you want to weigh points by the inverse of their distance. In this case, closer neighbors of a data point will have a greater influence than neighbors that are further away.
  • You can also pass a user-defined function that accepts an array of distances, and returns an array of the same shape containing the weights. In this way, you can explicitly define the weight of a neighbor as a function of its distance from the data point.
  • The algorithm parameter selects the algorithm used to compute the nearest neighbors. It accepts “auto”, “ball_tree”, “kd_tree”, or “brute”. When it is set to “auto”, the function will attempt to decide the most appropriate algorithm for neighbor calculation based on the training data.
  • The metric parameter is used to decide the metric for computing distances between the data points. By default, the metric parameter is set to “minkowski”. Together with the default p=2, this is equivalent to the Euclidean distance; p=1 gives the Manhattan distance.
    • If the metric parameter is set to “precomputed”, the input given to the fit() method must be a square distance matrix holding the pairwise distances between the training points. Because you compute the distances yourself, you can use the “precomputed” setting to perform KNN classification for categorical and mixed data types.
    • You can also pass a function to the metric parameter. In this case, the function must take two arrays representing 1D vectors as inputs and must return one value indicating the distance between those vectors. I.e., the function should calculate the distance between any two data points in the dataset.
  • The n_jobs parameter is used to decide the number of parallel jobs to run for the neighbor search. By default, it is set to None, which normally means that the function runs only one job. If n_jobs is set to -1, the KNN classifier defined in the sklearn module uses all the processors. You can also set a specific number of jobs using the n_jobs parameter.
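The effect of the weights parameter described above can be seen on a tiny example. The sketch below uses made-up one-dimensional data in which one training point is close to the query and two are farther away. The function inverse_square is a hypothetical user-defined weight function.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy 1-D data (made up): one point close to the query, two farther away
X = [[0.0], [3.0], [3.5]]
y = ["near", "far", "far"]
query = [[1.0]]  # distances to the training points: 1.0, 2.0, 2.5

# All three neighbors count equally: "far" wins 2 votes to 1
uniform = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X, y)

# Inverse-distance weights: "near" gets 1/1 = 1.0, "far" gets 1/2 + 1/2.5 = 0.9
distance = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X, y)

def inverse_square(distances):
    # Receives an array of distances and returns weights of the same shape.
    # Assumes no distance is zero, which holds for this toy query.
    return 1.0 / distances ** 2

custom = KNeighborsClassifier(n_neighbors=3, weights=inverse_square).fit(X, y)

print(uniform.predict(query))   # ['far']
print(distance.predict(query))  # ['near']
print(custom.predict(query))    # ['near']
```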

After execution, the KNeighborsClassifier() function returns an untrained KNeighborsClassifier object. We can train this untrained classifier using the fit() method.

The fit() method takes the training data points as its first input argument and their class labels as its second input argument. After execution, it returns the same classifier object, which is now trained. You can use this trained model to predict class labels for new data points.
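When the metric parameter is set to “precomputed”, the fit() and predict() methods receive distance matrices instead of raw data points. The following is a minimal sketch of that workflow. It assumes Manhattan distance purely for illustration, and the helper manhattan_matrix and the toy data are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data (made up for illustration)
X_train = np.array([[1.0, 1.0], [2.0, 1.0], [8.0, 8.0], [9.0, 8.0]])
y_train = ["A", "A", "B", "B"]
X_new = np.array([[2.0, 2.0]])

def manhattan_matrix(A, B):
    """Pairwise Manhattan distances: rows index A, columns index B."""
    return np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)

model = KNeighborsClassifier(n_neighbors=3, metric="precomputed")

# fit() receives the square (n_train x n_train) training distance matrix
model.fit(manhattan_matrix(X_train, X_train), y_train)

# predict() receives the (n_new x n_train) distances to the training points
predicted = model.predict(manhattan_matrix(X_new, X_train))
print(predicted)  # ['A']
```

Any distance function suited to your data could replace manhattan_matrix here, as long as the matrices have the shapes noted in the comments.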

KNN Classification Using the Sklearn Module in Python

To perform KNN classification using the sklearn module in Python, we will use the following dataset.

Point    Coordinates    Class Label
A1       (2, 10)        C2
A2       (2, 6)         C1
A3       (11, 11)       C3
A4       (6, 9)         C2
A5       (6, 5)         C1
A6       (1, 2)         C1
A7       (5, 10)        C2
A8       (4, 9)         C2
A9       (10, 12)       C3
A10      (7, 5)         C1
A11      (9, 11)        C3
A12      (4, 6)         C1
A13      (3, 10)        C2
A14      (3, 8)         C2
A15      (6, 11)        C2
KNN classification dataset

The above dataset contains 15 data points and has three class labels. We will build the KNN classifier from these data points using the sklearn module.

Here, we have clean data with no noise or outliers. In real-world data, you won’t get such quality. Therefore, you might need to perform data preprocessing steps such as data cleaning, normalization, handling missing values, removing bias from the data, and others.
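Normalization matters for KNN because the algorithm relies on distances, so a feature with a large numeric range can dominate the result. The sketch below shows one possible way to handle this, assuming sklearn's StandardScaler combined with the classifier through make_pipeline. The two-point dataset is made up so that scaling changes the prediction.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: the first feature lies between 0 and 1,
# the second feature lies around 1000
X = [[0.1, 1000.0], [0.9, 1010.0]]
y = ["a", "b"]
query = [[0.85, 1004.0]]

# Without scaling, the second feature dominates the distance
raw_model = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# With scaling, both features contribute on a comparable scale
scaled_model = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=1)
).fit(X, y)

raw_prediction = raw_model.predict(query)[0]
scaled_prediction = scaled_model.predict(query)[0]
print(raw_prediction)     # a
print(scaled_prediction)  # b
```

The pipeline applies the same scaling to the query as to the training data, so you do not need to scale new points manually.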

In this article, let us just use the above dataset to understand how the KNN classification algorithm works using the sklearn module in Python. 

To perform KNN classification using the above dataset, we will use the following steps.

  • First, we will create a list of data points and another list of class labels of the data points. We name this list of data points as data_points and the list of class labels as class_labels.
  • Next, we will create an untrained KNN classifier using the KNeighborsClassifier() function defined in the sklearn module. Here, we will set the number of neighbors n_neighbors to 3. We will also set the metric parameter to “euclidean” to use Euclidean distance as the distance metric.
  • Then, we will train the KNN classifier model using the fit() method. The fit() method takes the list of data points as its first input argument and the list of class labels as its second input argument. After execution, the fit() method will return the trained machine-learning model. For datasets having multiple attributes, you can also pass the dataframe containing attributes as its first input argument and a series or list of class labels for the data points as the second input argument. 
  • Once we get the trained machine learning model for KNN classification, we can use the predict() method to predict class labels. The predict() method, when invoked on the trained KNN classifier, takes a list of data points as its input argument. After execution, it returns a NumPy array containing the predicted class label for each data point in the input.

You can observe the entire process in the following example.

from sklearn.neighbors import KNeighborsClassifier
# create the list of data points
data_points = [(2, 10), (2, 6), (11, 11), (6, 9), (6, 5), (1, 2), (5, 10), (4, 9), (10, 12), (7, 5), (9, 11), (4, 6), (3, 10), (3, 8), (6, 11)]
# create the list of class labels, one per data point
class_labels = ["C2", "C1", "C3", "C2", "C1", "C1", "C2", "C2", "C3", "C1", "C3", "C1", "C2", "C2", "C2"]
# create the untrained model
untrained_model = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
# train the model using the fit() method
trained_model = untrained_model.fit(data_points, class_labels)
# predict the class label for a new data point
predicted_class = trained_model.predict([(5, 7)])
print("The data points are:")
print(data_points)
print("The class labels are:")
print(class_labels)
print("The predicted class label for (5,7) is:")
print(predicted_class)

Output:

The data points are:
[(2, 10), (2, 6), (11, 11), (6, 9), (6, 5), (1, 2), (5, 10), (4, 9), (10, 12), (7, 5), (9, 11), (4, 6), (3, 10), (3, 8), (6, 11)]
The class labels are:
['C2', 'C1', 'C3', 'C2', 'C1', 'C1', 'C2', 'C2', 'C3', 'C1', 'C3', 'C1', 'C2', 'C2', 'C2']
The predicted class label for (5,7) is:
['C1']

In this example, we have implemented the KNN classification algorithm using the sklearn module. Then, we used the trained model to predict the class label for the data point (5,7).
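To see which training points drive a prediction, you can call the kneighbors() method of the trained classifier. It returns the distances to the nearest neighbors and their positions in the training data. The sketch below reuses the dataset from this article. Note that for the point (5,7), the nearest neighbor (4, 6) lies at a distance of about 1.414, while four training points, (6, 5), (6, 9), (4, 9), and (3, 8), are all tied at a distance of about 2.236 and carry different labels. Which of the tied points are selected as the second and third neighbors depends on the implementation and on the ordering of the training data, so treat the printed neighbors as one possible outcome.

```python
import math
from sklearn.neighbors import KNeighborsClassifier

data_points = [(2, 10), (2, 6), (11, 11), (6, 9), (6, 5), (1, 2), (5, 10), (4, 9),
               (10, 12), (7, 5), (9, 11), (4, 6), (3, 10), (3, 8), (6, 11)]
class_labels = ["C2", "C1", "C3", "C2", "C1", "C1", "C2", "C2",
                "C3", "C1", "C3", "C1", "C2", "C2", "C2"]

model = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
model.fit(data_points, class_labels)

# distances and indices each have one row per query point
distances, indices = model.kneighbors([(5, 7)])

for distance, index in zip(distances[0], indices[0]):
    print(data_points[index], class_labels[index], round(distance, 3))
```

When several neighbors are tied like this, trying a different value of n_neighbors or setting weights to ‘distance’ can make the prediction less sensitive to the tie.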

Find Class Labels in The sklearn KNN Classifier

To find the properties of a KNN classifier, we can use the attributes of the trained machine-learning model.

To find the class labels in the KNN classifier, we can use the classes_ attribute. The classes_ attribute of the trained classifier contains all the classes known to the k-nearest neighbors classification model. You can observe this in the following example.

from sklearn.neighbors import KNeighborsClassifier
# create the list of data points
data_points = [(2, 10), (2, 6), (11, 11), (6, 9), (6, 5), (1, 2), (5, 10), (4, 9), (10, 12), (7, 5), (9, 11), (4, 6), (3, 10), (3, 8), (6, 11)]
# create the list of class labels, one per data point
class_labels = ["C2", "C1", "C3", "C2", "C1", "C1", "C2", "C2", "C3", "C1", "C3", "C1", "C2", "C2", "C2"]
# create the untrained model
untrained_model = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
# train the model using the fit() method
trained_model = untrained_model.fit(data_points, class_labels)
print("The data points are:")
print(data_points)
print("The class labels are:")
print(class_labels)
# classes_ holds the distinct class labels seen during training
print("The class labels in the model are:")
print(trained_model.classes_)

Output:

The data points are:
[(2, 10), (2, 6), (11, 11), (6, 9), (6, 5), (1, 2), (5, 10), (4, 9), (10, 12), (7, 5), (9, 11), (4, 6), (3, 10), (3, 8), (6, 11)]
The class labels are:
['C2', 'C1', 'C3', 'C2', 'C1', 'C1', 'C2', 'C2', 'C3', 'C1', 'C3', 'C1', 'C2', 'C2', 'C2']
The class labels in the model are:
['C1' 'C2' 'C3']

In this example, you can observe that there are three distinct classes in the training data. Hence, the classes_ attribute of the KNN model contains the array ['C1' 'C2' 'C3'].
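The order of the labels in classes_ also matters when you ask the classifier for class probabilities, because the columns returned by the predict_proba() method follow the same order. The sketch below illustrates this with the article's dataset. The query point (10, 10) is chosen for this example because its three nearest neighbors are unambiguous and all belong to class C3.

```python
from sklearn.neighbors import KNeighborsClassifier

data_points = [(2, 10), (2, 6), (11, 11), (6, 9), (6, 5), (1, 2), (5, 10), (4, 9),
               (10, 12), (7, 5), (9, 11), (4, 6), (3, 10), (3, 8), (6, 11)]
class_labels = ["C2", "C1", "C3", "C2", "C1", "C1", "C2", "C2",
                "C3", "C1", "C3", "C1", "C2", "C2", "C2"]

model = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
model.fit(data_points, class_labels)

# One row per query point, one column per entry of classes_
probabilities = model.predict_proba([(10, 10)])

for label, probability in zip(model.classes_, probabilities[0]):
    print(label, probability)
# C1 0.0
# C2 0.0
# C3 1.0
```

With uniform weights, each probability is simply the fraction of the K nearest neighbors that belong to the class.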

Find the Number of Training Samples in the KNN Classifier

You can also find the number of data points used while training the K-Nearest Neighbors classifier. For this, you can use the n_samples_fit_ attribute of the trained model. The n_samples_fit_ attribute contains the number of training samples passed to the fit() method. You can observe this in the following example.

from sklearn.neighbors import KNeighborsClassifier
# create the list of data points
data_points = [(2, 10), (2, 6), (11, 11), (6, 9), (6, 5), (1, 2), (5, 10), (4, 9), (10, 12), (7, 5), (9, 11), (4, 6), (3, 10), (3, 8), (6, 11)]
# create the list of class labels, one per data point
class_labels = ["C2", "C1", "C3", "C2", "C1", "C1", "C2", "C2", "C3", "C1", "C3", "C1", "C2", "C2", "C2"]
# create the untrained model
untrained_model = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
# train the model using the fit() method
trained_model = untrained_model.fit(data_points, class_labels)
print("The data points are:")
print(data_points)
print("The class labels are:")
print(class_labels)
# n_samples_fit_ holds the number of samples passed to fit()
print("The number of data points in training is:")
print(trained_model.n_samples_fit_)

Output:

The data points are:
[(2, 10), (2, 6), (11, 11), (6, 9), (6, 5), (1, 2), (5, 10), (4, 9), (10, 12), (7, 5), (9, 11), (4, 6), (3, 10), (3, 8), (6, 11)]
The class labels are:
['C2', 'C1', 'C3', 'C2', 'C1', 'C1', 'C2', 'C2', 'C3', 'C1', 'C3', 'C1', 'C2', 'C2', 'C2']
The number of data points in training is:
15

We have passed 15 data points to the fit() method. Hence, you can observe that the n_samples_fit_ attribute of the trained KNN model contains the value 15.
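The trained classifier exposes a few other attributes that can be inspected in the same way. The sketch below prints n_features_in_, which holds the number of features seen during training, and effective_metric_, which holds the distance metric actually used. It assumes a reasonably recent sklearn version, since n_features_in_ is not available in older releases.

```python
from sklearn.neighbors import KNeighborsClassifier

data_points = [(2, 10), (2, 6), (11, 11), (6, 9), (6, 5), (1, 2), (5, 10), (4, 9),
               (10, 12), (7, 5), (9, 11), (4, 6), (3, 10), (3, 8), (6, 11)]
class_labels = ["C2", "C1", "C3", "C2", "C1", "C1", "C2", "C2",
                "C3", "C1", "C3", "C1", "C2", "C2", "C2"]

model = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
model.fit(data_points, class_labels)

print(model.n_samples_fit_)     # 15 training samples
print(model.n_features_in_)     # 2 features (the x and y coordinates)
print(model.effective_metric_)  # euclidean
```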

Conclusion

In this article, we have discussed the K-Nearest Neighbors classification algorithm using the sklearn module in Python. We also saw how to determine different attributes of a trained KNN classifier created using the KNeighborsClassifier() function defined in the sklearn module.

To learn more machine learning concepts, you can read this article on K-Means clustering numerical example. You might also like this article on how to find the best k in k-prototypes clustering.

I hope you enjoyed reading this article. Stay tuned for more informative articles!

Happy Learning!
