K-Nearest Neighbour Algorithm: “The Magic Behind the Scenes”
By Aminah Mardiyyah Rufai
The K-Nearest Neighbour Algorithm:
A quite interesting algorithm, K-Nearest Neighbour is a Supervised Machine Learning algorithm. Although it can be used for both Regression and Classification problems, it finds more application as a Classification algorithm. It is also known as an ‘instance-based’, ‘memory-based’, and ‘non-parametric’ learning algorithm. For an understanding of what these terms mean, I have embedded links to useful resources and also included them below for further reading.
The K-nearest neighbour (also known as KNN) algorithm is often called ‘a lazy algorithm’ or a ‘lazy learner’. Since it is an instance-based model, it tends to memorise the data points. It does not generate a discriminating function (used for creating decision boundaries) during training, as algorithms like Logistic Regression or Support Vector Machines do. Instead, it predicts classes using a distance measure and a voting system among the nearest/closest data points. Hence its name, ‘K-Nearest Neighbour’. This is also why it cannot exactly be classified as either a ‘Discriminative model’ or a ‘Generative model’. I have a separate article that explains the difference between a Discriminative and a Generative model. Find the link here.
In this article, the focus is on how KNN works as a classification Algorithm.
Fun Fact: Did you know KNN was built in an attempt to mimic the human habit of memorising events as they occur? The intuition is ‘Experience is the best teacher’. You know how you see or experience an event and the results of the actions as they occur, and then automatically that sticks as a default setting for you?
KNN as a Classifier:
KNN carries out classification tasks by grouping data points based on feature similarity. A new data point is then classified according to its closest/nearest neighbours. The algorithm stores/memorises all available cases in a dataset and classifies new data points using a similarity measure between each new point and its nearest neighbour(s). The similarity is defined by a distance measurement between the new data point and its closest neighbours. A popular distance measure is the “Euclidean Distance”. Others are Minkowski, Manhattan, Hamming, etc.
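Before looking at the mechanics, here is a minimal sketch of KNN as a classifier using Scikit-Learn's `KNeighborsClassifier` on the built-in Iris dataset (the dataset and the train/test split are my own choices for illustration):

```python
# A minimal sketch: fit a KNN classifier and score it on held-out data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_neighbors is the 'k'; the metric defaults to Minkowski with p=2 (Euclidean).
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)          # "training" here mostly just stores the data
print(knn.score(X_test, y_test))   # accuracy on the test set
```

Note that `fit` is nearly instantaneous: as a lazy learner, KNN defers all the real work to prediction time.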
‘K’ in KNN is a parameter that signifies the number of nearest neighbours that take part in the voting process. The K-value is typically a positive odd integer (1, 3, 5, 7, et cetera). If k=1, the data point is assigned the class of its single nearest neighbour (closest by distance); if k=3, it is compared against its three nearest neighbours. If these three neighbours all belong to different classes, the tie is broken using the shortest calculated distance under whichever distance metric is in use.
The Pseudo-code for this process is as follows:
- Load the data
- Initialise k (set a value for k or use the default value in Scikit-Learn)
- For each data point in the test set (data to be predicted):
1. Calculate the distance between the test data point and every data point in the training set. The distance metric can be Euclidean, Manhattan, Minkowski, etc. (Minkowski, which reduces to Euclidean at p=2, is the default in Sklearn).
2. Store the calculated distances in a list sorted in ascending order (smallest to largest).
3. Choose the ‘k’ entries at the top of the sorted list (if k=3, the 3 closest points are picked).
4. Vote on the majority class present among these k neighbours (for example, if 2 out of 3 neighbours belong to the same class, that majority class wins).
5. Return the predicted class based on the majority vote.
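The pseudo-code above can be sketched from scratch in a few lines of Python. This is an illustrative toy implementation (the helper name `knn_predict` and the sample points are my own), using Euclidean distance and a majority vote:

```python
# A from-scratch sketch of the pseudo-code: distance to all training points,
# sort ascending, take the k closest, majority vote.
from collections import Counter
import math

def knn_predict(train_X, train_y, test_point, k=3):
    # Compute the distance from the test point to every training point.
    distances = [(math.dist(test_point, x), label)
                 for x, label in zip(train_X, train_y)]
    # Sort ascending and keep the k closest entries.
    k_nearest = sorted(distances)[:k]
    # Majority vote among the k labels.
    votes = Counter(label for _, label in k_nearest)
    return votes.most_common(1)[0][0]

train_X = [(1, 1), (2, 1), (8, 8), (9, 8)]
train_y = ['A', 'A', 'B', 'B']
print(knn_predict(train_X, train_y, (2, 2), k=3))  # → 'A'
```

The test point (2, 2) sits closest to the two 'A' points, so with k=3 the vote is 2 'A' against 1 'B'.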
So if, for example, we have a class to be predicted as seen in the image below:
Let’s say the initial value is k=1: the single closest neighbour by distance is located among the training data points, and the new data point is predicted to be in the same class as that training point. However, if k=3, the three closest neighbours by distance are located and a vote for the majority class present is carried out; the test data point is then classified into the same class as the majority.
The same process applies for k=5, k=7, et cetera.
Curious about why I seem to only mention odd values of k?
CHOOSING K VALUES
So typically, we do not want to confuse the algorithm or create a scenario where deciding which class a test data point belongs to becomes difficult. Assume we pick an even value of k, say k=4, and of these 4 closest neighbours, 2 belong to one class and 2 to another, with the calculated distances also approximately the same. How then will voting be carried out? This is why k values are best chosen as odd numbers, especially when the number of class labels is even (e.g. a binary classification problem).
One major point of concern for this algorithm is: ‘What is the right k value to pick for a given use case?’
Let’s take a look at the example below. Notice the difference in predicted class for different values of k.
When k = 3, the predicted class is ‘Red’, based on the majority vote. However, when k = 5, the predicted class changes to ‘Blue’.
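This flip is easy to reproduce. The tiny 1-D dataset below is my own construction, chosen so that the same query point lands in different classes depending on k:

```python
# A toy 1-D sketch of the flip described above: the same query point is
# classified 'Red' with k=3 but 'Blue' with k=5.
from sklearn.neighbors import KNeighborsClassifier

X = [[1], [2], [3], [10], [11]]
y = ['Red', 'Red', 'Blue', 'Blue', 'Blue']

for k in (3, 5):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(k, knn.predict([[4]])[0])
# k=3 → 'Red'  (neighbours at 3, 2, 1: one Blue, two Red)
# k=5 → 'Blue' (all five points vote: 3 Blue vs 2 Red)
```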
How then can we choose the right ‘k-value’ to avoid mis-classification problems?
There are two common methods for choosing the k value in machine learning.
- Square Root Method: Use the square root of the number of samples in the training dataset as k.
- The Cross-Validation Method: Use cross-validation to find the optimal value of k in KNN.
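The cross-validation method can be sketched with Scikit-Learn's `cross_val_score`. The dataset and the range of candidate k values here are my own choices for illustration:

```python
# A sketch of the cross-validation method: score odd k values with 5-fold CV
# and keep the one with the best mean accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = {}
for k in range(1, 22, 2):  # odd values 1, 3, ..., 21
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```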
DISTANCE METRICS USED
Some Common Distance Metrics Used in KNN
As mentioned earlier, there are several distance metrics used in KNN. Some of the common ones are:
- The Euclidean Distance: This is a method most of us have used at some point in high school or college. It is the square root of the sum of the squared differences of the coordinates: d(x, y) = √((x₁ − y₁)² + (x₂ − y₂)² + … + (xₙ − yₙ)²).
- The Manhattan Distance: This is slightly different from the Euclidean distance. It is the distance between two points measured along axes at right angles.
So while the Euclidean formula calculates the distance between points by taking the square root of the sum of the squares of the differences of the coordinates, the Manhattan formula calculates the distance by taking the sum of the absolute values of the differences of the coordinates: d(x, y) = |x₁ − y₁| + |x₂ − y₂| + … + |xₙ − yₙ|.
Difference Between Euclidean and Manhattan
- The Minkowski distance: This is a generalised form of the Euclidean and the Manhattan distances, given by d(x, y) = (Σ |xᵢ − yᵢ|ᵖ)^(1/p), where p = 1 recovers the Manhattan distance and p = 2 recovers the Euclidean distance.
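The three metrics are easy to compare side by side with NumPy. The two sample points below are my own, chosen as the classic 3-4-5 pair so the results are easy to check by hand:

```python
# Euclidean, Manhattan, and Minkowski (p=3) on one pair of points.
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))        # sqrt(9 + 16) = 5.0
manhattan = np.sum(np.abs(a - b))                # 3 + 4 = 7.0
minkowski = np.sum(np.abs(a - b) ** 3) ** (1/3)  # the p=3 generalisation

print(euclidean, manhattan, minkowski)
```

Minkowski reduces to Manhattan at p=1 and to Euclidean at p=2, which is why it is described as the generalised form.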
There are several other distance metrics used, reference the links below for in-depth knowledge.
PROS AND CONS OF THE KNN ALGORITHM
Pros:
- It is one of the simplest machine learning algorithms to use.
- Easy to implement and understand.
- Gives reasonable performance.
- Serves as a good baseline model.
- Can be used for both classification and regression problems.
Cons:
- Because it is a memory-based algorithm, prediction becomes slow and memory-hungry on larger datasets, since every query is compared against all stored training points.
- It is sensitive to outliers present in the data.
- It struggles when there are too many features in the dataset (the ‘curse of dimensionality’).
The focus of this article was to explain some basic concepts around how the K-nearest neighbour algorithm works behind the scenes when using the library. For further study, reference the links below.
I have an article on a simple use-case using this algorithm. Please find the link below.
Thank you for reading!
This write-up originally appeared in this post.
Aminah Rufai is a Data Scientist
Follow her on social media: