
Decision boundary of a nearest-neighbors classifier on the Iris dataset with 3 nearest neighbors
What is the KNN Algorithm?
K-Nearest Neighbor (KNN) is a non-parametric supervised machine learning algorithm. (Supervised machine learning means that the machine learns to map an input to an output based on labeled training data.) It is one of the simplest algorithms used in machine learning for regression and classification. KNN follows the “birds of a feather” strategy in determining where new data fits: it uses all available data and classifies a new case based on a similarity measure, or distance function. The new data point is then assigned to the class to which most of its neighbors belong.
Why is KNN called a “Lazy Learner”?
KNN is often referred to as a lazy learner. This means the algorithm does not use the training data points to do any generalization; in other words, there is no explicit training phase. Because there is no generalization, KNN keeps all of the training data. It is a non-parametric learning algorithm because it assumes nothing about the underlying data.
How to make predictions using the KNN algorithm?
To classify an unknown instance, represented by a feature vector as a point in the feature space, the KNN classifier calculates the distances between that point and the points in the training data set. Usually, the Euclidean distance is used as the distance metric.

Then, find the mode of the labels of the k points the new data is closest to.
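As a tiny worked example (the points, labels, and query below are made up purely for illustration), a prediction boils down to computing distances, keeping the k closest points, and taking the mode of their labels:
import numpy as np
# Toy training data and query point (made up for illustration)
X_train = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [6.0, 5.0], [7.0, 8.0]])
y_train = np.array([0, 0, 0, 1, 1])
query = np.array([2.5, 3.0])
k = 3
# Euclidean distance from the query to every training point
dists = np.sqrt(((X_train - query) ** 2).sum(axis=1))
# Indices of the k nearest neighbors
nearest = np.argsort(dists)[:k]
# The prediction is the most common (mode) label among the neighbors
prediction = np.bincount(y_train[nearest]).argmax()
print(prediction)  # 0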
Distance Formulas
Below are other distance formulas used in the KNN algorithm.
- Minkowski distance is a metric on a normed vector space that can be considered a generalization of both the Euclidean distance and the Manhattan distance.
Minkowski distance
- Euclidean distance is the most commonly used distance formula when calculating KNN. Euclidean distance formula finds the shortest distance between points A and B. This is sometimes referred to as the Pythagorean Theorem
Euclidean distance
- Another distance formula is the Manhattan distance formula. This formula takes the sum of the absolute values of the differences of the coordinates.
Manhattan distance
- Other distance metrics include Chebyshev, weighted Minkowski, standardized Euclidean, and Mahalanobis (a few of these are compared in the sketch below).
You can read this article about the importance of the distance formula in KNN: https://arxiv.org/pdf/1708.04321.pdf
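For a quick feel of how these metrics differ, here is a small sketch using scipy.spatial.distance on two arbitrary points (the points themselves are made up):
from scipy.spatial import distance
# Two arbitrary points, purely for illustration
a = [0, 3, 4, 5]
b = [7, 6, 3, -1]
print(distance.euclidean(a, b))       # straight-line (Euclidean) distance
print(distance.cityblock(a, b))       # Manhattan (city block) distance
print(distance.minkowski(a, b, p=3))  # Minkowski distance with p = 3
print(distance.chebyshev(a, b))       # Chebyshev distance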
How to choose the value of k?
The “K” in KNN stands for the number of nearest neighbors the model uses to assign a class to your new data. This is an important parameter for achieving good accuracy. While there is no structured method for finding the best value of K, here are some suggestions.
- Take the square root of N, where N is the number of samples in your training data (k=sqrt(N)).
- Choose an odd number. This helps avoid ties between two classes.
- Cross-validation. Hold out a small subset of the training data, called the validation set, and test different possible values of K. Then choose the K that yields the best results on the validation set (sketched below).
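A minimal sketch of the cross-validation approach, using scikit-learn's KNeighborsClassifier and cross_val_score on the iris data (the range of k values tried here is arbitrary):
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

iris = load_iris()
# Try odd values of k and keep the one with the best cross-validated accuracy
scores = {}
for k in range(1, 26, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, iris.data, iris.target, cv=5).mean()
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])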
Implementing the KNN Algorithm
The KNN model can be implemented in 4 easy steps. To demonstrate the KNN algorithm, I am using it on the iris dataset, a preloaded dataset from the sklearn library. The iris dataset includes three iris species (Setosa, Versicolor, and Virginica), with 50 samples of each. For each sample, we have the sepal length and width and the petal length and width.
1. Look at the Data
The KNN algorithm will use the relationships between the 4 features to make predictions on the test data.

Boxplot of Iris Features
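A boxplot like the one above can be produced with a few lines of pandas and matplotlib; this is only a sketch, and the styling of the original figure may differ:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
# Put the four features into a DataFrame and draw one box per feature
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df.boxplot()
plt.title("Boxplot of Iris Features")
plt.ylabel("cm")
plt.show()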
# Load Data
# Import libraries
from sklearn.datasets import load_iris
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split
iris = load_iris()
# Separate into target and features
X = scale(iris.data)  # Scale features
y = iris.target
# Split into training and test sets (split size is an assumption; 30% matches the 45 test predictions shown later)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
1a. Initialize K to your chosen number of neighbors
# Instantiate model
clf = KNN(k=13)
I used 13 because this is approximately sqrt(N), as mentioned above.
1b. Fit the model
# Fit on X_train and y_train
clf.fit(X_train, y_train)
For the fit, KNN just stores the training data in memory; there is no active training phase.
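The constructor and fit method of this KNN class aren't shown here, but because a lazy learner only memorizes its training data, a minimal skeleton presumably looks something like the following (a sketch, not the author's exact code; it also pulls in the numpy and scipy imports that the methods below rely on):
import numpy as np
from scipy import stats

class KNN:
    def __init__(self, k=5):
        # Number of neighbors to consider
        self.k = k

    def fit(self, X_train, y_train):
        # "Training" just stores the data in memory
        self.X_train = X_train
        self.y_train = y_train
        return self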
2. For each example in the data:
- Calculate the distance between the query example and the current example from the data.
- Add the distance and the index of the example to an ordered collection
# Distance formula (method of the KNN class)
def euclidean_distance(self, point1, point2, length):
    distance = 0
    for x in range(length):
        distance += (point1[x] - point2[x]) ** 2
    return np.sqrt(distance)
# Calculate distances between X_train and X_test
def _get_distance(self, X_test):
    length = X_test.shape[1]
    # Initialize distance array
    distances = []
    for idx in range(len(X_test)):
        # Record the X_test row and initialize an array to hold its distances
        distances.append([X_test[idx], []])
        # Loop through each row in X_train
        for row in self.X_train:
            # Find the Euclidean distance and append it to the distance list
            dist = self.euclidean_distance(row, X_test[idx], length)
            distances[idx][1].append(dist)
    return distances
3. Find the neighbors
- Sort the collection of distances and indices in ascending order by distance
- Pick the first K entries from the sorted collection
- Get the labels of the selected K entries
# Get predictions
def _get_labels(self, distances):
    # Initialize y_pred array
    y_pred = []
    for row in range(len(distances)):
        # Sort distances and keep the indices of the k smallest
        distance = distances[row]
        y_indices = np.argsort(distance[1])[:self.k]
        # Find the classes that correspond with the nearest neighbors
        k_nearest_classes = [self.y_train[i] for i in y_indices]
        # Get the label from the mode of those classes
        label = stats.mode(k_nearest_classes)[0][0]
        y_pred.append(label)
    return y_pred
4a. If classification, return the mode of the K labels.
4b. If regression, return the mean of the K labels.
def predict(self, X_test):
    distances = self._get_distance(X_test)
    return self._get_labels(distances)
# Predict on X_test
predict = clf.predict(X_test)
This returns:
[1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 2,
0, 2, 2, 2, 2, 2, 0, 0, 0, 0, 1, 0, 0, 2, 1, 0, 0, 0, 2, 1, 1, 0, 0]
# Calculate accuracy
from sklearn.metrics import accuracy_score
print("KNN model accuracy:", accuracy_score(y_test, predict))
KNN model accuracy: 1.0
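As a sanity check (the repo linked below compares this algorithm against scikit-learn's KNeighborsClassifier; this sketch does the same on the split above):
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Fit sklearn's KNN with the same k and compare accuracy on the same test set
sk_clf = KNeighborsClassifier(n_neighbors=13)
sk_clf.fit(X_train, y_train)
sk_pred = sk_clf.predict(X_test)
print("sklearn KNN accuracy:", accuracy_score(y_test, sk_pred))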
Optimizing with K Values
You can check the error rate for different values of k. This helps determine whether your choice of k is the best one for your model.
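One way to run that check, sketched here with scikit-learn's KNeighborsClassifier on the same train/test split (the range of k values is arbitrary), is to plot the test error rate against k:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Test error rate for a range of k values
error_rate = []
ks = range(1, 26)
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    error_rate.append(np.mean(knn.predict(X_test) != y_test))

plt.plot(ks, error_rate, marker="o")
plt.xlabel("k")
plt.ylabel("Error rate")
plt.title("Error rate vs. k")
plt.show()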

How is KNN used?
KNN is a simple yet powerful algorithm, and it requires no training phase before making predictions. Below are a few applications of the KNN algorithm.
- Recommendation Systems: KNN is used for creating simple recommendation systems, such as the ones Amazon or Netflix use to suggest a movie or an item to purchase.
- Credit Rating: KNN can be used to estimate an individual’s credit rating by comparing a person’s financial information to others with similar profiles.
- Banking: Should a bank give a loan to an individual? Will the person default on their loan?
- Other examples include handwriting detection, image recognition, and even video recognition.
When to use KNN?
KNN is a powerful algorithm, but it is not the best choice for every machine learning problem. So when should we use it? KNN is most useful for data that is noise-free, small, and labeled.
Pros and Cons of KNN
Pros
- It is easy to use; you only need to choose k and a distance formula.
- Fast to get running, with relatively high accuracy; since it is a lazy learner, no training is necessary.
- Versatile, can be used for regression or classification.
Cons
- Computationally expensive at prediction time, and memory-intensive, because it stores all of the training data.
- Sensitive to irrelevant features and the scale of the data.
- Can be slow depending on the size of the data.
Summary
KNN is a simple yet powerful and flexible algorithm that can be used for both regression and classification problems.
4 Easy Steps:
- Look at the data.
- Calculate the distance.
- Find neighbors.
- Find the most common label.
. . .
Find Code:
Find the code on Github.
Specific Files:
Find the code for my KNN algorithm here.
Find the code for my KNN test as a .py file here or as a Jupyter notebook here. (I compare my algorithm to KNeighborsClassifier from the sklearn library.)
Find my data visualizations of the iris dataset here.
Sources:
- https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
- https://arxiv.org/pdf/1708.04321.pdf
- https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm#Algorithm