k-Nearest Neighbors in scikit-learn

The k-Nearest Neighbor (KNN) algorithm can be used for classification or regression. It’s one of the simpler machine learning algorithms, but it’s usually pretty effective.

It is non-parametric, meaning it doesn't assume a fixed functional form for the data. Instead of learning a fixed set of parameters, it keeps the training examples around and uses them directly at prediction time.

KNN barely has a training step; most of the work happens at prediction time. To classify a point, it finds the k "nearest" training points and returns the class that appears most often among them. If k = 1, you only look at the single closest point and return its class, which makes the prediction very sensitive to noise. The best value of k depends on the data, but something in the 3–10 range is a common starting point.
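To make the decision rule concrete, here's a rough sketch of a single prediction (a toy illustration, not scikit-learn's internals; knn_vote and its arguments are made up for this post), assuming the distance from the query point to every training point has already been computed:

import numpy as np
from collections import Counter

def knn_vote(distances, train_labels, k=3):
    # indices of the k training points closest to the query point
    nearest = np.argsort(distances)[:k]
    # majority vote among the labels of those k neighbors
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]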

Distance

Euclidean distance between a test input x and the training examples is the typical distance measure. Other distance metrics are also used, but this is the most common.
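In code, it's just the square root of the summed squared differences across features, something like this (a quick sketch with numpy, not how scikit-learn computes it internally):

import numpy as np

def euclidean_distance(x, example):
    # sqrt of the sum of squared feature differences
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(example)) ** 2))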

Note – kNN assumes we are in a metric space, so it treats a one-unit increase in shoe size as equivalent to a one-unit increase in height. When you use this algorithm outside of this tutorial, put your features on the same scale first.
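One way to do that is with scikit-learn's StandardScaler. This is just a sketch (X_train and X_test are hypothetical feature arrays, not the data used later in this post):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# fit the scaler on the training features only, then apply it to both sets
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)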

The Curse of Dimensionality

KNN struggles when we have a ton of features, especially noisy or irrelevant ones. In a sense, noise pushes points that would have been close to each other farther apart. You can use tools such as PCA to reduce dimensionality, and that's a good practice if you have more than 10 or so features.
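If you do end up with that many features, the PCA step might look something like this (a sketch; X_train, X_test, and the number of components are hypothetical):

from sklearn.decomposition import PCA

pca = PCA(n_components=5)   # keep the first 5 principal components
X_train_reduced = pca.fit_transform(X_train)
X_test_reduced = pca.transform(X_test)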

Example – Predicting Grade Level

I made some data for shoe sizes and heights of 4th graders vs. seniors in high school, so we could have a simple classification example. Plus I'm tired of the iris dataset.

You can download the data and the IPython notebook on my GitHub.

The dataset we're working with has two features, shoe size and height, plus the class label: 4th grade or senior. In the plot below, red is fourth grade and blue is senior.

[Scatter plot of shoe size vs. height, colored by class]

import pandas as pd
import numpy as np

train = pd.read_csv('train.csv') # better be in the correct directory!
test = pd.read_csv('test.csv')

train.head()

[Output of train.head(): the first few rows, with shoe size, height, and class columns]
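If you want to recreate the scatter plot from earlier yourself, something along these lines should work with matplotlib (a sketch, not the exact code I used for the figure):

import matplotlib.pyplot as plt

# one scatter series per class, so each grade level gets its own color
for label, group in train.groupby('class'):
    plt.scatter(group['shoe size'], group['height'], label=label)
plt.xlabel('shoe size')
plt.ylabel('height')
plt.legend()
plt.show()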

We have our data in a pandas data frame. We need it as a numpy array for the algorithm to take. I like to split it up into trainArr, which has the training features, and trainRes, which contains the correct output. I do the same thing to the test data, just for convenience.

from sklearn.neighbors import KNeighborsClassifier

cols = ['shoe size', 'height']
cols2 = ['class']
trainArr = train[cols].values   # training features as a numpy array
trainRes = train[cols2].values  # training labels

testArr = test[cols].values
testRes = test[cols2].values

# 3 neighbors, with closer neighbors weighted more heavily in the vote
knn = KNeighborsClassifier(n_neighbors=3, weights='distance')
knn.fit(trainArr, trainRes.ravel())  # ravel() flattens the label column to 1-D

output = knn.predict(testArr)

# or predict on a specific example!
print(knn.predict(testArr[0].reshape(1, -1)))  # predict() expects a 2-D array
print(testRes[0])                              # a match!

['seniors']
['seniors']

Let’s get a better overview to see how correct our classifier was.

correct = 0.0

for i in range(len(output)):
    if testRes[i][0] == output[i]:
        correct += 1

# fraction of test examples classified correctly
correct / len(output)

0.9933333333333333
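For what it's worth, scikit-learn can also compute this for us; a sketch of the equivalent calls:

from sklearn.metrics import accuracy_score

# mean accuracy on the test set, same number as the loop above
print(knn.score(testArr, testRes.ravel()))
print(accuracy_score(testRes.ravel(), output))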

Well, not perfect, but pretty good! There's one more thing I want to get across in this blog post.

A reality check about scikit-learn

Scikit-learn is amazing. We essentially create and train a classifier in two lines of code. Don’t let this make you lazy.

In a harder classification problem, even using scikit-learn, there are knobs to turn in the algorithm; in this case they would be k and the weighting scheme. Maybe we need to normalize our data, or maybe we can set some heuristic bounds using common sense.
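As a sketch of what turning those knobs could look like (not something I did for this post), GridSearchCV can try a few values of k and both weighting schemes with cross-validation:

from sklearn.model_selection import GridSearchCV

params = {'n_neighbors': [3, 5, 7, 9], 'weights': ['uniform', 'distance']}
search = GridSearchCV(KNeighborsClassifier(), params, cv=5)
search.fit(trainArr, trainRes.ravel())
print(search.best_params_)   # the combination that scored best in cross-validation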

We have to keep the bigger picture in mind, and not let even the simplest algorithms become a black box.
