Assumption:
Similar Inputs have similar outputs
Classification rule:
For a test input , assign the most common label amongst its k most similar training inputs
A binary classification example with . The green point in the center is the test sample . The labels of the 3 neighbors are (+1) and (-1) resulting in majority predicting (+1).
Formal (and borderline incomprehensible) definition of k-NN:
Test point:
Denote the set of the nearest neighbors of as . Formally is defined as s.t. and ,
(i.e. every point in but not in is at least as far away from as the furthest point in ).
We can then define the classifier as a function returning the most common label in :
where means to select the label of the highest occurrence.
(Hint: In case of a draw, a good solution is to return the result of -NN with smaller .)
Quiz#1: How does affect the classifier? What happens if ? What if ?
What distance function should we use?
The k-nearest neighbor classifier fundamentally relies on a distance metric. The better that metric reflects label similarity, the better the classified will be. The most common choice is the Minkowski distance Quiz#2: This distance definition is pretty general and contains many well-known distances as special cases. Can you identify the following candidates?
:
:
:
The NN classifier is still widely used today, but often with learned metrics. For more information on metric learning check out the Large Margin Nearest Neighbors (LMNN) algorithm to learn a pseudo-metric (nowadays also known as the triplet loss) or FaceNet for face verification.
Brief digression (Bayes optimal classifier)
Example: Assume (and this is almost never the case) you knew , then you would simply predict the most likely label.
Although the Bayes optimal classifier is as good as it gets, it still can make mistakes. It is always wrong if a sample does not have the most likely label. We can compute the probability of that happening precisely (which is exactly the error rate):
Assume for example an email can either be classified as spam or ham . For the same email the conditional class probabilities are:
In this case the Bayes optimal classifier would predict the label as it is most likely, and its error rate would be .
Why is the Bayes optimal classifier interesting, if it cannot be used in practice? The reason is that it provides a highly informative lower bound of the error rate. With the same feature representation no classifier can obtain a lower error. We will use this fact to analyze the error rate of the NN classifier.
Briefer digression: Best constant predictor
While we are on the topic, let us also introduce an upper bound on the error --- i.e. a classifier that we will (hopefully) always beat. That is the constant classifier, which essentially predicts always the same constant independent of any feature vectors. The best constant in classification is the most common label in the training set. Incidentally, that is also what the -NN classifier becomes if . In regression settings, or more generally, the best constant is the constant that minimizes the loss on the training set (e.g. for the squared loss it is the average label in the training set, for the absolute loss the median label). The best constant classifier is important for debugging purposes -- you should always be able to show that your classifier performs significantly better on the test set than the best constant.
1-NN Convergence Proof
Cover and Hart 1967[1]: As , the -NN error is no more than twice the error of the Bayes Optimal classifier.
(Similar guarantees hold for .)
small
large
Let be the nearest neighbor of our test point . As , ,
i.e. .
(This means the nearest neighbor is identical to .)
You return the label of .
What is the probability that this is not the label of ?
(This is the probability of drawing two different label of )
where the inequality follows from and . We also used that .
In the limit case, the test point and its nearest neighbor are identical.
There are exactly two cases when a misclassification can occur:
when the test point and its nearest neighbor have different labels.
The probability of this happening is the probability of the two red events:
Good news:
As , the -NN classifier is only a factor 2 worse than the best possible classifier.
Bad news: We are cursed!!
Curse of Dimensionality
Distances between points
The NN classifier makes the assumption that similar points share similar labels. Unfortunately, in high dimensional spaces, points that are drawn from a probability distribution, tend to never be close together. We can illustrate this on a simple example. We will draw points uniformly at random within the unit cube (illustrated in the figure) and we will investigate how much space the nearest neighbors of a test point inside this cube will take up.
Formally, imagine the unit cube . All training data is sampled uniformly
within this cube, i.e. , and we are considering the nearest neighbors of such a test point.
2
0.1
10
0.63
100
0.955
1000
0.9954
Let be the edge length of the smallest hyper-cube that contains all -nearest neighbor of a test point.
Then and .
If , how big is ?
So as almost the entire space is needed to find the -NN. This breaks down the -NN assumptions, because the -NN are not particularly closer (and therefore more similar) than any other data points in the training set. Why would the test point share the label with those -nearest neighbors, if they are not actually similar to it?
Figure demonstrating ``the curse of dimensionality''. The histogram plots show the distributions of all pairwise distances
between randomly distributed points within -dimensional unit squares. As the number of dimensions grows, all distances concentrate within a very small range.
One might think that one rescue could be to increase the number of training samples, , until the nearest neighbors are truly close to the test point. How many data points would we need such that becomes truly small?
Fix ,
which grows exponentially! For we would need far more data points than there are electrons in the universe...
Distances to hyperplanes
So the distance between two randomly drawn data points increases drastically with their dimensionality. How about the distance to a hyperplane?
Consider the following figure. There are two blue points and a red hyperplane. The left plot shows the scenario in 2d and the right plot in 3d. As long as , the distance between the two points is . When a third dimension is added, this extends to , which must be at least as large (and is probably larger). This confirms again that pairwise distances grow in high dimensions. On the other hand, the distance to the red hyperplane remains unchanged as the third dimension is added. The reason is that the normal of the hyper-plane is orthogonal to the new dimension. This is a crucial observation. In dimensions, dimensions will be orthogonal to the normal of any given hyper-plane. Movement in those dimensions cannot increase or decrease the distance to the hyperplane --- the points just shift around and remain at the same distance.
As distances between pairwise points become very large in high dimensional spaces, distances to hyperplanes become comparatively tiny. For machine learning algorithms, this is highly relevant. As we will see later on, many classifiers (e.g. the Perceptron or SVMs) place hyper planes between concentrations of different classes. One consequence of the curse of dimensionality is that most data points tend to be very close to these hyperplanes and it is often possible to perturb input slightly (and often imperceptibly) in order to change a classification outcome. This practice has recently become known as the creation of adversarial samples, whose existents is often falsely attributed to the complexity of neural networks.
The curse of dimensionality has different effects on distances between two points and distances between points and hyperplanes.
An animation illustrating the effect on randomly sampled data points in 2D, as a 3rd dimension is added (with random coordinates). As the points expand along the 3rd dimension they spread out and their pairwise distances increase. However, their distance to the hyper-plane (z=0.5) remains unchanged --- so in relative terms the distance from the data points to the hyper-plane shrinks compared to their respective nearest neighbors.
Data with low dimensional structure
However, not all is lost.
Data may lie in low dimensional subspace or on sub-manifolds. Example: natural images (digits, faces). Here, the true dimensionality of the data can be much lower than its ambient space. The next figure shows an example of a data set sampled from a 2-dimensional manifold (i.e. a surface in space), that is embedded within 3d. Human faces are a typical example of an intrinsically low dimensional data set. Although an image of a face may require 18M pixels, a person may be able to describe this person with less than 50 attributes (e.g. male/female, blond/dark hair, ...) along which faces vary.
An example of a data set in 3d that is drawn from an underlying 2-dimensional manifold. The blue points are confined to the pink surface area, which is embedded in a 3-dimensional ambient space.
k-NN summary
-NN is a simple and effective classifier if distances reliably reflect a semantically meaningful notion of the dissimilarity. (It becomes truly competitive through metric learning)
As , -NN becomes provably very accurate, but also very slow.
As , points drawn from a probability distribution stop being similar to each other, and the NN assumption breaks down.
Reference
[1]Cover, Thomas, and, Hart, Peter. Nearest neighbor pattern classification[J]. Information Theory, IEEE Transactions on, 1967, 13(1): 21-27