Scikit Learn - K-Nearest Neighbors (KNN)



This chapter will help you understand the nearest neighbor methods in Sklearn.

Neighbor-based learning methods come in two types, namely supervised and unsupervised. Supervised neighbors-based learning can be used for both classification and regression predictive problems, but in industry it is mainly used for classification.

Neighbors-based learning methods do not have a specialised training phase; they use all of the training data at classification time. They also do not assume anything about the underlying data. That is why they are described as lazy and non-parametric in nature.

The main principle behind nearest neighbor methods is −

  • Find a predefined number of training samples closest in distance to the new data point.

  • Predict the label from these training samples.

Here, the number of samples can either be a user-defined constant, as in K-nearest neighbor learning, or vary based on the local density of points, as in radius-based neighbor learning.
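The following minimal sketch contrasts these two flavours of supervised neighbor learning; the toy data is made up purely for illustration.

```python
# A minimal sketch contrasting fixed-k and radius-based neighbor learning;
# the toy data below is invented for illustration only.
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier

X = [[0], [1], [2], [3]]          # training samples (one feature)
y = [0, 0, 1, 1]                  # class labels

# K-nearest neighbors: the number of samples is a user-defined constant (k).
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[1.1]]))       # label predicted from the 3 closest samples

# Radius-based neighbors: the number of samples varies with local density.
rnn = RadiusNeighborsClassifier(radius=1.0).fit(X, y)
print(rnn.predict([[1.1]]))       # label predicted from samples within radius 1.0
```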

sklearn.neighbors Module

Scikit-learn has the sklearn.neighbors module, which provides functionality for both unsupervised and supervised neighbors-based learning methods. As input, the classes in this module can handle either NumPy arrays or scipy.sparse matrices.
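The short sketch below shows an unsupervised neighbor search with this module; the sample array is invented for illustration.

```python
# A brief sketch of unsupervised neighbor search with sklearn.neighbors;
# the sample array here is made up purely for demonstration.
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

X_dense = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

# Dense NumPy input
nn = NearestNeighbors(n_neighbors=2).fit(X_dense)
distances, indices = nn.kneighbors(X_dense)
print(indices)

# The same estimator also accepts scipy.sparse matrices as input
X_sparse = csr_matrix(X_dense)
nn_sparse = NearestNeighbors(n_neighbors=2).fit(X_sparse)
print(nn_sparse.kneighbors(X_sparse, return_distance=False))
```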

Types of algorithms

The different algorithms that can be used in the implementation of neighbor-based methods are as follows −

Brute Force

The brute-force computation of distances between all pairs of points in the dataset provides the most naïve neighbor search implementation. Mathematically, for N samples in D dimensions, the brute-force approach scales as O[DN²].

For small datasets, this algorithm can be very useful, but it becomes infeasible as the number of samples grows. Brute-force neighbor search can be enabled with the keyword argument algorithm='brute', as shown in the sketch below.
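```python
# A small sketch of enabling brute-force search explicitly;
# the random data is used only for demonstration.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.rand(100, 5)                 # N = 100 samples in D = 5 dimensions

nbrs = NearestNeighbors(n_neighbors=3, algorithm='brute').fit(X)
dist, ind = nbrs.kneighbors(X[:2])   # neighbors of the first two points
print(ind)
```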

K-D Tree

The KD tree (K-dimensional tree) is one of the tree-based data structures invented to address the computational inefficiencies of the brute-force approach. Basically, the KD tree is a binary tree structure that recursively partitions the parameter space along the data axes, dividing it into nested axis-aligned regions into which the data points are filed.

Advantages

Following are some advantages of K-D tree algorithm −

Construction is fast − As the partitioning is performed only along the data axes, K-D tree’s construction is very fast.

Fewer distance computations − This algorithm needs very few distance computations to determine the nearest neighbor of a query point. It takes only O[log(N)] distance computations.

Disadvantages

Fast only for low-dimensional neighbor searches − It is very fast for low-dimensional (D < 20) neighbor searches, but as D grows it becomes inefficient, because the partitioning is performed only along the data axes.

K-D tree neighbor searches can be enabled with the keyword argument algorithm='kd_tree'.
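Below is a minimal sketch of a K-D tree backed search, using the KDTree class from sklearn.neighbors directly; low-dimensional random data is chosen because that is where the K-D tree shines.

```python
# A minimal sketch of a K-D tree backed nearest neighbor search;
# the random data is illustrative only.
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.RandomState(0)
X = rng.rand(1000, 3)               # D = 3, well below the ~20-dimension limit

tree = KDTree(X, leaf_size=30)      # construction partitions along the data axes
dist, ind = tree.query(X[:1], k=5)  # 5 nearest neighbors of the first point
print(ind)
```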

Ball Tree

As we know, the KD tree is inefficient in higher dimensions; the ball tree data structure was developed to address this inefficiency. Mathematically, it recursively divides the data into nodes defined by a centroid C and radius r, in such a way that each point in a node lies within the hyper-sphere defined by that centroid and radius. It uses the triangle inequality, given below, to reduce the number of candidate points for a neighbor search −

$$|X+Y| \leq |X| + |Y|$$

Advantages

Following are some advantages of Ball Tree algorithm −

Efficient on highly structured data − As the ball tree partitions the data into a series of nested hyper-spheres, it is efficient on highly structured data.

Out-performs KD tree − The ball tree out-performs the KD tree in high dimensions because of the spherical geometry of its nodes.

Disadvantages

Costly − Partitioning the data into a series of nested hyper-spheres makes its construction very costly.

Ball tree neighbor searches can be enabled with the keyword argument algorithm='ball_tree'.
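The following short sketch uses the BallTree class from sklearn.neighbors on higher-dimensional random data, which is only illustrative, to mirror the setting where the K-D tree starts to struggle.

```python
# A short sketch of a ball tree backed nearest neighbor search;
# the higher-dimensional random data below is illustrative only.
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.RandomState(0)
X = rng.rand(1000, 40)              # D = 40, where K-D trees start to struggle

tree = BallTree(X, leaf_size=30)    # nodes are hyper-spheres (centroid C, radius r)
dist, ind = tree.query(X[:1], k=5)  # 5 nearest neighbors of the first point
print(ind)
```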

Choosing Nearest Neighbors Algorithm

The choice of an optimal algorithm for a given dataset depends upon the following factors −

Number of samples (N) and Dimensionality (D)

These are the most important factors to consider while choosing a nearest neighbor algorithm, for the reasons given below −

  • The query time of Brute Force algorithm grows as O[DN].

  • The query time of Ball tree algorithm grows as O[D log(N)].

  • The query time of the KD tree algorithm changes with D in a way that is difficult to characterize. When D < 20, the cost is roughly O[D log(N)] and the algorithm is very efficient. On the other hand, when D > 20 it becomes inefficient because the cost increases to nearly O[DN].

Data Structure

Another factor that affects the performance of these algorithms is the intrinsic dimensionality of the data or its sparsity, because the query times of the ball tree and KD tree algorithms can be greatly influenced by it. In contrast, the query time of the brute-force algorithm is unchanged by data structure. Generally, the ball tree and KD tree algorithms produce faster query times on sparser data with smaller intrinsic dimensionality.

Number of Neighbors (k)

The number of neighbors (k) requested for a query point affects the query time of the ball tree and KD tree algorithms: their query time becomes slower as k increases. The query time of brute force, in contrast, remains largely unaffected by the value of k.

Number of query points

Because they require a construction phase, both the KD tree and ball tree algorithms are worthwhile when there is a large number of query points. On the other hand, when there are only a few query points, the brute-force algorithm performs better than the KD tree and ball tree algorithms.
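In practice, scikit-learn can weigh these factors itself via algorithm='auto', which is the default. The sketch below compares the keyword options with a crude timing loop; the data sizes are arbitrary and the numbers will vary by machine.

```python
# A rough sketch comparing the algorithm keywords, including 'auto';
# the dataset size is arbitrary and timings will differ across machines.
import time
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.rand(5000, 10)

for algo in ('brute', 'kd_tree', 'ball_tree', 'auto'):
    nbrs = NearestNeighbors(n_neighbors=5, algorithm=algo).fit(X)
    start = time.time()
    nbrs.kneighbors(X)              # query every training point
    print(algo, round(time.time() - start, 3), 'seconds')
```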
