
Mini Batch K-means clustering algorithm in Machine Learning
Introduction
Clustering is a technique for grouping data points into subgroups such that the points within each subgroup are similar to one another. It is an unsupervised learning technique, so there are no labels or ground truth. Mini Batch K-Means is a variant of the K-Means algorithm that trains on small batches sampled at random from the data held in memory.
In this article, let us understand Mini Batch K-Means in detail. Before moving on to Mini Batch K-Means, let us have a look at K-Means in general.
The K-Means clustering approach
K-Means is an iterative approach that tries to partition the data points into K separate, non-overlapping subgroups. The points within a cluster are as similar as possible, and the points in different clusters are as dissimilar as possible. The algorithm keeps the sum of intra-cluster distances between the points and the centroid of their cluster as small as possible, and the inter-cluster distances as large as possible. Each point belongs to exactly one cluster.
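To make the objective concrete, the short sketch below runs scikit-learn's standard KMeans on a small synthetic dataset; the dataset parameters are illustrative. The `inertia_` attribute is exactly the sum of squared intra-cluster distances that K-Means tries to minimize.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data: 300 points around 3 well-separated centers (illustrative values)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Each point gets exactly one cluster label; inertia_ is the sum of
# squared distances from each point to its assigned centroid
print("labels of first 5 points:", km.labels_[:5])
print("sum of intra-cluster squared distances:", km.inertia_)
```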
Mini Batch K-Means clustering
The idea behind Mini Batch K-Means is to process small, fixed-size batches of data points held in memory. In each iteration, a mini-batch is sampled at random from the dataset, and only the points in that mini-batch are used to update the cluster centroids. This avoids using the entire dataset at once, as the standard K-Means algorithm does, which sidesteps memory issues and lets the algorithm converge faster. Each centroid update is a convex combination of the old centroid and the newly assigned data, with a per-centroid learning rate that is inversely proportional to the number of points assigned to that centroid so far, and therefore decreases over iterations. As iterations accumulate, the effect of adding new data diminishes; convergence is detected when the centroids do not change between two consecutive iterations.
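The convex-combination update described above can be sketched in a few lines; the points and variable names here are illustrative, tracking a single centroid as batch points are assigned to it.

```python
import numpy as np

# One stream of mini-batch updates for a single centroid (illustrative sketch).
# count = number of points assigned to this centroid so far; the per-centroid
# learning rate eta = 1 / count shrinks as more data is assigned, so later
# points perturb the centroid less and less.
centroid = np.array([0.0, 0.0])
count = 0

batch = np.array([[2.0, 2.0], [4.0, 0.0]])  # points assigned to this centroid

for x in batch:
    count += 1
    eta = 1.0 / count                          # learning rate decreases with assignments
    centroid = (1 - eta) * centroid + eta * x  # convex combination of old centre and point

print(centroid)  # prints [3. 1.], the running mean of the assigned points
```

With this update, the centroid is always the running mean of every point ever assigned to it, computed one point at a time instead of from the full dataset.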
Working of Mini Batch K-Means clustering
1. The cluster centroids are randomly initialized.
2. A mini-batch of data is randomly selected from the original dataset.
3. Each data point in the mini-batch is assigned to the centroid closest to it.
4. The cluster centroids are updated using the points assigned from the mini-batch.
5. Steps 2 to 4 are repeated until the centroid positions no longer change.
6. The final clusters are obtained.
Python Implementation of Mini Batch K-Means
In the example below, we use Mini Batch K-Means clustering on 2000 data points. The initial cluster centers are defined, and the model is then trained on the data to find the final cluster centers and plot them.
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs as blobs
import matplotlib.pyplot as plt
import timeit as t

# Generate 2000 points around four predefined centers
c = [[50, 50], [1900, 0], [1900, 900], [0, 1900]]
data, data_labels = blobs(n_samples=2000, centers=c, cluster_std=200)

# Plot the points colored by their true cluster
color = ['pink', 'violet', 'green', 'blue']
for i in range(len(data)):
    plt.scatter(data[i][0], data[i][1], color=color[data_labels[i]], alpha=0.4)

# Fit Mini Batch K-Means with 4 clusters and a batch size of 40, timing the fit
k_means = MiniBatchKMeans(n_clusters=4, batch_size=40)
st = t.default_timer()
k_means.fit(data)
e = t.default_timer()

label_a = k_means.labels_
cnt = k_means.cluster_centers_
print("Time taken : ", e - st)

# Plot the points colored by their predicted cluster, with the centroids in black
for i in range(len(data)):
    plt.scatter(data[i][0], data[i][1], color=color[label_a[i]], alpha=0.4)
for i in range(len(cnt)):
    plt.scatter(cnt[i][0], cnt[i][1], color='black')
plt.show()
Output
Time taken : 0.01283279599999787
Advantages of Mini Batch K-Means
It can handle datasets too large to process all at once, unlike the standard K-Means algorithm.
It is computationally less expensive.
It converges faster.
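The speed advantage can be checked empirically. The sketch below times both algorithms on the same synthetic dataset; the dataset size and batch size are illustrative, and exact timings depend on hardware and library version.

```python
import timeit
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# A moderately large synthetic dataset (illustrative parameters)
X, _ = make_blobs(n_samples=20000, centers=4, cluster_std=1.0, random_state=0)

# Time standard K-Means, which passes over the full dataset each iteration
start = timeit.default_timer()
KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
t_full = timeit.default_timer() - start

# Time Mini Batch K-Means, which updates centroids from small random batches
start = timeit.default_timer()
MiniBatchKMeans(n_clusters=4, batch_size=256, n_init=10, random_state=0).fit(X)
t_mini = timeit.default_timer() - start

print(f"full K-Means: {t_full:.3f}s, mini-batch: {t_mini:.3f}s")
```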
Conclusion
Mini Batch K-Means is a newer approach that addresses some of the shortcomings of traditional K-Means: it uses less memory, it can handle large datasets, and it takes less time to converge.