General Context
Clustering
Unsupervised machine learning technique used to group similar data points into clusters of groups.
In clustering, the number of clusters or groups is not predefined by the data.
Possible usecases
- Data segmentation
- Anomaly detection
Centroid
Centroid
Centre point of a cluster.
For a cluster
and this centroid defines the cluster.
K-means clustering
Distance
The straight line Euclidean distance is defined
Machine-learning models that rely on distances between data points are called distance based models.
Algorithm
How to group data?
Beginning with the centroids,
- randomly put the centroids in the space
- assign the points to the nearest centroids
- recompute centroids based off the points in the item
- reassign to nearest centroids
Formalising this to the K-means algorithms, we get the steps:
- randomly initialise
centroids - repeat until convergence:
- for
, index of cluster centroid closest to - for
centroid of data points assigned to cluster
- for
Distortion
Average distance of sample to its centroid.
The distortion can be measured as:
Local Optima
K-Means can stumble in to a local optima - where the outcome given is a stable configuration, and the centroids will not move (convergence), but the clusters are not the best possible clusters.
## ConvergenceTheorem
Each step in the K-Means algorithm never increases distortion.
Picking Number of Clusters
Elbow Method
Effectively:
- calculate the distortion from calculating multiple clusters
- choose the
where an elbow happens
If there are 3 size groups allowed, the clustering of sizes should aim to find 3 sizes.
Variants
Centroids from Points
### K-MedoidsPick the centroids from the points
Instead of picking the centroids from the entire space, the centroids are picked from the points.
K-Medoids
Pick data points closest to centroids, and use them as centroids