General Context

Clustering

Unsupervised machine learning technique used to group similar data points into clusters of groups.

In clustering, the number of clusters or groups is not predefined by the data.

Possible usecases

  • Data segmentation
  • Anomaly detection

Centroid

Centroid

Centre point of a cluster.

For a cluster with data points: the centroid is defined as the average of the set of points in the cluster,

and this centroid defines the cluster.

K-means clustering

Distance

The straight line Euclidean distance is defined

Machine-learning models that rely on distances between data points are called distance based models.

Algorithm

How to group data?

Beginning with the centroids,

  • randomly put the centroids in the space
  • assign the points to the nearest centroids
  • recompute centroids based off the points in the item
  • reassign to nearest centroids

Formalising this to the K-means algorithms, we get the steps:

  • randomly initialise centroids
  • repeat until convergence:
    • for , index of cluster centroid closest to
    • for centroid of data points assigned to cluster
Step 1Randomly initialise centroidsRepeat until convergenceStep 2a)Pick the clusters for each pointStep 2b)Reassign the centroid to the centerCheck if the points are clusteredto the closest centroid afterupdating the centroid.Check if the centroid chosen istruly in the center of its set ofpointsIf both satisfiedCONVERGENCEK-MeansAlgorithm ## Measure Quality of Clusters

Distortion

Average distance of sample to its centroid.

The distortion can be measured as:

Local Optima

K-Means can stumble in to a local optima - where the outcome given is a stable configuration, and the centroids will not move (convergence), but the clusters are not the best possible clusters.

Solution 1Solution 2Lower distortionHigher distortionK-means algorithmcan return bothresults. (Can get stuck inlocal optima) ## Convergence

Theorem

Each step in the K-Means algorithm never increases distortion.

Picking Number of Clusters

Elbow Method

Effectively:

  • calculate the distortion from calculating multiple clusters
  • choose the where an elbow happens
DistortionNo. of Clusterselbowpicked clusternumberElbow Method ### Business Needs

If there are 3 size groups allowed, the clustering of sizes should aim to find 3 sizes.

Variants

Centroids from Points

Pick the centroids from the points

Instead of picking the centroids from the entire space, the centroids are picked from the points.

DataSelect centroids from pointsChoose randomcentroids ### K-Medoids

K-Medoids

Pick data points closest to centroids, and use them as centroids

K-MeansFrom centroidK-MedoidsSnap to pointsSnap points