General Context

Clustering

Unsupervised machine learning technique used to group similar data points into clusters of groups.

In clustering, the number of clusters or groups is not predefined by the data.

Possible usecases

Data segmentation

Anomaly detection

Centroid

Centroid

Centre point of a cluster.

For a cluster with data points: the centroid is defined as the average of the set of points in the cluster,

and this centroid defines the cluster.

K-means clustering

Distance

The straight line Euclidean distance is defined

Machine-learning models that rely on distances between data points are called distance based models.

Algorithm

How to group data?

Beginning with the centroids,

randomly put the centroids in the space
assign the points to the nearest centroids
recompute centroids based off the points in the item
reassign to nearest centroids

Formalising this to the K-means algorithms, we get the steps:

randomly initialise centroids
repeat until convergence:
- for , index of cluster centroid closest to
- for centroid of data points assigned to cluster

Centroids

Measure Quality of Clusters

Distortion

Average distance of sample to its centroid.

The distortion can be measured as:

Local Optima

K-Means can stumble in to a local optima - where the outcome given is a stable configuration, and the centroids will not move (convergence), but the clusters are not the best possible clusters.

## Convergence

Theorem

Each step in the K-Means algorithm never increases distortion.

Picking Number of Clusters

Elbow Method

Effectively:

calculate the distortion from calculating multiple clusters
choose the where an elbow happens

### Business Needs

If there are 3 size groups allowed, the clustering of sizes should aim to find 3 sizes.

Variants

Centroids from Points

Pick the centroids from the points

Instead of picking the centroids from the entire space, the centroids are picked from the points.

### K-Medoids

K-Medoids

Pick data points closest to centroids, and use them as centroids

Explorer

Clustering

General Context

Centroid

K-means clustering

Distance

Algorithm

Centroids

Measure Quality of Clusters

Local Optima

Picking Number of Clusters

Elbow Method

Variants

Centroids from Points

Graph View

Table of Contents

Backlinks