Reduce the number of features (variables/dimensions) in a dataset while retaining as much of the relevant information as possible.
Possible use cases:
- Visualisation
- Feature extraction/transformation
## Curse of Dimensionality
### Sample complexity
The number of samples needed to learn a hypothesis class grows exponentially with the number of features $D$ (e.g. on the order of $c^{D}$ for some constant $c > 1$).
Thus, if a machine learning model takes a high-dimensional input, it will suffer from the curse of dimensionality.
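As a rough, hypothetical illustration of this exponential growth (the grid-covering argument and the constant $k = 10$ are assumptions, not from the notes): covering each of $d$ feature axes with $k$ distinct values already requires $k^{d}$ cells.

```python
# Hypothetical illustration of exponential growth with dimension d:
# covering each of d feature axes with k = 10 values needs k**d grid cells,
# and hence (loosely) a comparable number of samples to "fill" the space.
k = 10
for d in [1, 2, 3, 10, 100]:
    print(f"d = {d:>3}: {k**d:.3e} cells to cover the feature space")
```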
## Key Idea
A key idea is to remove redundant features.
Redundant features are features that are not important for distinguishing between data points.
Even when the given data has no obviously redundant features, we can change the basis so that some of the new dimensions become redundant and can be removed.
## Identifying the basis
Given the transposed data matrix $X^{T} \in \mathbb{R}^{D \times N}$ (each column is a data point $x^{(i)} \in \mathbb{R}^{D}$),
we can decompose it into 3 matrices, $U$, $\Sigma$ and $V^{T}$, with Singular Value Decomposition:
$$X^{T} = U \Sigma V^{T}$$
### Basic concepts
Orthonormal
Two vectors $u$ and $v$ are orthonormal if they are:
- normalised: $\|u\| = \|v\| = 1$
- orthogonal: $u^{T} v = 0$
Orthonormal basis
A set of orthonormal vectors $\{u^{(1)}, \dots, u^{(D)}\}$ such that every vector $x \in \mathbb{R}^{D}$ can be expressed as a linear combination of the set.
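A minimal numpy check of these definitions (the vectors $u$, $v$ and the test point $x$ below are illustrative choices, not from the notes):

```python
import numpy as np

# Two illustrative vectors in R^2.
u = np.array([1.0, 1.0]) / np.sqrt(2)
v = np.array([1.0, -1.0]) / np.sqrt(2)

# Normalised: each vector has unit length.
print(np.isclose(np.linalg.norm(u), 1.0), np.isclose(np.linalg.norm(v), 1.0))

# Orthogonal: their dot product is zero.
print(np.isclose(u @ v, 0.0))

# {u, v} is then an orthonormal basis of R^2: any x is a linear combination
# of u and v with coefficients (u . x) and (v . x).
x = np.array([3.0, -2.0])
print(np.allclose((u @ x) * u + (v @ x) * v, x))
```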
### Existence of SVD
WLOG, let $D \le N$.
For any real-valued matrix $A \in \mathbb{R}^{D \times N}$, there exists a factorisation
$$A = U \Sigma V^{T}$$
called the SVD, such that:
- $U$ is $D \times D$ with orthonormal columns (left singular vectors)
- $\Sigma$ is $D \times D$ and is a diagonal matrix with $\sigma_{1} \ge \sigma_{2} \ge \dots \ge \sigma_{D} \ge 0$ (singular values)
- $V$ is $N \times D$ with orthonormal columns (right singular vectors)
Thus, using SVD:
$$X^{T} = U \Sigma V^{T}$$
A simplified interpretation, with a data matrix $X \in \mathbb{R}^{N \times D}$ whose rows are the data points $x^{(i)}$:
Using SVD:
$$x^{(i)} = U \Sigma v^{(i)} = \sum_{j=1}^{D} \sigma_{j} V_{ij}\, u^{(j)}$$
where $v^{(i)} \in \mathbb{R}^{D}$ is the $i$-th row of $V$.
Effectively, this means that every data point $x^{(i)}$ is a linear combination of the basis vectors $u^{(1)}, \dots, u^{(D)}$ (the columns of $U$).
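A minimal numpy sketch of this decomposition (the random data, shapes, and variable names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 5                       # N samples, D features, D <= N
X = rng.normal(size=(N, D))         # data matrix X, rows are data points x^(i)

# Thin SVD of the transposed data matrix: X^T = U Sigma V^T.
U, S, Vt = np.linalg.svd(X.T, full_matrices=False)
print(U.shape, S.shape, Vt.shape)   # (D, D), (D,), (D, N)

# Columns of U are orthonormal (left singular vectors).
print(np.allclose(U.T @ U, np.eye(D)))

# Each data point x^(i) is a linear combination of the columns of U,
# with coefficients sigma_j * V_ij.
i = 0
print(np.allclose(U @ (S * Vt[:, i]), X[i]))
```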
## Dimensionality reduction via SVD
The singular values tell us about the importance of the new basis vectors: the smaller the singular value $\sigma_{j}$, the less important the basis vector $u^{(j)}$.
From $\Sigma$, we can read off the singular values $\sigma_{1}, \dots, \sigma_{D}$.
Recall that the singular values are ordered in size: $\sigma_{1} \ge \sigma_{2} \ge \dots \ge \sigma_{D} \ge 0$.
Thus, we can reduce the dimensions by setting the smallest singular values to $0$:
$$\sigma_{r+1} = \sigma_{r+2} = \dots = \sigma_{D} = 0$$
where, effectively, we set all singular values after the new dimension $r$ to $0$.
We can then take this new basis (with the removed dimensions dropped) of size $D \times r$:
$$\tilde{U} = \begin{bmatrix} u^{(1)} & \dots & u^{(r)} \end{bmatrix}$$
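A short sketch of this truncation step, continuing the same illustrative setup:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, r = 100, 5, 2                 # keep r of the D dimensions
X = rng.normal(size=(N, D))

U, S, Vt = np.linalg.svd(X.T, full_matrices=False)

# Setting sigma_{r+1}, ..., sigma_D to 0 amounts to keeping only the first r
# (most important) basis vectors.
S_trunc = np.concatenate([S[:r], np.zeros(D - r)])
U_tilde = U[:, :r]                  # the new D x r basis
print(S_trunc)
print(U_tilde.shape)                # (5, 2)
```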
## Compressing vectors
Using the newly found basis $\tilde{U} \in \mathbb{R}^{D \times r}$, we can compress the data matrix $X^{T} \in \mathbb{R}^{D \times N}$:
$$Z = \tilde{U}^{T} X^{T}$$
getting a matrix $Z \in \mathbb{R}^{r \times N}$.
The matrix $\tilde{U}$ defines an $r$-dimensional subspace of the feature space, and for any feature vector $x \in \mathbb{R}^{D}$, multiplying with $\tilde{U}^{T}$ obtains the linear combination coefficients $z = \tilde{U}^{T} x$ in the $r$-dimensional subspace.
For example, if the new dimension is $r = 2$, then given a feature vector $x \in \mathbb{R}^{D}$,
$$z = \tilde{U}^{T} x = \begin{bmatrix} u^{(1)\,T} x \\ u^{(2)\,T} x \end{bmatrix} \in \mathbb{R}^{2}$$
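A minimal sketch of the compression step (same illustrative data as above):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, r = 100, 5, 2
X = rng.normal(size=(N, D))

U, S, Vt = np.linalg.svd(X.T, full_matrices=False)
U_tilde = U[:, :r]                  # basis of the r-dimensional subspace

# Compress the whole data matrix: Z = U_tilde^T X^T is r x N.
Z = U_tilde.T @ X.T
print(Z.shape)                      # (2, 100)

# For a single feature vector x, the coefficients in the subspace are
# z = U_tilde^T x, i.e. (u^(1) . x, u^(2) . x) when r = 2.
x = X[0]
z = U_tilde.T @ x
print(np.allclose(z, Z[:, 0]))
```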
## Reconstructing back to the original matrix
For a fixed $r$, it holds that $\tilde{U}\tilde{U}^{T} \approx \mathbb{I}$, with approximation error dependent on $r$.
The data can be reconstructed in the original dimension $D$:
$$\tilde{x} = \tilde{U} z = \tilde{U}\tilde{U}^{T} x \approx x$$
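A sketch of the reconstruction and its error (same illustrative setup; the Frobenius-norm check at the end is an added observation, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, r = 100, 5, 2
X = rng.normal(size=(N, D))

U, S, Vt = np.linalg.svd(X.T, full_matrices=False)
U_tilde = U[:, :r]

# Compress, then map back to the original D-dimensional space.
Z = U_tilde.T @ X.T                 # r x N
X_tilde = (U_tilde @ Z).T           # N x D reconstruction

# U_tilde U_tilde^T is only a projection (approximately the identity), so the
# reconstruction error equals the discarded singular values in Frobenius norm.
err = np.linalg.norm(X - X_tilde)
print(np.isclose(err, np.sqrt(np.sum(S[r:] ** 2))))   # True
```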
## Mean-centred data
To get mean-centred data:
Step 1: Compute the mean feature vector over the $N$ samples: $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x^{(i)}$
Step 2: Compute the mean-centred data points: $\hat{x}^{(i)} = x^{(i)} - \bar{x}$
Step 3: Use the mean-centred data matrix $\hat{X}^{T} = \begin{bmatrix} \hat{x}^{(1)} & \dots & \hat{x}^{(N)} \end{bmatrix}$ in place of $X^{T}$
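The three steps as a numpy sketch (random data with a non-zero mean, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 5
X = rng.normal(loc=3.0, size=(N, D))    # data with a non-zero mean

# Step 1: mean feature vector over the N samples.
x_bar = X.mean(axis=0)                  # shape (D,)

# Step 2: mean-centred data points x_hat^(i) = x^(i) - x_bar.
X_hat = X - x_bar                       # shape (N, D)

# Step 3: use the mean-centred data matrix X_hat^T in place of X^T.
U, S, Vt = np.linalg.svd(X_hat.T, full_matrices=False)
print(np.allclose(X_hat.mean(axis=0), 0.0))   # features now have zero mean
```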
Given the mean-centred data $\hat{X}^{T}$, the values $\frac{\sigma^{2}_{j}}{N-1}$ are the variances of the data in the basis defined by the vectors $u^{(j)}$.
Thus, given a threshold of variance $\theta \in [0, 1]$, choose a minimum $r$ such that
$$\frac{\sum_{j=1}^{r} \sigma_{j}^{2}}{\sum_{j=1}^{D} \sigma_{j}^{2}} \ge \theta$$
With the $r$ found, the reconstructed data points $\tilde{x}^{(i)}$ are close to the original data points $\hat{x}^{(i)}$.
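A sketch of this selection rule, assuming the threshold is read as the fraction of total variance to retain (the symbol $\theta$ and that reading are assumptions made here):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 5
X = rng.normal(size=(N, D))
X_hat = X - X.mean(axis=0)              # mean-centred data

U, S, Vt = np.linalg.svd(X_hat.T, full_matrices=False)

# Variances of the data along each basis vector u^(j).
variances = S**2 / (N - 1)

# Choose the smallest r whose cumulative share of the total variance reaches
# the threshold theta (a hypothetical value here).
theta = 0.9
explained = np.cumsum(variances) / variances.sum()
r = int(np.argmax(explained >= theta)) + 1
print(explained, r)
```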
## Principal Component Analysis
Capture the components that maximise the statistical variation of the data.
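Putting the pieces together, a minimal PCA sketch via mean-centring plus SVD, with an optional cross-check against scikit-learn's `PCA` (the use of scikit-learn and all variable names here are assumptions, not from the notes):

```python
import numpy as np
from sklearn.decomposition import PCA   # optional cross-check only

rng = np.random.default_rng(0)
N, D, r = 100, 5, 2
X = rng.normal(size=(N, D))

# PCA via SVD of the mean-centred data: the columns of U are the principal
# components, i.e. the directions of maximal variance.
X_hat = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_hat.T, full_matrices=False)

U_tilde = U[:, :r]                      # top-r principal components
Z = U_tilde.T @ X_hat.T                 # r x N compressed representation
variances = S**2 / (N - 1)              # variance captured by each component
print(Z.shape, variances[:r])

# scikit-learn's PCA reports the same per-component variances (component signs
# may differ, but variances are sign-free).
pca = PCA(n_components=r).fit(X)
print(np.allclose(pca.explained_variance_, variances[:r]))
```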