Principal components analysis (PCA) is a popular approach for deriving a low-dimensional set of features from a large set of variables. It is a tool for unsupervised learning and is often used as a dimension reduction technique for regression problems to tackle the curse of dimensionality in datasets.
The Curse of Dimensionality
The dimensionality of a dataset is the number of attributes or features present in the dataset. As the dimensionality of the problem increases, the probability of adding
noise features that are not truly associated with the response increases, leading to a deterioration in the fitted model, and consequently an increased test
set error. Thus, higher dimensionality of the dataset exacerbates the risk of overfitting. Even when the added features are relevant, the variance incurred
in fitting their coefficients may outweigh the reduction in bias that they
bring. The curse of dimensionality is therefore closely tied to the bias-variance trade-off and the danger of overfitting.
Plotting the same datapoints in 1-D, 2-D and 3-D feature spaces illustrates this: each added feature increases the dimensionality of the dataset, so an exponentially larger number of training datapoints is needed to cover the feature space with the same density.
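To see this effect concretely, the following minimal sketch (synthetic data, assuming NumPy and scikit-learn are installed; all names here are illustrative) keeps the number of observations fixed while appending pure-noise features to a linear regression problem and reports the resulting test-set error, which typically grows as noise features are added:

```python
# Sketch: test error of linear regression as pure-noise features are added.
# Synthetic data; assumes NumPy and scikit-learn are installed.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n = 60                                   # number of observations (fixed)
X_signal = rng.normal(size=(n, 2))       # two truly relevant features
y = 3 * X_signal[:, 0] - 2 * X_signal[:, 1] + rng.normal(scale=0.5, size=n)

for n_noise in [0, 5, 20, 40]:
    # Append n_noise features that are unrelated to the response.
    X = np.hstack([X_signal, rng.normal(size=(n, n_noise))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    mse = mean_squared_error(y_te, model.predict(X_te))
    print(f"noise features: {n_noise:2d}  test MSE: {mse:.3f}")
```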
Principal Components
When faced with a large set of correlated variables, principal components allow us to summarize this set with
a smaller number of representative variables that collectively explain most
of the variability in the original set. The principal component directions
are the directions in feature space along which
the original data is highly variable. These directions also define lines and
subspaces that are as close as possible to the data cloud. The first principal component direction of the data is that along which the observations vary the most.
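As a quick illustration (a minimal sketch on synthetic 2-D data, assuming NumPy is available), the first principal component direction can be estimated from the SVD of the centred data, and the variance of the projections onto it exceeds the variance along either original axis:

```python
# Sketch: the first principal component direction maximises the variance of
# the projected data. Synthetic 2-D example; NumPy assumed.
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.5], [1.5, 1.0]], size=500)

Xc = X - X.mean(axis=0)                  # centre the data
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
d1 = Vt[0]                               # first principal component direction (unit vector)

print("variance along first PC direction:", np.var(Xc @ d1, ddof=1))
print("variance along the x-axis        :", np.var(Xc[:, 0], ddof=1))
print("variance along the y-axis        :", np.var(Xc[:, 1], ddof=1))
```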
Principal Component Analysis
Most traditional statistical techniques for regression and classification are intended for the low-dimensional setting, and non-parametric approaches often perform poorly when the number of features is large. PCA reduces the dimension of an n × p data matrix X, where n is the number of observations and p is the number of features.
Steps to Perform Principal Component Analysis with the Help of an Example :
Features 🠋 / Datapoints ➔ | \(P_1\) | \(P_2\) | \(P_3\) | \(P_4\) |
\(F_1\) | 3 | 4 | 16 | 8 |
\(F_2\) | 5 | 10 | 15 | 3 |
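The hand calculation in the following steps can be mirrored in code. As a minimal sketch (assuming NumPy is available), the example can be stored as a 4 × 2 data matrix with one row per datapoint and one column per feature:

```python
# The example dataset: 4 datapoints (rows P1..P4) and 2 features F1, F2 (columns).
import numpy as np

X = np.array([[3.0,  5.0],    # P1
              [4.0, 10.0],    # P2
              [16.0, 15.0],   # P3
              [8.0,  3.0]])   # P4
```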
Step 1: Find the mean of each feature over all the datapoints.
\begin{equation}Mean = \overline{F} = \frac{1}{n} \sum_{i=1}^{n} P_i\end{equation} where n is the total number of datapoints.
Here, for feature \(F_1\) and \(F_2\) given 4 datapoints, \begin{equation} \overline{F_1} = \frac{3+4+16+8}{4} =\frac{31}{4} = 7.75\end{equation} \begin{equation} \overline{F_2} = \frac{5+10+15+3}{4} =\frac{33}{4} = 8.25\end{equation}
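A quick numerical check of Step 1 (a sketch assuming NumPy; X is the same 4 × 2 matrix as above):

```python
# Check Step 1: the feature means should come out as 7.75 and 8.25.
import numpy as np

X = np.array([[3, 5], [4, 10], [16, 15], [8, 3]], dtype=float)  # rows = P1..P4
print(X.mean(axis=0))   # [7.75 8.25]
```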
Step 2: Calculate the Covariance Matrix of the dataset:
Covariance is a measure of the relationship between two random variables and evaluates the joint variability of the data.
The covariance matrix of a dataset with p features \(X_1, X_2, \dots , X_p\) is a p × p symmetric, positive semi-definite matrix given by :\begin{bmatrix}
Cov(X_1 ,X_1) & Cov(X_1 ,X_2) & \cdots & Cov(X_1 ,X_p)\\
Cov(X_2 ,X_1)& Cov(X_2, X_2) & \cdots & Cov(X_2, X_p)\\
\vdots & \vdots & \ddots & \vdots\\
Cov(X_p,X_1)& Cov(X_p,X_2) & \cdots & Cov(X_p,X_p)
\end{bmatrix}
Each entry of the covariance matrix is the covariance between two random variables X and Y, defined as:
\begin{equation} Cov(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]\end{equation} where \(X - \mu_X\) and \(Y - \mu_Y\) are the deviations of the two variables from their respective mean values and the covariance is the expected product of deviations.
For the given example, the covariance matrix of the features is :
\begin{bmatrix}
Cov(F_1,F_1) & Cov(F_1,F_2)\\
Cov(F_2,F_1) & Cov(F_2,F_2)
\end{bmatrix}
This equals : \begin{bmatrix} \frac{1}{n-1}\sum_{k=1}^{n}(F_{1k}-\overline{F_1})(F_{1k}-\overline{F_1}) & \frac{1}{n-1}\sum_{k=1}^{n}(F_{1k}-\overline{F_1})(F_{2k}-\overline{F_2})\\ \frac{1}{n-1}\sum_{k=1}^{n}(F_{2k}-\overline{F_2})(F_{1k}-\overline{F_1})& \frac{1}{n-1}\sum_{k=1}^{n}(F_{2k}-\overline{F_2})(F_{2k}-\overline{F_2}) \end{bmatrix} where the sums run over the n = 4 datapoints and \(F_{jk}\) is the value of feature j for datapoint \(P_k\).
Calculating the covariances: \begin{equation}\begin{aligned}Cov(X_1 ,X_2) =\frac{1}{3}((3-7.75)(3-7.75)+(4-7.75)(4-7.75)+\\(16-7.75)(16-7.75)+(8-7.75)(8-7.75)) = 34.92\end{aligned}\end{equation}
\begin{equation}\begin{aligned}Cov(X_1 ,X_2) =\frac{1}{3}((3-7.75)(5-8.25)+(4-7.75)(10-8.25)+\\(16-7.75)(15-8.25)+(8-7.75)(3-8.25))=21.125\end{aligned}\end{equation}
\begin{equation}\begin{aligned}Cov(X_1 ,X_2) =\frac{1}{3}((3-7.75)(5-8.25)+(4-7.75)(10-8.25)+\\(16-7.75)(15-8.25)+(8-7.75)(3-8.25))=21.125\end{aligned}\end{equation}
\begin{equation}\begin{aligned}Cov(X_1 ,X_2) =\frac{1}{3}((5-8.25)(5-8.25)+(10-8.25)(10-8.25)+\\(15-8.25)(15-8.25)+(3-8.25)(3-8.25))=28.92\end{aligned}\end{equation}
Therefore, the covariance matrix (call it S) is : \( S=\begin{bmatrix} 34.92 & 21.08\\ 21.08 & 28.92 \end{bmatrix}\)
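This matches what NumPy's np.cov produces with the same 1/(n−1) convention; a small check (features are columns of X, hence rowvar=False):

```python
# Check Step 2: covariance matrix with the 1/(n-1) convention.
# Expect approximately [[34.92, 21.08], [21.08, 28.92]].
import numpy as np

X = np.array([[3, 5], [4, 10], [16, 15], [8, 3]], dtype=float)
S = np.cov(X, rowvar=False, ddof=1)
print(S.round(2))
```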
We can see that the covariance matrix calculated in this step is always a symmetric matrix. One of the properties of symmetric matrices is that they can undergo a spectral decomposition as follows :
\begin{equation}S = ODO^{T}\end{equation} where O is an orthogonal matrix whose columns are the unit eigenvectors of S and D is a diagonal matrix of the corresponding eigenvalues.
In essence, carrying out this spectral decomposition of the covariance matrix S is what principal components analysis (PCA) does.
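For a symmetric matrix, np.linalg.eigh returns exactly these pieces; a quick sketch (NumPy assumed) verifying that \(ODO^{T}\) reconstructs the hand-rounded S:

```python
# Spectral decomposition S = O D O^T of the (hand-rounded) covariance matrix.
import numpy as np

S = np.array([[34.92, 21.08], [21.08, 28.92]])
eigvals, O = np.linalg.eigh(S)       # columns of O are orthonormal eigenvectors
D = np.diag(eigvals)
print(np.allclose(O @ D @ O.T, S))   # True: O D O^T reconstructs S
```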
Step 3: Find the Eigenvalues of the covariance matrix :
To find the eigenvalues, we have to solve for \(\lambda\) in the equation :
\(\det(S - \lambda I) = 0\)
For our problem,\begin{equation}\begin{vmatrix}
34.92-\lambda & 21.08\\
21.08 & 28.92-\lambda
\end{vmatrix}=0\end{equation}
\begin{equation}\Rightarrow
(34.92-\lambda)(28.92-\lambda)-(21.08)^{2}=0\end{equation}
\begin{equation}\Rightarrow
\lambda=53.21 ,\ 10.63\end{equation}
Therefore, ordering the eigenvalues from largest to smallest, the two eigenvalues are denoted by:
\(\lambda_1=53.21 ; \lambda_2=10.63\)
The first principal component corresponds to the largest eigenvalue \(\lambda_1\), since its eigenvector is the direction along which the data varies the most.
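The same values drop out of the characteristic polynomial \(\lambda^{2} - tr(S)\,\lambda + \det(S) = 0\) of the 2 × 2 matrix; a small numerical check (NumPy assumed, using the hand-rounded S):

```python
# Check Step 3: eigenvalues from the characteristic polynomial
# lambda^2 - trace(S)*lambda + det(S) = 0. Expect roughly 53.21 and 10.63.
import numpy as np

S = np.array([[34.92, 21.08], [21.08, 28.92]])
print(np.roots([1.0, -np.trace(S), np.linalg.det(S)]))
print(np.linalg.eigvalsh(S))   # same values from the symmetric eigensolver
```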
Step 4: Find the corresponding Unit Eigenvectors for each of the eigenvalues.
The eigenvectors for the corresponding eigenvalues are given by the equation :
\(Sv = \lambda v ,\) where v is the eigenvector.
\(\therefore\) For our example, to find the corresponding eigenvectors, we take v as a 2 × 1 vector : \(v = \begin{bmatrix}x_1\\x_2 \end{bmatrix}\).
\begin{equation}\begin{bmatrix} 34.92 & 21.08\\ 21.08 & 28.92 \end{bmatrix}\begin{bmatrix}x_1\\x_2 \end{bmatrix} = \lambda \begin{bmatrix}x_1\\x_2 \end{bmatrix} \end{equation}
For \(\lambda_1=53.21\) , the eigenvalue of the first principal component,
\(34.92x_1 + 21.08x_2 = 53.21 x_1\)
=> \(21.08x_2 = 18.29x_1\)
=> \(\frac{x_1}{x_2} = \frac{21.08}{18.29}\)
\(\therefore v_1 = \begin{bmatrix}x_1\\x_2 \end{bmatrix} = \begin{bmatrix}21.08\\18.29 \end{bmatrix}\)
To find the unit eigenvector, we find the length of \(v_1\), given by :
\(\left \| v_1 \right \| = \sqrt{(21.08)^{2}+(18.29)^{2}}= 27.91\)
Therefore, the unit eigenvector for \(\lambda_1\) is :
\(e_1 = \begin{bmatrix}\frac{21.08}{27.91}\\\frac{18.29}{27.91} \end{bmatrix} = \begin{bmatrix}0.755\\0.655 \end{bmatrix}\)
In a similar manner, we can find the unit eigenvector for the second eigenvalue \(\lambda_2 = 10.63\); it comes out to approximately \(\begin{bmatrix}-0.655\\0.755 \end{bmatrix}\), which is orthogonal to \(e_1\).
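These unit eigenvectors can be checked with np.linalg.eigh (NumPy assumed, hand-rounded S); note that eigenvectors are only determined up to sign, so the solver may return the negative of a hand-derived vector:

```python
# Check Step 4: unit eigenvectors of S. eigh orders eigenvalues ascending, so the
# last column pairs with the largest eigenvalue (about 53.21). Eigenvectors are
# defined only up to sign, so the solver may return -1 times the hand-derived vector.
import numpy as np

S = np.array([[34.92, 21.08], [21.08, 28.92]])
eigvals, eigvecs = np.linalg.eigh(S)
e1 = eigvecs[:, -1]                    # direction of the first principal component
print(eigvals.round(2))                # [10.63 53.21]
print(e1.round(3))                     # +/- [0.755 0.655]
print(np.linalg.norm(e1).round(3))     # 1.0 : it is a unit vector
```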
Step 5: Compute the Principal Components.
Principal component score of datapoint \(P_k\) : \( C_{P_k} = e^{T}\begin{bmatrix}x_{1k}-\overline{x_1} \\x_{2k}- \overline{x_2} \\ \vdots \\x_{nk}- \overline{x_n} \end{bmatrix}\), where n is the number of features, \(x_{jk}\) is the value of feature j for datapoint \(P_k\), and e is the unit eigenvector of the chosen component.
For our example, we calculate the first principal component score for each datapoint using the unit eigenvector \(e_1\) already calculated.
\(\therefore\) The first principal component score for the datapoint \(P_1\) in our dataset can be calculated as :
\(C_{P_1} = e_1^{T}\begin{bmatrix}3 - 7.75 \\5 - 8.25 \end{bmatrix}\)
= \(\begin{bmatrix}0.755 & 0.655 \end{bmatrix} \begin{bmatrix}-4.75 \\-3.25 \end{bmatrix} = -5.715\)
Calculating the first principal component for all the datapoints, we get :
Features 🠋 / Datapoints ➔ | \(P_1\) | \(P_2\) | \(P_3\) | \(P_4\) |
\(F_1\) | 3 | 4 | 16 | 8 |
\(F_2\) | 5 | 10 | 15 | 3 |
First Principal Component (C) | -5.715 | -1.685 | 10.65 | -3.25 |
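The last row of the table is simply the centred data projected onto \(e_1\); a short sketch of that projection (NumPy assumed):

```python
# Check Step 5: project the centred data onto e1 to get the first principal
# component score of every datapoint. Expect approximately
# [-5.715, -1.685, 10.65, -3.25] for P1..P4.
import numpy as np

X  = np.array([[3, 5], [4, 10], [16, 15], [8, 3]], dtype=float)
e1 = np.array([0.755, 0.655])          # unit eigenvector of the largest eigenvalue
scores = (X - X.mean(axis=0)) @ e1
print(scores.round(4))                 # [-5.715  -1.685  10.65  -3.25]
```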
Conclusion :
PCA is a dimensionality reduction algorithm. Mapping data to a lower-dimensional space is called dimensionality reduction. PCA transforms the data to a new coordinate system such that the greatest variance of any scalar projection of the data lies along the first coordinate, which forms the first principal component.
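To tie the steps together, here is a minimal end-to-end sketch using scikit-learn's PCA on the same toy dataset (assuming scikit-learn is installed). scikit-learn orders components by decreasing explained variance and fixes signs by its own convention, so the scores may be flipped in sign relative to the hand calculation:

```python
# End-to-end sketch: PCA on the toy dataset with scikit-learn.
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[3, 5], [4, 10], [16, 15], [8, 3]], dtype=float)

pca = PCA(n_components=1)
scores = pca.fit_transform(X)          # first principal component score per datapoint
print(pca.components_)                 # +/- [0.755, 0.655] up to rounding
print(scores.ravel())                  # +/- [-5.72, -1.69, 10.66, -3.25] approximately
print(pca.explained_variance_)         # approx 53.2, the largest eigenvalue of S
```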