All of this is covered by PCA

When talking about PCA, there are a few punch lines that everyone who talks about it drops all the time. Let's have a look at them without going too deep into detail, just to have heard them once and become somewhat familiar with them.

One of the most frequently heard buzzwords in any introduction to PCA is the term: unsupervised learning technique. What this means is that there are samples of data, like in the previous section: images. One image is one sample. When the whole project starts off with taking pictures, there is initially no label on a sample. In other words, nothing states whether an image is OK or NOK; it is just a sample. It is unsupervised because the aim is to derive some classification knowledge or consequential action from this unlabelled data. There is no backpropagation or correction algorithm involved that solves the task guided by the supervision of a label. The absence of this label - often described as target or y - is what the term unsupervised refers to. When applied via Eigenfaces or Fisherfaces, labels are definitely used, but PCA itself in its original form does not need any labels to perform the calculation. Eigenfaces and Fisherfaces sit on top of PCA, as shown in figure 1.

PCA and Eigenfaces deployment schema
Fig.1 - Model deployment schema.

The second phrase that is very important when it comes to PCA: linear transformation. What PCA basically does is transform each sample. From an information content point of view, it does not decrease or increase the valuable content of the image; it only changes its internal representation. Before and after PCA the information content of a sample remains the same, but the internal representation of the sample, here the image, does change. Just as one might state x = 3, a linear transformation might state 2x = 6, which does not change the statement's content. The general transformation rule is y = a * x + b. In our example a equals 2 and b is 0, so the resulting y is 6. In most use cases where PCA is applied, there is not a single dimension x transformed into y, but multiple dimensions x_1, x_2, ..., x_n correspondingly transformed into y_1, y_2, ..., y_n. For a single output dimension this means, for example: y_1 = a_1 * x_1 + a_2 * x_2 + ... + a_n * x_n + b, which is simply the projection of multiple input dimensions onto one output dimension, here y_1. After subtracting the mean and dividing by the standard deviation per feature, the incoming sample for the PCA algorithm looks like the image in figure 2. The algorithm itself receives it as a 1 x 40,000 vector instead of a 200 x 200 pixel array.

Normalized sample image
Fig.2 - Linearly transformed image with b = -mean and a = 1/standard deviation per feature. Normalized image/sample as the algorithm will receive the data.
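As a quick, purely hypothetical sketch in Julia (all numbers below are invented for illustration), the scalar example and the projection onto a single output dimension might look like this:
# Scalar case: y = a * x + b does not change the information content, only the representation
x = 3.0
a, b = 2.0, 0.0
y = a * x + b                       # 6.0

# Multi-dimensional case: one output dimension as a weighted sum of all input dimensions
x_vec = [1.0, 2.0, 3.0]             # a toy sample with three features
a_vec = [0.5, -1.0, 2.0]            # toy projection weights
b_1   = 0.0
y_1   = sum(a_vec .* x_vec) + b_1   # projection of all inputs onto the output dimension y_1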

Depending on the software implementation of the algorithm, a so-called normalization is applied. It is not mandatory from a computational point of view, but it matters for the quality of the results. Normalization refers to the sample's features: each feature can (or has to) be normalized. So what is a feature? Coming back to the image, each pixel within a single image is the realisation of one feature within that sample. This means there are (m x n) features to be normalized. Mathematically there are multiple normalization approaches, and depending on the programming language or library used for PCA, different options are available. A common default is Z-score normalization. It is realized by first calculating the mean per feature over all samples; similarly, the standard deviation per feature is calculated over all samples. The results are two vectors with dimension (1 x features).

Written in the Julia programming language, this might look like the following:
# Statistics provides mean and std
using Statistics

# Vectorize input data to an n x m matrix
# n: number of samples, m: number of features
data = Array{Float64}(reshape(images, (n_images, n_features)));

# Calculate mean and standard deviation per feature
mean_vector = mean(data, dims=1);
std_vector = std(data, dims=1);

println("Dataset (samples x features): ", size(data))
println("Feature mean vector (1 x features): ", size(mean_vector))
println("Feature standard deviation vector (1 x features): ", size(std_vector))

In our specific case, the dataset, called data, corresponds to an array of samples x 40,000 features; 40,000 because each sample consists of 200 x 200 pixels.
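Building on that snippet, the actual Z-score normalization could then be applied as sketched below (assuming the data, mean_vector and std_vector variables from above; the small eps_guard is an assumed guard against constant features with zero standard deviation):
# Z-score normalization: per feature, subtract the mean and divide by the standard deviation
eps_guard = 1e-8   # guards against division by zero for constant features
data_normalized = (data .- mean_vector) ./ (std_vector .+ eps_guard);

println("Normalized dataset (samples x features): ", size(data_normalized))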

As the name already reveals, the data contains some principal components. A more detailed description of them will follow, but in general they describe the contribution of variance of a dimension within the data. Those principal components do not necessarily map 1-to-1 to a feature in the original space. In case the data fulfills some requirements, described in a following article, multiple features will be encoded onto one principal component.
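One possible way to make this "contribution of variance" concrete is sketched below: eigendecompose the covariance matrix of the normalized data and look at the share of variance carried by each principal component. It assumes the data_normalized matrix from above and is only a conceptual sketch; with 40,000 features the full covariance matrix becomes very large, and real implementations usually resort to the SVD instead.
using LinearAlgebra

# Covariance matrix of the normalized data (features x features)
C = cov(data_normalized);

# Eigendecomposition: eigenvectors are the principal components,
# eigenvalues are the variances along them
evals, evecs = eigen(Symmetric(C));

# Sort descending, so that the first component carries the most variance
order = sortperm(evals, rev=true);
evals = evals[order];
evecs = evecs[:, order];

# Share of the total variance explained by each principal component
explained_variance_ratio = evals ./ sum(evals);
println("Variance share of the first component: ", explained_variance_ratio[1])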

Last but not least, probably the most important concept here: the covariance matrix. It describes the linear dependency between features. Example: if one feature, here a pixel, is positively correlated with another feature (another pixel), a rising (brighter) grayscale value of the first pixel will come along with a rising grayscale value of the other one. A later post will highlight this very important field in data science.
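As a tiny, made-up illustration (the grayscale values below are invented and do not come from the dataset), the covariance between two pixel features across five samples might look like this:
# Grayscale values of two pixel positions observed across five samples (invented numbers)
pixel_a = [0.10, 0.35, 0.50, 0.70, 0.90];
pixel_b = [0.15, 0.30, 0.55, 0.65, 0.85];

# A positive covariance: when pixel_a gets brighter, pixel_b tends to get brighter as well
println("cov(pixel_a, pixel_b) = ", cov(pixel_a, pixel_b))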

Just to be said already: PCA aims to drive those covariances towards zero and to maximize the variances by linearly transforming the dataset. This is the overall target.
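To make this target tangible, the sketch below (again assuming data_normalized and evecs from the sketches above) projects the samples onto the eigenvectors and inspects the covariance matrix of the result: the off-diagonal covariances vanish up to numerical noise, while the variances end up on the diagonal, sorted from largest to smallest.
# Project the normalized samples onto the principal components (a linear transformation)
projected = data_normalized * evecs;

# Covariance matrix in the new representation
C_projected = cov(projected);

# Off-diagonal entries (covariances) are numerically zero ...
println("Largest off-diagonal covariance: ",
        maximum(abs.(C_projected - Diagonal(diag(C_projected)))))
# ... while the diagonal holds the variances, in descending order
println("First variances on the diagonal: ", diag(C_projected)[1:3])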