A common prerequisite for finally applying the algorithm is a dataset in a fixed format.
In the given context this refers to an (m x n) matrix. Each sample contributes to m, meaning each sample is listed
as a row entry in the dataset. The columns represent the n features, so each
pixel corresponds to one feature. The actual pixel value is the realisation of that feature.
In the case of OK/NOK cat face detection
the best-case scenario is an ever-increasing number of
samples and therefore a dynamically growing m. Best case,
because data scientists tend to say: the more data, the better. Of course it is not just the quantity, but also
the quality of each single data sample that matters in the end. The subset used here contains
109 samples, which is a very small dataset so far. Regarding
the columns, the samples are (200 x 200)-pixel images, resulting in 40,000 features per sample.
Each sample is reshaped from a (200 x 200) matrix into a (1 x 40,000) vector. With normalization already applied,
the following code snippet gives a glance at the preprocessed dataset for PCA. So far, so good.
display(z_data)
109×40000 Matrix{Float64}:
0.0751113 0.0767358 0.0267347 … 0.744265 0.858365 1.00929
0.488294 0.205169 -0.556082 1.92465 1.97619 1.33883
-1.02671 -1.11127 -1.09033 0.100417 0.243562 0.295279
-0.491102 -0.709917 -0.944626 0.154071 0.299454 0.350202
-0.276859 -0.549375 -0.523703 -1.13363 -2.10387 -1.90167
1.32996 1.37712 1.35426 … -1.02632 -1.09783 -1.07782
1.32996 1.37712 1.35426 -1.13363 -1.09783 -0.967969
1.29936 1.37712 1.35426 -0.91901 -0.874261 -0.803198
-1.05731 -1.09522 -1.05795 0.154071 0.13178 0.240355
-0.873678 -1.19154 -0.84749 0.744265 0.6348 0.734669
1.43708 1.47345 1.20856 … 4.82197 4.37951 4.46949
-0.93489 -0.902566 -0.84749 0.261379 0.13178 0.295279
0.67193 0.63863 0.674308 1.81735 1.86441 1.77822
⋮ ⋱
-0.475799 -0.56543 -1.04176 0.261379 0.467127 0.46005
1.34526 1.39317 1.37045 -2.04574 -2.27154 -2.12137
1.34526 1.39317 1.37045 -0.704394 -0.874261 -0.748274
1.31466 1.02393 1.27331 … 0.207725 0.299454 0.295279
-1.21034 -1.33603 -1.34936 0.315033 0.243562 0.350202
-0.827769 -0.91862 -0.944626 0.744265 0.746583 0.844517
-0.73595 -0.629646 -0.539892 -1.13363 -0.259458 -0.25396
0.350566 0.221223 -0.0218333 0.207725 0.243562 0.185431
-0.93489 -1.52868 -0.118969 … -0.489778 -0.594805 -0.0342644
1.36057 1.40923 1.38664 0.744265 0.690692 0.46005
1.34526 1.39317 1.37045 -0.221508 -0.0358934 0.130507
1.32996 1.37712 1.35426 0.0467627 0.0199978 0.0755832
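The reshaping and normalization step described above can be sketched as follows. This is a minimal, self-contained sketch: the variable names (`images`, `z_data`) and the random stand-in data are assumptions, not the actual pipeline code.

```julia
using Statistics

# Hypothetical stack of m = 109 grayscale images, each 200×200 pixels.
# Random values stand in for the real cat face data so the snippet runs
# on its own.
m, h, w = 109, 200, 200
images = rand(Float64, h, w, m)

# Flatten each (200 × 200) image into one row of length 40 000.
X = Array{Float64}(undef, m, h * w)
for i in 1:m
    X[i, :] = vec(images[:, :, i])
end

# Column-wise z-score normalization: each feature (pixel position)
# ends up with zero mean and unit standard deviation across the samples.
μ = mean(X, dims=1)
σ = std(X, dims=1)
z_data = (X .- μ) ./ σ

size(z_data)  # (109, 40000), matching the matrix shown above
```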
Finally, PCA looks at the variance of that dataset and assumes a linear relationship
between the stated features.
In case there is actually a non-linear relationship, the accuracy will simply drop
and the final model is just worse: reality is then not projected into the mathematical
representation the model defines.
In general, though, nothing prevents one from applying the algorithm anyway.
It is a matter of rejecting the model in case of low performance when it
comes to assessing how closely the model matches reality. An example will clarify this.
The following example is probably the most frequently stated and basic PCA intro example. It is used here
because there is really no better alternative for an introduction.
Let's think of this 2-dimensional example as a graphical visualization of a number of
samples m, as in the cat images example. So there are m green points in the plot. Each
of those samples represents a measurement with 2 sensors. Instead of a camera, which
one might think of as a single sensor with n measurements at once, the two sensors are simply
synchronized and measure two individual properties, or features, at the same time. Those two properties
might just be the temperature T on the x-axis and the humidity H on the y-axis.
At first sight, and with further calculations deferred for later, one might conclude a
directly proportional relationship between the two: if the temperature rises, the humidity does so, too.
There is noise and variance, but in general this seems to be the case.
So, for a simple project we might want to measure those two properties to feed some application
with data. One might think of automated watering for some plants on a balcony, where
the amount of water follows a function with temperature T and humidity H as parameters. Of
course, for a simple application and a one-time project, it is fine to set up two sensors and
integrate their data with, for example, an Arduino or some other microcontroller. In an
industrial context, where not just the fun of building but also economic aspects come into play,
one might simplify the measurement setup by removing e.g. the humidity sensor to save resources.
But why can we do this? To justify this reduction of the measurement setup, we could prove it empirically.
A database of measurements describes reality and then serves as the proof for this decision, because
one can state that it is not necessary to have both sensors as long as there is a linear relationship.
A small error between the linear model and the real measurements is neglected, though: either there is another
relationship on top of the dominant linear one, which is not captured here, or it is just noise, which is
negligible as well. It remains an engineering task to decide how deeply one wants to
investigate the subject, or to simply accept the quality reached with a linear approach.
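The empirical check described above can be sketched with a simple least-squares fit: predict H from T alone and look at the residual spread. The sensor data below is synthetic and the slope/intercept values are illustrative assumptions, not real measurements.

```julia
using Statistics

# Hypothetical sensor log: humidity H follows temperature T almost
# linearly, plus measurement noise.
m = 200
T = 15 .+ 10 .* rand(m)            # temperatures between 15 °C and 25 °C
H = 2.0 .* T .+ 30 .+ randn(m)     # assumed relationship H ≈ 2T + 30, noisy

# Ordinary least squares via the backslash operator: H ≈ a*T + b
A = [T ones(m)]
a, b = A \ H

# The residual spread tells us how much we lose by predicting H from T
# alone, which is the empirical justification for dropping the sensor.
residuals = H .- (a .* T .+ b)
println("slope ≈ $(round(a, digits=2)), intercept ≈ $(round(b, digits=2))")
println("residual std ≈ $(round(std(residuals), digits=2))")
```

If the residual standard deviation is small compared to the spread of H itself, the single-sensor setup loses little information.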
But how is this example now related to PCA? Well, PCA discovers this relationship and
delivers a mathematical foundation for the decision made in the previous passage. One could argue, though,
that it is easy to see by just visualizing the data, as is done in figure 1. This is a valid
statement, but first of all, something has to be visualized and examined manually. For a hobby project
this is fine, but for a justification in front of colleagues or customers it is not. And
in case there are 10 measured features, or 100, or a thousand, or even more, it is
really not practicable to test all combinations and check by some kind of subjective judgement
whether there is a linear relationship. The perfect relationship without any disturbances, which might just consist in
wind speed or just lighting, is shown by the red samples in figure 2.
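How PCA discovers this relationship automatically can be sketched on the same two-sensor data: center the matrix and take its SVD, whose right singular vectors are the principal directions. Again the data is synthetic and the names are assumptions for illustration.

```julia
using Statistics, LinearAlgebra

# Hypothetical two-sensor data as before: T and H, strongly correlated.
m = 200
T = 15 .+ 10 .* rand(m)
H = 2.0 .* T .+ 30 .+ randn(m)
X = [T H]                          # (m × 2) data matrix, one sample per row

# Center the data, then let the SVD expose the principal directions.
Xc = X .- mean(X, dims=1)
U, S, V = svd(Xc)

# The squared singular values are proportional to the variance captured
# by each principal component.
explained = S.^2 ./ sum(S.^2)
println("PC1 direction: ", round.(V[:, 1], digits=3))
println("variance explained per component: ", round.(explained, digits=3))
# With an almost linear relationship, PC1 should capture nearly all
# of the variance, without anyone having to eyeball a scatter plot.
```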