What you call PCA

Common ground for finally applying the algorithm is a dataset in a fixed format. In the given context this refers to an (m x n) matrix. Each sample contributes to m, meaning each sample is listed as a row entry of the dataset. The columns, however, hold the n features, which here means that each pixel refers to one feature. Each actual pixel value is then the realisation of that feature.

In the case of OK/NOK cat face detection, the best case scenario is an ever increasing number of samples and therefore a dynamically growing m. Best case, because data science people tend to say: the more data, the better. Of course it is not just the quantity, but also the quality of each single data sample that counts in the end. The subset used here contains 109 samples, which is a very, very small dataset so far. Regarding the columns, the samples are (200 x 200) pixel images, resulting in 40,000 features per sample. Each sample is therefore transformed from a (200 x 200) matrix into a (1 x 40,000) vector. With normalization already applied, the following code snippet gives a glance at the preprocessed dataset for PCA. So far, so good.

    display(z_data)

    109×40000 Matrix{Float64}:
      0.0751113   0.0767358   0.0267347  …   0.744265    0.858365    1.00929
      0.488294    0.205169   -0.556082       1.92465     1.97619     1.33883
     -1.02671    -1.11127    -1.09033        0.100417    0.243562    0.295279
     -0.491102   -0.709917   -0.944626       0.154071    0.299454    0.350202
     -0.276859   -0.549375   -0.523703      -1.13363    -2.10387    -1.90167
      1.32996     1.37712     1.35426    …  -1.02632    -1.09783    -1.07782
      1.32996     1.37712     1.35426       -1.13363    -1.09783    -0.967969
      1.29936     1.37712     1.35426       -0.91901    -0.874261   -0.803198
     -1.05731    -1.09522    -1.05795        0.154071    0.13178     0.240355
     -0.873678   -1.19154    -0.84749        0.744265    0.6348      0.734669
      1.43708     1.47345     1.20856    …   4.82197     4.37951     4.46949
     -0.93489    -0.902566   -0.84749        0.261379    0.13178     0.295279
      0.67193     0.63863     0.674308       1.81735     1.86441     1.77822
      ⋮                                  ⋱
     -0.475799   -0.56543    -1.04176        0.261379    0.467127    0.46005
      1.34526     1.39317     1.37045       -2.04574    -2.27154    -2.12137
      1.34526     1.39317     1.37045       -0.704394   -0.874261   -0.748274
      1.31466     1.02393     1.27331    …   0.207725    0.299454    0.295279
     -1.21034    -1.33603    -1.34936        0.315033    0.243562    0.350202
     -0.827769   -0.91862    -0.944626       0.744265    0.746583    0.844517
     -0.73595    -0.629646   -0.539892      -1.13363    -0.259458   -0.25396
      0.350566    0.221223   -0.0218333      0.207725    0.243562    0.185431
     -0.93489    -1.52868    -0.118969   …  -0.489778   -0.594805   -0.0342644
      1.36057     1.40923     1.38664        0.744265    0.690692    0.46005
      1.34526     1.39317     1.37045       -0.221508   -0.0358934   0.130507
      1.32996     1.37712     1.35426        0.0467627   0.0199978   0.0755832
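As a side note on how such a matrix comes about: the flattening and normalization described above could be sketched like this. The variable images and the random data are purely illustrative stand-ins for the real loading pipeline:

    using Statistics

    # Illustrative stand-in: 109 grayscale images as 200×200 Float64 matrices.
    images = [rand(200, 200) for _ in 1:109]

    # Flatten each (200 x 200) matrix into a (1 x 40000) row vector and
    # stack the rows into the (m x n) = (109 x 40000) data matrix.
    data = reduce(vcat, [vec(img)' for img in images])

    # Z-normalization: zero mean and unit standard deviation per feature.
    μ = mean(data, dims=1)
    σ = std(data, dims=1)
    z_data = (data .- μ) ./ σ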

Finally, PCA looks at the variance of that dataset and assumes a linear relationship between the stated features. If there is actually a non-linear relationship, the accuracy simply drops and the final model is just worse. Reality is then not well projected into the mathematical representation the model defines. In general, though, nothing prevents us from applying the algorithm anyway; it is a matter of rejecting the model in case of low performance, when assessing how closely it matches reality. An example will clarify this. The following example is actually the most commonly stated and basic PCA introduction example, used because there is really no better alternative for an introduction.
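Before moving on to that example, here is a minimal sketch of what running PCA on z_data could look like. MultivariateStats.jl is just one possible package choice, not necessarily the tooling used later in this project, and maxoutdim=10 is an arbitrary illustrative value:

    using MultivariateStats

    # The package expects a (features x samples) layout, so the
    # (109 x 40000) matrix is transposed to (40000 x 109).
    # mean=0 signals that the z-normalization already centered the data.
    M = fit(PCA, permutedims(z_data); mean=0, maxoutdim=10)

    # Variance captured by each principal component, largest first.
    println(principalvars(M))

    # Project all samples onto the principal components: a (10 x 109) matrix.
    Y = predict(M, permutedims(z_data))

The printed principalvars show directly how the variance distributes over the components. Now to the announced introduction example.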

Let's think about this 2-dimensional example as a graphical visualization of a number of samples m, like in the cat images example. So there are m green points in the plot. Each of those samples represents a measurement with 2 sensors. Instead of a camera, which one might think of as a single sensor taking n measurements at once, the two sensors are simply synchronized and measure two individual properties, or features, at the same time. Those two properties might be the temperature T on the x-axis and the humidity H on the y-axis. At first sight, and deferring further calculations for later, one might conclude a directly proportional relationship between the two: if the temperature rises, the humidity does so, too. There is noise and variance, but in general this seems to be the case.

[Figure: PCA simple 2d introduction with linear relationship]
Fig.1 - 2-dimensional sensor data distribution with linear relationship, noise and variance in green.
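For a feeling of how such a point cloud arises, the green samples of figure 1 could be generated synthetically like in the following sketch. Slope, offset and noise level are made up for illustration and not read off the figure:

    using Random

    Random.seed!(42)
    m = 100                        # number of samples (green points)
    T = 15 .+ 10 .* rand(m)        # temperature in an arbitrary range
    H = 2.0 .* T .+ 5 .+ randn(m)  # humidity follows T linearly, plus noise

    # Each row is one sample with the two features (T, H), analogous
    # to the (m x n) layout of the image dataset above.
    sensor_data = hcat(T, H)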

So, for a simple project we might want to measure those two properties to feed some application with data. One might think about an automated watering system for some plants on a balcony, where the amount of water follows a function with temperature T and humidity H as parameters. Of course, for a simple application and a one-time project, it is fine to set up two sensors and integrate their data, for example with an Arduino or some other microcontroller. In an industrial context, where not just the fun of building but also economic aspects come into play, one might simplify the measurement setup by removing e.g. the humidity sensor to save resources.

But why can we do this? To justify this reduction of the measurement setup, we need to prove it empirically. A database of measurements describes reality and then serves as the proof for this decision, because one can state that it is not necessary to have both sensors as long as there is a linear relationship between them. A small error between the linear model and the real measurements is neglected, though: either there is another relationship on top of the major linear one, which is not captured here, or it is just noise, which is negligible as well. It then remains an engineering task to decide how deeply one wants to investigate the subject, or to simply accept the quality reached with the linear approach.

But how is this example now related to PCA? Well, PCA discovers this relationship and delivers a mathematical foundation for the decision made in the previous passage. One could argue, though, that it is easy to see by just visualising the data, as done in figure 1. This is a valid point, but first of all, something has to be visualized and examined manually. For a hobby project this is fine, but as a justification in front of colleagues or customers it is not. And in case there are 10 measured features, or 100, or a thousand, or even more, it is really not practicable to test all combinations and check by some kind of subjective opinion whether there is a linear relationship. The perfect relationship without any disturbances, which might simply consist of wind speed or lighting, is shown by the red samples in figure 2.

[Figure: PCA simple 2d introduction with linear relationship and no disturbances]
Fig.2 - 2-dimensional sensor data distribution with linear relationship, noise and variance in green. Without disturbances, the suspected relationship is purely linear, visualized by the red scatter points.
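Applied to the synthetic sensor_data from the sketch above, PCA quantifies exactly this: almost all variance falls onto one direction, the red line of figure 2. Again only a sketch, reusing the hypothetical MultivariateStats.jl calls from before:

    using MultivariateStats, Statistics

    # Standardize both features, then fit PCA keeping a single component.
    Z = (sensor_data .- mean(sensor_data, dims=1)) ./ std(sensor_data, dims=1)
    M2 = fit(PCA, permutedims(Z); mean=0, maxoutdim=1)

    # Fraction of the total variance preserved by the first principal
    # component; a value close to 1 backs the one-sensor decision.
    println(principalratio(M2))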

Well, the red scatter points are the ones no real measurement process will ever generate. So the proof that humidity does not need to be measured for this application, in the worst case only valid at this specific balcony, can be delivered by PCA here.