Making it small: preparing for PCA

Hardly anyone can bear to hear the well-worn and trite phrases from data science people anymore: "Data is the new gold", "90% of my work is preprocessing" and all the other default data science and data engineering platitudes. Nevertheless, they are quite correct.

So PCA may be seen as a preprocessing task itself, since it removes redundant features from the raw data. This is the case when some features are just slight variations of other features in the dataset. PCA can also remove noise, in case a dataset contains features that are pure noise and do not bring any information to the table. The final result is a reduced dataset for the actual task, like classification or regression - or, in our case, the Eigenfaces algorithm. Stated this way, it brings the concept of dimensionality reduction into the preprocessing stage.
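The redundancy argument can be made concrete on synthetic data (a minimal sketch, not the face images): if one feature is just a scaled, slightly noisy copy of another, the covariance matrix has one eigenvalue close to zero, so PCA can drop that direction with almost no loss of information.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
# second feature is a slight variation of the first, i.e. redundant
data = np.column_stack([x, 2 * x + 0.01 * rng.normal(size=100)])

# eigenvalues of the covariance matrix, in ascending order
cov = np.cov(data, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)

# nearly all variance lies in a single direction
print(eigenvalues)
```

The smaller eigenvalue is orders of magnitude below the larger one, which is exactly the kind of direction PCA discards.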

But before applying PCA as a preprocessing algorithm, some basic sanity checks and structuring of the raw data need to be done first - preprocessing for the preprocessing algorithm, as one might put it.

As mentioned in the second article of this series, standardization and normalization come into play. Both have a tremendous influence on the outcome of the algorithm and should not be skipped by default. While multiple realizations of the normalization concept are available, a standard approach is z-normalization, which is applied here to the images. The first preprocessing step reshapes each sample from a 200 x 200 matrix into a 1 x 40000 vector; the resulting N x 40000 dataset, where N stands for the number of images, is then z-transformed as a second step. In Julia and Python, the first reshaping step as preprocessing for PCA looks like the following code snippets.

              
                # Statistics provides mean and std
                using Statistics

                # Vectorize input data to an n x m matrix (samples x features)
                data = Array{Float64}(reshape(images, (n_images, n_features)));
                # Calculate mean and standard deviation per feature (column)
                mean_vector = mean(data, dims=1);
                std_vector = std(data, dims=1);

                println("Dataset (samples x features): ", size(data));
                println("Feature mean vector (1 x features): ", size(mean_vector));
                println("Feature standard deviation vector (1 x features): ", size(std_vector));
                println("Mean of first feature: ", mean_vector[1])
                println("Standard deviation of first feature: ", std_vector[1])
              
            

The full program can be executed after checking out the GitLab repository https://gitlab.com/PaulOberm/learn-julia-lang/-/tree/main and running the Jupyter notebook pca_2.ipynb. The output of the above code snippet should look like this:

              
                Dataset (samples x features): (109, 40000)
                Feature mean vector (1 x features): (1, 40000)
                Feature standard deviation vector (1 x features): (1, 40000)
                Mean of first feature: 0.6003597769382981
                Standard deviation of first feature: 0.2562605775630938
              
            

In general: the mean is calculated for each column/feature. Each feature's value per sample/row then has that feature's mean subtracted from it and is finally divided by the feature's standard deviation. Depending on how the library and language at hand implement the standard deviation, the results in the normalized dataset might vary. In Python it is important to take Bessel's correction into account: with a small number of samples, this correction counteracts the bias in the estimate of the standard deviation. For the NumPy package in Python, the ddof parameter is therefore set to 1 here. In Julia this is the default setting, whereas in NumPy the default for ddof is 0. The acronym ddof stands for Delta Degrees of Freedom.
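The effect of ddof is easy to see on a tiny hand-picked sample (a sketch, independent of the image data): with ddof=0 the sum of squared deviations is divided by n, with ddof=1 by n - 1, so the corrected estimate is always slightly larger.

```python
import numpy as np

sample = np.array([1.0, 2.0, 3.0, 4.0])

# population formula (NumPy's default, ddof=0): divide by n
std_biased = np.std(sample, ddof=0)
# Bessel's correction (ddof=1, Julia's default): divide by n - 1
std_corrected = np.std(sample, ddof=1)

print(std_biased)     # sqrt(5/4)
print(std_corrected)  # sqrt(5/3)
```

For large n the difference vanishes, but with only a handful of samples per feature it is clearly visible.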

              
                import numpy as np

                # Vectorize input data to an n x m matrix (samples x features)
                data = np.reshape(images, (n_images, n_features))
                # Calculate mean and standard deviation per feature (column),
                # with Bessel's correction (ddof=1) to match Julia's default
                mean_vector = np.mean(data, axis=0)
                std_vector = np.std(data, axis=0, ddof=1)

                print("Dataset (samples x features): ", data.shape)
                print("Feature mean vector (1 x features): ", mean_vector.shape)
                print("Feature standard deviation vector (1 x features): ", std_vector.shape)
                print("Mean of first feature: ", mean_vector[0])
                print("Standard deviation of first feature: ", std_vector[0])
              
            

The full Python program can be executed after checking out the same GitLab repository and running the same Jupyter notebook as above. The output of the above Python code snippet should look like this:

              
                Dataset (samples x features):  (109, 40000)
                Feature mean vector (1 x features):  (40000,)
                Feature standard deviation vector (1 x features):  (40000,)
                Mean of first feature:  0.6003597769382977
                Standard deviation of first feature:  0.2562605775630938
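With mean_vector and std_vector in place, the z-transform described above is a one-liner thanks to NumPy broadcasting. A sketch on synthetic data (random numbers stand in for the reshaped image dataset, which is assumed here):

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in for the reshaped image dataset (samples x features)
data = rng.random((109, 400))

mean_vector = np.mean(data, axis=0)
std_vector = np.std(data, axis=0, ddof=1)

# z-normalization: subtract each feature's mean, divide by its std
data_z = (data - mean_vector) / std_vector

# each column of the result has mean ~0 and standard deviation ~1
print(np.allclose(np.mean(data_z, axis=0), 0.0))
print(np.allclose(np.std(data_z, axis=0, ddof=1), 1.0))
```

This z-transformed matrix is what the actual PCA step of the Eigenfaces pipeline then operates on.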