# Principal Components Analysis (PCA)

The analysis of Raman spectral data often makes use of multivariate techniques, with [Principal Component Analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis) being one of the most widely used. PCA is an unsupervised dimensionality reduction method that can also be used to identify patterns in the data. It transforms the original variables into a new set of uncorrelated axes, known as Principal Components (PCs), which are ordered so that each successive component captures the greatest remaining variance in the dataset. Each PC is a weighted combination of the original variables, where the weights (called loadings) reflect the contribution of each variable. The transformed coordinates, or scores, represent the original data projected onto these new axes.

## Basic usage

With PyFasma, PCA is performed using the {class}`pyfasma.modeling.PCA` class.

Let’s assume that the data to be analyzed by PCA are stored in a DataFrame `df`, where the index corresponds to Raman shifts (features) and each column contains the intensity values of a single spectrum (sample). In this format, the DataFrame's shape is `(n_features, n_samples)`. For consistency with the underlying {class}`sklearn.decomposition.PCA` class, which expects input data of shape `(n_samples, n_features)`, the DataFrame must be transposed before being passed to the {class}`pyfasma.modeling.PCA` class.

For the example data in [`examples/all_samples.csv`](https://gitlab.com/lepa22/pyfasma/-/blob/main/examples/all_samples.csv), PCA is performed using the following code:

```python
import pyfasma.modeling as mdl

hue = [
    'Healthy' if 'Rbhf' in col else 'Osteoporotic'
    for col in df.columns
]

pca = mdl.PCA(df.T, hue=hue)
```

Here, `hue` assigns a class label to each sample based on its column name in the `df` DataFrame. Each class is then plotted in a different color.
In this example, spectra from healthy rabbits are identified by the substring `"Rbhf"` and are labeled as `"Healthy"`, while spectra without it (e.g., containing `"Rbof"`) are labeled `"Osteoporotic"`.

:::{tip}
Although the `hue` list could be manually defined, using a list comprehension with conditionals makes class assignment both flexible and easy to adapt to different naming conventions.
:::

:::{warning}
The data in [`examples/all_samples.csv`](https://gitlab.com/lepa22/pyfasma/-/blob/main/examples/all_samples.csv) represent only a subset of those used in the [PyFasma paper](https://doi.org/10.1039/D5AN00452G). As a result, the PCA results here will differ slightly from those in the published paper.
:::

The code above creates a `pca` object that provides several attributes along with a PCA summary plot:

[![PCA summary plot][1a]][1b]

This summary plot includes:

- A scree plot for the first 10 principal components (PCs) (the number can be adjusted by using the `n_components` parameter upon class initialization).
- Loadings plots for the first three PCs.
- Scores plots for the first three PCs, shown in three ways:
  - Scatter plots of PC pairs with 95% confidence ellipses for each class (lower off-diagonal).
  - 2D density plots of PC pairs (upper off-diagonal).
  - 1D density plots for each PC (diagonal).

By default, the summary plot is shown when the `pca` object is created. To disable it, pass `summary=False` during initialization.

## Class parameters

The {class}`pyfasma.modeling.PCA` class accepts both PyFasma-specific parameters and all parameters of {class}`sklearn.decomposition.PCA`.

### PyFasma parameters

- **n_components** *({int, float, 'mle', None}, optional)*
  - Number of principal components to compute and display in the summary plot.
- **hue** *(list or None, optional)*
  - Class labels for coloring samples in scores plots.
- **summary** *(bool, default=True)*
  - Whether to display the summary plot automatically when the object is created.
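The `hue` parameter expects one label per sample, in the same order as the rows of the transposed DataFrame. The substring-based assignment pattern can be sketched on its own with hypothetical column names (no PyFasma required):

```python
# Hypothetical column names following the naming convention used above.
columns = ['Rbhf_01', 'Rbhf_02', 'Rbof_01', 'Rbof_02']

# Assign a class label to each sample based on a substring of its name.
hue = ['Healthy' if 'Rbhf' in col else 'Osteoporotic' for col in columns]

print(hue)  # ['Healthy', 'Healthy', 'Osteoporotic', 'Osteoporotic']
```

The same comprehension adapts to any naming scheme by swapping the substring test, e.g. matching on a prefix with `col.startswith(...)`.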
### Additional parameters

Any extra keyword arguments are passed directly to {class}`sklearn.decomposition.PCA` (e.g., `svd_solver`, `random_state`, `whiten`).

:::{note}
For more details refer to the {class}`pyfasma.modeling.PCA` documentation.
:::

## PCA plots

The PCA summary plot and all of its individual components can be accessed directly from the `pca` object. Each plot is created with Matplotlib and Seaborn, accepts commonly used keyword arguments for customization, and returns `fig` and `ax` objects so you can further adjust or save the figure. Beyond the plots included in the summary, several other plots are available for more detailed exploration.

### Scree plots

The scree plot summarizes how much variance is explained by each principal component. It is useful for deciding how many components to retain by identifying the point where additional components contribute little to the total variance (the "elbow").

The following code produces the default scree plot, showing both the explained variance ratio and the cumulative explained variance ratio:

```python
fig, ax = pca.scree_plot()
```

[![Default scree plot][2a]][2b]

The `show_line` option controls which lines are drawn: `'all'` (default) shows both the explained variance ratio and the cumulative explained variance ratio, `'exp_var_rat'` shows only the explained variance ratio, and `'cum_exp_var_rat'` shows only the cumulative explained variance ratio. The following code produces a scree plot containing only the explained variance ratio:

```python
fig, ax = pca.scree_plot(show_line='exp_var_rat')
```

[![Scree plot with explained variance ratio only][3a]][3b]

:::{note}
For more details and customization options refer to the {meth}`pyfasma.modeling.PCA.scree_plot` documentation.
:::

### Scores plots

Scores plots display the projection of each sample onto the principal components. They show how samples relate to each other in the reduced-dimensional space and can reveal clustering, trends, or outliers. They can be displayed as scatter plots, 2D density plots, or 1D density plots along each axis.

#### Scatter plots

The most basic scores scatter plot can be created using the `pca` object and the following code:

```python
fig, ax = pca.scores_plot()
```

[![A basic scores scatter plot][4a]][4b]

Since no arguments are given, PC2 is plotted against PC1 with default colors and symbols. Additionally, the 95% confidence ellipses are plotted for each class.

In addition to the standard customization options, the {meth}`pyfasma.modeling.PCA.scores_plot` method provides several plot-specific parameters. The principal components shown on the x- and y-axes can be selected with `xpc` and `ypc`. The percentage of explained variance can be toggled with `show_percent`. Confidence ellipses can be shown or hidden using `ellipse` and further customized with parameters of the form `ellipse_*`. Likewise, sample annotations can be enabled with `annotate` and customized with `annotate_*` options.

For example, the code below creates a scores plot of PC3 against PC2, with explained variance percentages hidden, ellipses disabled, and point labels enabled:

```python
fig, ax = pca.scores_plot(
    xpc=2,
    ypc=3,
    show_percent=False,
    ellipse=False,
    annotate=True
)
```

The result is the following plot:

[![Annotated scores scatter plot of PC2-PC3][5a]][5b]

As this plot shows, annotated scores plots can quickly become cluttered when many points overlap. Nevertheless, annotations are useful when you need to identify which samples correspond to specific points, and their appearance can be improved by adjusting the annotation settings.
For example, the code below reproduces the same plot as above, but restricted to the region PC2 < 0 and PC3 > 0 (this region is not of particular interest and is used only for demonstration). Annotations are drawn with a smaller font size and lighter color to reduce visual clutter.

```python
fig, ax = pca.scores_plot(
    xpc=2,
    ypc=3,
    show_percent=False,
    ellipse=False,
    annotate=True,
    annotate_fontsize=8,
    annotate_kwargs=dict(color='gray'),
    xlim=[None, 0],
    ylim=[0, None]
)
```

The result is the following plot:

[![Annotated scores scatter plot of PC2-PC3 zoomed][6a]][6b]

:::{note}
For more details and customization options refer to the {meth}`pyfasma.modeling.PCA.scores_plot` documentation.
:::

#### 2D density plots

#### 1D density plots

### Loadings plots

Loadings plots show how strongly each feature (Raman shift) contributes to a principal component. Large positive or negative loadings indicate spectral regions that have the greatest influence on that component, which can help interpret the sources of variation in the data and, in some cases, explain differences between sample groups.
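To make the relationship between loadings and scores concrete, here is a minimal NumPy sketch (independent of PyFasma, using random data in place of spectra) that computes the principal components from the SVD of the mean-centered data matrix. The rows of `Vt` are the loadings, and projecting the centered data onto them yields the scores:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))  # 20 samples, 5 features

Xc = X - X.mean(axis=0)       # mean-center each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

loadings = Vt                 # each row: feature weights of one PC
scores = Xc @ Vt.T            # samples projected onto the PCs

# Variance explained by each PC, in decreasing order.
explained_variance = S**2 / (X.shape[0] - 1)
```

The score columns are mutually uncorrelated, and their variances equal `explained_variance`, which is exactly what the scree plot summarizes.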
## PCA results as DataFrames

## See also

[1a]: ../../examples/pca_summary_plot_100dpi.png "PCA summary plot"
[1b]: ../../examples/pca_summary_plot_300dpi.png "PCA summary plot"
[2a]: ../../examples/scree_plot_100dpi.png "Default scree plot"
[2b]: ../../examples/scree_plot_300dpi.png "Default scree plot"
[3a]: ../../examples/scree_plot_exp_var_rat_100dpi.png "Scree plot with explained variance ratio only"
[3b]: ../../examples/scree_plot_exp_var_rat_300dpi.png "Scree plot with explained variance ratio only"
[4a]: ../../examples/basic_scores_scatter_plot_100dpi.png "A basic scores scatter plot"
[4b]: ../../examples/basic_scores_scatter_plot_300dpi.png "A basic scores scatter plot"
[5a]: ../../examples/pc2-pc3_scores_scatter_plot_w_annotation_100dpi.png "Annotated scores scatter plot of PC2-PC3"
[5b]: ../../examples/pc2-pc3_scores_scatter_plot_w_annotation_300dpi.png "Annotated scores scatter plot of PC2-PC3"
[6a]: ../../examples/pc2-pc3_scores_scatter_plot_w_annotation_zoomed_100dpi.png "Annotated scores scatter plot of PC2-PC3 zoomed"
[6b]: ../../examples/pc2-pc3_scores_scatter_plot_w_annotation_zoomed_300dpi.png "Annotated scores scatter plot of PC2-PC3 zoomed"