# Preprocessing

Raman data typically require preprocessing prior to analysis to ensure consistency and comparability across spectra. Common preprocessing steps include cropping, despiking, smoothing, baseline correction, and normalization. These operations help to reduce noise, correct artifacts, and standardize the spectral data.

In PyFasma, all preprocessing functions are implemented in the {mod}`numpyfuncs <pyfasma.numpyfuncs>` module and made available for use on pandas DataFrames through the {mod}`dffuncs <pyfasma.dffuncs>` module. These functions can be accessed via the `.pyfasma` accessor attached to the DataFrames. The general syntax for applying a preprocessing method to a DataFrame `df` is:

```
df.pyfasma.<method>()
```

where `<method>` is the preprocessing method to be applied. To enable the `.pyfasma` accessor, first import the {mod}`dffuncs <pyfasma.dffuncs>` module:

```
from pyfasma import dffuncs
```

:::{note}
The DataFrame must be structured with Raman shift values as the index and spectra as columns.
:::

## Cropping

Cropping is typically the first preprocessing step and involves selecting a specific spectral range of interest by removing values outside a defined wavenumber interval. This step reduces data size and focuses the analysis on the most informative regions, such as the fingerprint or high-wavenumber regions in Raman spectra.

In PyFasma, cropping is performed using the {func}`df.pyfasma.crop <pyfasma.dffuncs.PyfasmaAccessor.crop>` method, available through the `.pyfasma` accessor. The function selects only the rows (wavenumbers) within the specified range.

### Example

- The following command can be used to crop the spectra, retaining only the region between 400 and 1800 cm{sup}`-1`:

  ```python
  df_cropped = df.pyfasma.crop(xrange=[400, 1800])
  ```

The method is non-destructive and returns a new DataFrame containing only the selected spectral region.

## Despiking

Despiking removes sharp, isolated intensity artifacts from Raman spectra, often caused by cosmic rays or detector noise. These spikes can interfere with baseline correction, normalization, and subsequent analysis, so they are typically removed early in the preprocessing pipeline.

In PyFasma, despiking is performed using the {func}`df.pyfasma.despike <pyfasma.dffuncs.PyfasmaAccessor.despike>` method, available through the `.pyfasma` accessor. Internally, it uses {func}`scipy.signal.find_peaks` to detect and remove peaks based on customizable parameters such as height, prominence, and width.

### Examples

- The following command removes both positive and negative spikes using default settings:

  ```python
  df_despiked = df.pyfasma.despike()
  ```

- To remove only strong positive spikes with a minimum height of 500:

  ```python
  df_despiked = df.pyfasma.despike(spikes_type="pos", height=500)
  ```

The method is non-destructive and returns a new DataFrame with spikes removed from each spectrum (column).

**See full parameter list and details**

For more control over the spike detection criteria, refer to the full method documentation: {func}`pyfasma.dffuncs.PyfasmaAccessor.despike`

## Smoothing

Smoothing reduces noise in Raman spectra by averaging out signal fluctuations while preserving important features such as peaks. It is a common preprocessing step before baseline correction or peak analysis, especially when spectra exhibit high-frequency noise.

In PyFasma, smoothing is performed using the {func}`df.pyfasma.smooth <pyfasma.dffuncs.PyfasmaAccessor.smooth>` method, available through the `.pyfasma` accessor. The function supports several smoothing algorithms, including Savitzky-Golay, moving average, and Gaussian filtering. The smoothing behavior is controlled by the `kind` and `params` arguments.
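Conceptually, these three options correspond to the standard filters available in SciPy and NumPy. The sketch below is for intuition only and operates on a plain 1-D array built from a synthetic noisy peak; it is not necessarily how PyFasma implements `smooth` internally, and the array `y` is just a placeholder for a single spectrum (e.g. one column of `df`):

```python
import numpy as np
from scipy.signal import savgol_filter
from scipy.ndimage import gaussian_filter1d

# Synthetic example: a single noisy peak sampled on a wavenumber axis
x = np.linspace(400, 1800, 701)
rng = np.random.default_rng(0)
y = np.exp(-((x - 1000) / 20) ** 2) + 0.05 * rng.standard_normal(x.size)

# Savitzky-Golay: fit a low-order polynomial in a sliding window
y_savgol = savgol_filter(y, window_length=11, polyorder=3)

# Moving average: uniform window of length 15
y_movav = np.convolve(y, np.ones(15) / 15, mode="same")

# Gaussian filtering: weighted average with standard deviation sigma = 2.5
y_gauss = gaussian_filter1d(y, sigma=2.5)
```

Larger windows (or sigma values) suppress more noise but also broaden and flatten narrow Raman bands, which is why the choice of parameters matters in the examples that follow.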
### Examples

- Apply a Savitzky-Golay filter with a window length of 11 and a polynomial order of 3:

  ```python
  df_smooth = df.pyfasma.smooth(params=[11, 3], kind="savgol")
  ```

- Apply a simple moving average with a window length of 15:

  ```python
  df_smooth = df.pyfasma.smooth(params=[15], kind="movav")
  ```

- Apply Gaussian smoothing with a standard deviation of 2.5:

  ```python
  df_smooth = df.pyfasma.smooth(params=[2.5], kind="gauss")
  ```

The method is non-destructive and returns a new DataFrame with smoothed spectra. The choice of filter and its parameters should depend on the level and nature of noise in your data.

**See full parameter list and details**

For all supported smoothing methods and their required parameters, refer to the full method documentation: {func}`pyfasma.dffuncs.PyfasmaAccessor.smooth`

## Baseline Correction

Baseline correction removes background signal from Raman spectra, often caused by fluorescence or system artifacts. This step is essential for accurate peak detection and quantification.

In PyFasma, baseline correction is performed using the {func}`df.pyfasma.baseline_correct <pyfasma.dffuncs.PyfasmaAccessor.baseline_correct>` method, available through the `.pyfasma` accessor. Internally, it leverages the [pybaselines](https://pybaselines.readthedocs.io/en/latest/) library by Donald Erb and supports several popular algorithms, including IModPoly [1], SNIP [2], and airPLS [3].

### Examples

- Apply the default SNIP algorithm:

  ```python
  df_corrected = df.pyfasma.baseline_correct()
  ```

- Use the IModPoly method with a second-order polynomial:

  ```python
  df_corrected = df.pyfasma.baseline_correct(kind="imodpoly", poly_order=2)
  ```

- Apply the airPLS method with a smoother baseline (higher `lam`):

  ```python
  df_corrected = df.pyfasma.baseline_correct(kind="airpls", lam=1e7)
  ```

The method is non-destructive and returns a new DataFrame with the estimated baseline removed from each spectrum. By default, PyFasma also vertically shifts the baseline to avoid clipping the signal (`zero_correction=True`), ensuring it remains below the input spectrum.

**See full parameter list and details**

Each algorithm accepts its own set of fine-tuning parameters. For advanced usage, refer to the full method documentation: {func}`pyfasma.dffuncs.PyfasmaAccessor.baseline_correct`

**References**

- [1] Zhao, J., et al. Automated Autofluorescence Background Subtraction Algorithm for Biomedical Raman Spectroscopy. Applied Spectroscopy, 2007, 61(11), 1225-1232.
- [2] Morháč, M. An algorithm for determination of peak regions and baseline elimination in spectroscopic data. Nuclear Instruments and Methods in Physics Research A, 2009, 60, 478-487.
- [3] Zhang, Z.M., et al. Baseline correction using adaptive iteratively reweighted penalized least squares. Analyst, 2010, 135(5), 1138-1146.

## Normalization

Normalization scales Raman spectra to make them comparable across samples, regardless of their absolute intensities. This is essential for preprocessing pipelines where differences in overall intensity could obscure relevant spectral features.

In PyFasma, normalization is performed using the {func}`df.pyfasma.normalize <pyfasma.dffuncs.PyfasmaAccessor.normalize>` method, available through the `.pyfasma` accessor. Multiple normalization strategies are supported, such as max-intensity, area under curve, L1/L2 norms, min-max scaling, and mean absolute deviation (MAD). For intensity and area-based normalization, the spectral x-axis must be defined.
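For intuition, the listed strategies correspond to the following standard definitions. The sketch below illustrates them on a single spectrum as a 1-D array; it is illustrative only, not PyFasma's internal implementation, and the arrays `x` and `y` are placeholders:

```python
import numpy as np
from scipy.integrate import trapezoid

# Placeholder data: x = wavenumber axis, y = one spectrum
x = np.linspace(100, 1800, 1701)
y = np.exp(-((x - 1000) / 50) ** 2) + 0.1

y_intensity = y / y.max()                           # max-intensity: maximum scaled to 1
y_area      = y / trapezoid(y, x)                   # area: integral over x equals 1
y_l2        = y / np.linalg.norm(y)                 # L2 (Euclidean) norm equals 1
y_minmax    = (y - y.min()) / (y.max() - y.min())   # min-max: rescale to [0, 1]
y_mad       = y / np.mean(np.abs(y - np.mean(y)))   # scale by mean absolute deviation
```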
### Examples

- Normalize each spectrum by its maximum value:

  ```python
  df_norm = df.pyfasma.normalize(kind="intensity")
  ```

- Normalize based on the area under the curve between 100 and 1800 cm{sup}`-1`:

  ```python
  df_norm = df.pyfasma.normalize(kind="area", xrange=[100, 1800])
  ```

- Normalize each spectrum using the L2 norm (Euclidean norm):

  ```python
  df_norm = df.pyfasma.normalize(kind="l2")
  ```

The method is non-destructive and returns a new DataFrame with normalized columns.

**See full parameter list and details**

For all normalization options, including those requiring a spectral axis (x), refer to the full method documentation: {func}`pyfasma.dffuncs.PyfasmaAccessor.normalize`

## Interpolation

Interpolation resamples Raman spectra onto a new set of x-values. This is often necessary when combining spectra measured at slightly different spectral points, or when aligning data onto a common wavenumber axis for downstream analysis. Interpolation estimates the y-values (intensities) at the new x-locations using a mathematical function fitted to the original data.

In PyFasma, interpolation is performed using the {func}`df.pyfasma.interpolate <pyfasma.dffuncs.PyfasmaAccessor.interpolate>` method, available through the `.pyfasma` accessor. It supports linear and cubic interpolation methods, depending on the desired trade-off between speed and smoothness.

### Examples

- Interpolate to a new axis with 1-unit steps using cubic interpolation (the default):

  ```python
  import numpy as np

  new_x = np.arange(100, 1800, 1)
  df_interp = df.pyfasma.interpolate(xnew=new_x)
  ```

- Use linear interpolation for faster performance:

  ```python
  df_interp = df.pyfasma.interpolate(xnew=new_x, kind="linear")
  ```

The method is non-destructive and returns a new DataFrame interpolated to the values in `xnew`.

**See full parameter list and details**

For additional options, refer to the full method documentation: {func}`pyfasma.dffuncs.PyfasmaAccessor.interpolate`
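## Putting It Together

Because every accessor method is non-destructive and returns a new DataFrame, the steps above can be chained into a single pipeline. The sketch below assumes a DataFrame `df` structured as described in the note at the top (Raman shift values as the index, spectra as columns); the ordering and parameter values are illustrative, not recommendations:

```python
import numpy as np
from pyfasma import dffuncs  # registers the .pyfasma accessor

# One possible preprocessing pipeline using the methods documented above;
# all parameter values are placeholders and should be tuned to your data.
df_processed = (
    df.pyfasma.crop(xrange=[400, 1800])                      # keep the fingerprint region
      .pyfasma.despike()                                     # remove cosmic-ray spikes
      .pyfasma.smooth(params=[11, 3], kind="savgol")         # Savitzky-Golay smoothing
      .pyfasma.baseline_correct()                            # default SNIP baseline
      .pyfasma.interpolate(xnew=np.arange(400, 1801, 1))     # common wavenumber axis
      .pyfasma.normalize(kind="area", xrange=[400, 1800])    # area normalization
)
```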