Preprocessing

Raman data typically require preprocessing prior to analysis to ensure consistency and comparability across spectra. Common preprocessing steps include cropping, despiking, smoothing, baseline correction, and normalization. These operations help to reduce noise, correct artifacts, and standardize the spectral data.

In PyFasma, all preprocessing functions are implemented in the numpyfuncs module and made available for use on pandas DataFrames using the dffuncs module. These functions can be accessed via the .pyfasma accessor attached to the DataFrames. The general syntax for applying a preprocessing method to a DataFrame df is:

df.pyfasma.<method>

where <method> is the preprocessing method to be applied.

To enable the .pyfasma accessor, first import the dffuncs module:

from pyfasma import dffuncs

Note

The DataFrame must be structured with Raman shift values as the index and spectra as columns.
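
For reference, a DataFrame with the expected layout can be constructed as follows. The wavenumber grid, column names, and random intensities are purely illustrative:

    import numpy as np
    import pandas as pd

    # Raman shift values as the index, one column per spectrum (illustrative data)
    raman_shift = np.linspace(200, 3000, 1401)
    rng = np.random.default_rng(0)
    df = pd.DataFrame(
        {
            "sample_01": rng.normal(1000, 50, raman_shift.size),
            "sample_02": rng.normal(1000, 50, raman_shift.size),
        },
        index=pd.Index(raman_shift, name="Raman shift (cm-1)"),
    )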

Cropping

Cropping is typically the first preprocessing step and involves selecting a specific spectral range of interest by removing values outside a defined wavenumber interval. This step reduces data size and focuses the analysis on the most informative regions, such as the fingerprint or high-wavenumber regions in Raman spectra.

In PyFasma, cropping is performed using the df.pyfasma.crop method, available through the .pyfasma accessor. The function selects only the rows (wavenumbers) within the specified range.

Example

  • The following command can be used to crop the spectra, retaining only the region between 400 and 1800 cm-1:

    df_cropped = df.pyfasma.crop(xrange=[400, 1800])
    

The method is non-destructive and returns a new DataFrame containing only the selected spectral region.
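
Conceptually, cropping amounts to selecting the rows of the DataFrame whose index falls inside the requested interval. A minimal sketch of the equivalent operation in plain pandas (assuming a numeric, sorted wavenumber index) is:

    # Keep only rows whose Raman shift lies between 400 and 1800 cm-1
    mask = (df.index >= 400) & (df.index <= 1800)
    df_cropped_manual = df.loc[mask]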

Despiking

Despiking removes sharp, isolated intensity artifacts from Raman spectra, often caused by cosmic rays or detector noise. These spikes can interfere with baseline correction, normalization, and subsequent analysis, so they are typically removed early in the preprocessing pipeline.

In PyFasma, despiking is performed using the df.pyfasma.despike method, available through the .pyfasma accessor. Internally, it uses scipy.signal.find_peaks() to detect and remove peaks based on customizable parameters such as height, prominence, and width.

Examples

  • The following command removes both positive and negative spikes using default settings:

    df_despiked = df.pyfasma.despike()
    
  • To remove only strong positive spikes with a minimum height of 500:

    df_despiked = df.pyfasma.despike(spikes_type="pos", height=500)
    

The method is non-destructive and returns a new DataFrame with spikes removed from each spectrum (column).
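
For intuition, the sketch below shows one common way to handle a cosmic-ray spike in a single spectrum: locate narrow, prominent peaks with scipy.signal.find_peaks and bridge each one by linear interpolation over its neighbors. This illustrates the general idea only; it is not PyFasma's exact implementation, and the prominence threshold is arbitrary:

    import numpy as np
    from scipy.signal import find_peaks

    y = df.iloc[:, 0].to_numpy().copy()        # one spectrum (first column)
    spikes, _ = find_peaks(y, prominence=500)  # illustrative prominence threshold
    for p in spikes:
        left = max(p - 3, 0)
        right = min(p + 3, y.size - 1)
        # Replace the spike region with a straight line between its neighbors
        y[left:right + 1] = np.linspace(y[left], y[right], right - left + 1)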

See full parameter list and details

For more control over the spike detection criteria, refer to the full method documentation:
pyfasma.dffuncs.PyfasmaAccessor.despike()

Smoothing

Smoothing reduces noise in Raman spectra by averaging signal fluctuations while preserving important features like peaks. It is a common preprocessing step before baseline correction or peak analysis, especially when spectra exhibit high-frequency noise.

In PyFasma, smoothing is performed using the df.pyfasma.smooth method, available through the .pyfasma accessor. The function supports several smoothing algorithms, including Savitzky-Golay, moving average, and Gaussian filtering. The smoothing behavior is controlled by the kind and params arguments.

Examples

  • Apply a Savitzky-Golay filter with a window length of 11 and a polynomial order of 3:

    df_smooth = df.pyfasma.smooth(params=[11, 3], kind="savgol")
    
  • Apply a simple moving average with a window length of 15:

    df_smooth = df.pyfasma.smooth(params=[15], kind="movav")
    
  • Apply Gaussian smoothing with a standard deviation of 2.5:

    df_smooth = df.pyfasma.smooth(params=[2.5], kind="gauss")
    

The method is non-destructive and returns a new DataFrame with smoothed spectra. The choice of filter and its parameters should depend on the level and nature of noise in your data.
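
As a point of reference, the Savitzky-Golay case corresponds to scipy.signal.savgol_filter applied column by column. A minimal sketch using the same parameters as the first example above (window length 11, polynomial order 3) is:

    import pandas as pd
    from scipy.signal import savgol_filter

    # Smooth every spectrum (column) with a Savitzky-Golay filter
    df_smooth_manual = df.apply(
        lambda col: pd.Series(savgol_filter(col.to_numpy(), 11, 3), index=col.index)
    )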

See full parameter list and details

For all supported smoothing methods and their required parameters, refer to the full method documentation:
pyfasma.dffuncs.PyfasmaAccessor.smooth()

Baseline Correction

Baseline correction removes background signal from Raman spectra, often caused by fluorescence or system artifacts. This step is essential for accurate peak detection and quantification.

In PyFasma, baseline correction is performed using the df.pyfasma.baseline_correct method, available through the .pyfasma accessor. Internally, it leverages the pybaselines library by Donald Erb and supports several popular algorithms, including IModPoly [1], SNIP [2], and airPLS [3].

Examples

  • Apply the default SNIP algorithm:

    df_corrected = df.pyfasma.baseline_correct()
    
  • Use the IModPoly method with a second-order polynomial:

    df_corrected = df.pyfasma.baseline_correct(kind="imodpoly", poly_order=2)
    
  • Apply the airPLS method with a smoother baseline (higher lam):

    df_corrected = df.pyfasma.baseline_correct(kind="airpls", lam=1e7)
    

The method is non-destructive and returns a new DataFrame with the estimated baseline subtracted from each spectrum. By default (zero_correction=True), PyFasma also shifts the estimated baseline vertically so that it stays below the input spectrum and does not clip the signal.
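
For context, the sketch below fits an airPLS baseline to a single spectrum directly with pybaselines (assuming pybaselines >= 1.0 and its Baseline interface). PyFasma applies this kind of fit to every column, but the exact wrapper details may differ:

    from pybaselines import Baseline

    x = df.index.to_numpy()
    y = df.iloc[:, 0].to_numpy()             # one spectrum (first column)

    fitter = Baseline(x_data=x)
    baseline, _ = fitter.airpls(y, lam=1e7)  # larger lam gives a smoother baseline
    y_corrected = y - baseline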

See full parameter list and details

Each algorithm accepts its own set of fine-tuning parameters. For advanced usage, refer to the full method documentation:
pyfasma.dffuncs.PyfasmaAccessor.baseline_correct()

References

  • [1] Zhao, J., et al. Automated Autofluorescence Background Subtraction Algorithm for Biomedical Raman Spectroscopy, Applied Spectroscopy, 2007, 61(11), 1225-1232.

  • [2] Morháč, M. An algorithm for determination of peak regions and baseline elimination in spectroscopic data. Nuclear Instruments and Methods in Physics Research A, 2009, 600, 478-487.

  • [3] Zhang, Z.M., et al. Baseline correction using adaptive iteratively reweighted penalized least squares. Analyst, 2010, 135(5), 1138-1146.

Normalization

Normalization scales Raman spectra to make them comparable across samples, regardless of their absolute intensities. This is essential for preprocessing pipelines where differences in overall intensity could obscure relevant spectral features.

In PyFasma, normalization is performed using the df.pyfasma.normalize method, available through the .pyfasma accessor. Multiple normalization strategies are supported, such as max-intensity, area under curve, L1/L2 norms, min-max scaling, and mean absolute deviation (MAD). For intensity and area-based normalization, the spectral x-axis must be defined.

Examples

  • Normalize each spectrum by its maximum value:

    df_norm = df.pyfasma.normalize(kind="intensity")
    
  • Normalize based on the area under the curve between 100 and 1800 cm-1:

    df_norm = df.pyfasma.normalize(kind="area", xrange=[100, 1800])
    
  • Normalize each spectrum using the L2 norm (Euclidean norm):

    df_norm = df.pyfasma.normalize(kind="l2")
    

The method is non-destructive and returns a new DataFrame with normalized columns.
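
To make the arithmetic concrete, the sketch below reproduces max-intensity and area normalization with plain numpy/pandas, using trapezoidal integration over the full wavenumber index. The results should be comparable to kind="intensity" and kind="area", although PyFasma's exact conventions (e.g. handling of xrange) may differ:

    import numpy as np

    # Max-intensity: divide each spectrum (column) by its maximum value
    df_max_norm = df / df.max(axis=0)

    # Area: divide each spectrum by the trapezoidal area under it
    areas = np.trapz(df.to_numpy(), x=df.index.to_numpy(), axis=0)
    df_area_norm = df.div(areas, axis=1)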

See full parameter list and details

For all normalization options, including those requiring a spectral axis (x), refer to the full method documentation:
pyfasma.dffuncs.PyfasmaAccessor.normalize()

Interpolation

Interpolation resamples Raman spectra to a new set of x-values. This is often necessary when combining spectra measured at slightly different spectral points, or when aligning data onto a common wavenumber axis for downstream analysis. Interpolation estimates the y-values (intensities) at new x-locations using a mathematical function fitted to the original data.

In PyFasma, interpolation is performed using the df.pyfasma.interpolate method, available through the .pyfasma accessor. It supports linear and cubic interpolation methods, depending on the desired trade-off between speed and smoothness.

Examples

  • Interpolate to a new axis with 1-unit steps using cubic interpolation (default):

    import numpy as np

    new_x = np.arange(100, 1800, 1)
    df_interp = df.pyfasma.interpolate(xnew=new_x)
    
  • Use linear interpolation for faster performance:

    df_interp = df.pyfasma.interpolate(xnew=new_x, kind="linear")
    

The method is non-destructive and returns a new DataFrame interpolated to the values in xnew.
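
For a single spectrum, the linear case amounts to a call to numpy.interp (cubic interpolation would instead use something like scipy.interpolate.CubicSpline). A minimal sketch, assuming the wavenumber index is sorted in increasing order:

    import numpy as np

    x_old = df.index.to_numpy()
    y_old = df.iloc[:, 0].to_numpy()        # one spectrum (first column)

    new_x = np.arange(100, 1800, 1)
    y_new = np.interp(new_x, x_old, y_old)  # linear interpolation onto the new axis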

See full parameter list and details

For additional options, refer to the full method documentation:
pyfasma.dffuncs.PyfasmaAccessor.interpolate()
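
Putting It Together

The steps above are usually chained into a single pipeline. The sketch below applies them in the conventional order, reusing parameter values from the earlier examples; the order and values are illustrative and should be adapted to your data:

    from pyfasma import dffuncs  # registers the .pyfasma accessor

    df_proc = df.pyfasma.crop(xrange=[400, 1800])
    df_proc = df_proc.pyfasma.despike()
    df_proc = df_proc.pyfasma.smooth(params=[11, 3], kind="savgol")
    df_proc = df_proc.pyfasma.baseline_correct()
    df_proc = df_proc.pyfasma.normalize(kind="intensity")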