dffuncs module

class pyfasma.dffuncs.PyfasmaAccessor(pandas_obj)

Bases: object

Accessor for pyfasma methods on pandas DataFrames and Series.

Access via df.pyfasma.method() or series.pyfasma.method().

baseline_correct(kind='snip', zero_correction=False, **params) pandas.DataFrame

Apply baseline correction to the columns of the DataFrame.

This method uses some of the most popular and efficient baseline correction algorithms, provided by the pybaselines project: https://pybaselines.readthedocs.io/en/latest/

Parameters:
  • kind ({'imodpoly', 'snip', 'airpls'}, optional) – The type of baseline correction to be applied.
    - ‘imodpoly’: apply baseline correction using the Improved Modified Polynomial (IModPoly) algorithm [1].
    - ‘snip’: apply baseline correction using the Statistics-sensitive Non-linear Iterative Peak-clipping (SNIP) algorithm [2].
    - ‘airpls’: apply baseline correction using the Adaptive Iteratively Reweighted Penalized Least Squares (airPLS) algorithm [3].

  • zero_correction (bool, optional) – If True, vertically offset the estimated baseline so that it lies below the input array. Default is False.

  • **params – Parameters for the selected baseline correction algorithm. Each algorithm has its own set of parameters that allow for finely adjusting the baseline. To keep things simple and focus on ease of use, only the most important parameters affecting baseline estimation are documented here for each algorithm. In most cases, there should be no need to adjust any of the other parameters. For all available parameters and their documentation, consult the respective section of pybaselines: https://pybaselines.readthedocs.io/en/latest/
    - ‘imodpoly’:
      x_data : np.ndarray, optional – The x-values of the measured data. Default is None, which will create an array from -1 to 1 with the same number of points as y.
      poly_order : int – The polynomial order for fitting the baseline. Default is 2.
      num_std : float – The number of standard deviations to include when thresholding. Default is 1.
      See more in: https://pybaselines.readthedocs.io/en/latest/api/pybaselines/polynomial/index.html#pybaselines.polynomial.imodpoly
    - ‘snip’:
      max_half_window : int or Sequence(int, int) – The maximum number of iterations. Should be set such that max_half_window is approximately (w-1)/2, where w is the index-based width of a feature or peak. max_half_window can also be a sequence of two integers for asymmetric peaks, with the first item corresponding to the max_half_window of the peak’s left edge, and the second item for the peak’s right edge. Default is None, which will use the output from pybaselines.utils.optimize_window(), which is an okay starting value.
      decreasing : bool – If False (default), will iterate through window sizes from 1 to max_half_window. If True, will reverse the order and iterate from max_half_window to 1, which gives a smoother baseline.
      See more in: https://pybaselines.readthedocs.io/en/latest/api/pybaselines/smooth/index.html#pybaselines.smooth.snip
    - ‘airpls’:
      lam : float – The smoothing parameter. Larger values will create smoother baselines. Default is 1e6.
      See more in: https://pybaselines.readthedocs.io/en/latest/api/pybaselines/whittaker/index.html#pybaselines.whittaker.airpls

Returns:

The baseline-corrected DataFrame.

Return type:

pd.DataFrame

References

[1] Zhao, J., et al. Automated Autofluorescence Background Subtraction Algorithm for Biomedical Raman Spectroscopy, Applied Spectroscopy, 2007, 61(11), 1225-1232.

[2] Morháč, M. An algorithm for determination of peak regions and baseline elimination in spectroscopic data. Nuclear Instruments and Methods in Physics Research Section A, 2009, 600, 478-487.

[3] Zhang, Z.M., et al. Baseline correction using adaptive iteratively reweighted penalized least squares. Analyst, 2010, 135(5), 1138-1146.
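To illustrate the clipping idea behind SNIP, here is a toy NumPy sketch (the helper snip_like_baseline is hypothetical and deliberately simplified; it is not the pybaselines implementation, which should be used in practice):

```python
import numpy as np

def snip_like_baseline(y, max_half_window):
    # Toy SNIP-style clipping (illustration only, NOT pybaselines.smooth.snip):
    # replace each point by the minimum of itself and the mean of its
    # m-distant neighbours, for window sizes m = 1 .. max_half_window.
    b = np.asarray(y, dtype=float).copy()
    n = len(b)
    for m in range(1, max_half_window + 1):
        clipped = b.copy()
        for i in range(m, n - m):
            clipped[i] = min(b[i], 0.5 * (b[i - m] + b[i + m]))
        b = clipped
    return b

x = np.linspace(0, 10, 200)
y = 0.5 * x + np.exp(-((x - 5.0) ** 2) / 0.1)  # linear background + one peak
baseline = snip_like_baseline(y, max_half_window=20)
corrected = y - baseline                       # background removed, peak preserved
```

Because the estimate only ever clips downward, the baseline stays at or below the input, so the corrected signal is non-negative with the peak left largely intact.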

crop(xrange=[None, None]) pandas.DataFrame

Crop the DataFrame to the specified x range. Here, x is the index of the DataFrame, which must be in ascending order and must contain only numerical values.

Parameters:

xrange (list) – The range of x (index) values to crop the DataFrame to. Must be a list of two values: xrange=[start, end]. The first value is the start value and the second is the end value of the crop range. If the start or end values are None, they are set equal to the first and last value of x, respectively.

Returns:

The cropped DataFrame.

Return type:

pd.DataFrame
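On a sorted numeric index, cropping amounts to label-based slicing; a minimal pandas-only sketch of the same operation:

```python
import numpy as np
import pandas as pd

x = np.arange(400.0, 410.0, 0.5)              # sorted numeric index (e.g. wavenumbers)
df = pd.DataFrame({"a": np.arange(len(x))}, index=x)

# Keep only the rows whose index lies within [start, end];
# .loc slicing on a sorted numeric index is inclusive at both ends.
start, end = 402.0, 406.0
cropped = df.loc[start:end]
```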

despike(spikes_type='all', height=None, threshold=None, distance=None, prominence=None, width=None, wlen=None, rel_height=0.5, plateau_size=None) pandas.DataFrame

Remove spikes from the columns of the DataFrame. The method is based on scipy.signal.find_peaks.

Parameters:
  • spikes_type ({'pos', 'neg', 'all'}, optional) – The type of spikes to be removed based on their amplitude. If ‘pos’, only remove positive spikes, i.e. spikes that present an increased amplitude. If ‘neg’, only remove negative spikes, i.e. spikes that present a decreased amplitude. If ‘all’ (default), remove both positive and negative spikes.

  • height (number or ndarray or sequence, optional) – Required height of peaks. Either a number, None, an array matching x or a 2-element sequence of the former. The first element is always interpreted as the minimal and the second, if supplied, as the maximal required height.

  • threshold (number or ndarray or sequence, optional) – Required threshold of peaks, the vertical distance to its neighboring samples. Either a number, None, an array matching x or a 2-element sequence of the former. The first element is always interpreted as the minimal and the second, if supplied, as the maximal required threshold.

  • distance (number, optional) – Required minimal horizontal distance (>= 1) in samples between neighbouring peaks. Smaller peaks are removed first until the condition is fulfilled for all remaining peaks.

  • prominence (number or ndarray or sequence, optional) – Required prominence of peaks. Either a number, None, an array matching x or a 2-element sequence of the former. The first element is always interpreted as the minimal and the second, if supplied, as the maximal required prominence.

  • width (number or ndarray or sequence, optional) – Required width of peaks in samples. Either a number, None, an array matching x or a 2-element sequence of the former. The first element is always interpreted as the minimal and the second, if supplied, as the maximal required width.

  • wlen (int, optional) – Used for calculation of the peaks prominences, thus it is only used if one of the arguments prominence or width is given. See argument wlen in peak_prominences for a full description of its effects.

  • rel_height (float, optional) – Used for calculation of the peaks width, thus it is only used if width is given. See argument rel_height in peak_widths for a full description of its effects.

  • plateau_size (number or ndarray or sequence, optional) – Required size of the flat top of peaks in samples. Either a number, None, an array matching x or a 2-element sequence of the former. The first element is always interpreted as the minimal and the second, if supplied, as the maximal required plateau size.

Returns:

The DataFrame with despiked columns.

Return type:

pd.DataFrame
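A sketch of the underlying idea using scipy.signal.find_peaks directly (an illustration, not the pyfasma implementation): flag narrow, high-prominence points as spikes and patch them by interpolation.

```python
import numpy as np
import pandas as pd
from scipy.signal import find_peaks

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 500)
y = np.sin(x) + 0.01 * rng.standard_normal(500)
y[100] += 5.0                                  # inject one positive spike
s = pd.Series(y, index=x)

# High prominence plus a small maximum width selects the spike but not
# the broad, genuine peaks of the underlying signal.
peaks, _ = find_peaks(s.to_numpy(), prominence=1.0, width=(None, 3))
cleaned = s.copy()
cleaned.iloc[peaks] = np.nan
cleaned = cleaned.interpolate(method="linear")  # patch spikes from neighbours
```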

differentiate(deriv=1, smoothing=None, params=None) pandas.DataFrame

Apply differentiation to the columns of the DataFrame. The values of each column are used to calculate dy, while the values of the DataFrame’s index are used to calculate dx.

Parameters:
  • deriv (int, optional) – Order of derivative to calculate.

  • smoothing (str, optional) – Smoothing filter to apply to the derivative. Can be one of the following:
    - ‘savgol’: applies a Savitzky-Golay filter.
    - ‘movav’: applies a moving average filter.
    - ‘gauss’: applies a 1-D Gaussian filter.
    If None (default), no filter will be applied.

  • params (list, optional) – The list of parameters to be used for the filtering method. The parameters depend on the selection of smoothing.
    - smoothing=’savgol’: requires two parameters: window_length (int), polyorder (int). window_length is the length of the filter window; it must be less than or equal to the size of x. polyorder is the order of the polynomial used to fit the samples; it must be less than window_length.
    - smoothing=’movav’: requires one parameter: window_length (int). window_length is the length of the filter window; it must be less than or equal to the size of x. Essentially, the moving average filter is a Savitzky-Golay filter of polyorder=0.
    - smoothing=’gauss’: requires one parameter: sigma (scalar). sigma is the standard deviation of the Gaussian kernel.

Returns:

The differentiated DataFrame.

Return type:

pd.DataFrame
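A plain-NumPy sketch of the dy/dx computation, with dx taken from the index (np.gradient is assumed here for illustration; the actual implementation may differ):

```python
import numpy as np
import pandas as pd

x = np.linspace(0, 2 * np.pi, 1000)
df = pd.DataFrame({"a": np.sin(x)}, index=x)

# First derivative of each column with respect to the DataFrame's index.
deriv = pd.DataFrame(
    {c: np.gradient(df[c].to_numpy(), df.index.to_numpy()) for c in df.columns},
    index=df.index,
)
# deriv["a"] closely approximates cos(x)
```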

integrate(xrange=None) pandas.DataFrame

Calculate the integrals of the columns of a DataFrame between the specified x range. The integrals are calculated using Simpson’s rule.

Parameters:

xrange (list, optional) – A list of two numbers ([start, end]) that specify the range of array x that corresponds to the lower and upper limit of the integration, respectively. If None or if the start or end values of the xrange list are None, they are set equal to the first and last value of x, respectively.

Returns:

The integrals of each DataFrame’s column.

Return type:

pd.DataFrame
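A minimal sketch of the same computation, slicing the index to the integration range and applying scipy.integrate.simpson column-wise:

```python
import numpy as np
import pandas as pd
from scipy.integrate import simpson

x = np.linspace(0, np.pi, 501)
df = pd.DataFrame({"a": np.sin(x)}, index=x)

# Simpson's-rule integral of each column over the index sub-range [lo, hi].
lo, hi = 0.0, np.pi
sub = df.loc[lo:hi]
areas = {c: simpson(sub[c].to_numpy(), x=sub.index.to_numpy()) for c in sub.columns}
# areas["a"] is very close to 2.0, the exact integral of sin on [0, pi]
```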

interpolate(xnew, kind='cubic') pandas.DataFrame

Interpolate the columns of the DataFrame at the points of the array xnew. The values of the columns of the DataFrame correspond to the values of the DataFrame’s index and the values of xnew become the new index. Both the index of the DataFrame and xnew must have numerical values in ascending order.

Parameters:
  • xnew (np.ndarray) – Array of new x points that the DataFrame’s columns are interpolated at. Must be in ascending order.

  • kind ({'linear', 'cubic'}, optional) – The type of interpolation to be applied. If ‘linear’, a linear function is used to estimate the interpolated values of y. If ‘cubic’ (default), a 3rd-degree polynomial is used to estimate the interpolated values of y. ‘linear’ is faster but may show small (usually negligible) deviations from the original points; ‘cubic’ is more accurate but more computationally expensive.

Returns:

The interpolated DataFrame.

Return type:

pd.DataFrame
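A sketch of the equivalent re-sampling with scipy.interpolate.interp1d (assumed here for illustration; the interpolant pyfasma uses internally may differ):

```python
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

x = np.linspace(0, 10, 50)                     # coarse original grid
df = pd.DataFrame({"a": np.sin(x)}, index=x)
xnew = np.linspace(0, 10, 200)                 # finer target grid

# Re-sample every column onto xnew; xnew becomes the new index.
out = pd.DataFrame(
    {c: interp1d(df.index.to_numpy(), df[c].to_numpy(), kind="cubic")(xnew)
     for c in df.columns},
    index=xnew,
)
```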

normalize(kind='intensity', xrange=None, xval=None) pandas.DataFrame

Normalize the columns of the DataFrame.

Parameters:
  • kind ({'intensity', 'area', 'l1', 'l2', 'minmax', 'mad'}, optional) –

    The type of normalization to use.
    - ‘intensity’: Normalize the array to the maximum value within the x array range or to the value at a given x, depending on whether the xrange or xval parameter is used. The option requires a sorted array x that the y array values correspond to.
    - ‘area’: Normalize the array to the value of the integral between the specified range of an array x. The option requires a sorted array x that the y array values correspond to.
    - ‘l1’: Normalize the array using the L1 norm. For a one-dimensional array (vector) y, the L1 norm is given by the sum of the absolute values of the array’s elements.
    - ‘l2’: Normalize the array using the L2 norm (also known as the Euclidean norm). For a one-dimensional array (vector) y, the L2 norm is given by the square root of the sum of the squared elements of the array.
    - ‘minmax’: Normalize the array to the [0, 1] range using min-max scaling.
    - ‘mad’: Normalize the values of the array by the mean absolute deviation (MAD) of the array. MAD is given by:

    \[\text{MAD} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \mu|\]

    where \(x_i\) is each data point, \(\mu\) is the mean of the data, and \(n\) is the number of data points.

  • xrange (list, optional) – This option has a slightly different meaning depending on whether kind=’intensity’ or kind=’area’ is used. It has no effect in all other cases.
    - With kind=’intensity’: a list of two numbers ([start, end]) that specifies the range of array x whose corresponding maximum value of the y array is used as the normalization coefficient of the y array. If the start or end values are None, they are set equal to the first and last value of x, respectively. Note that xrange and xval are mutually exclusive, so only one can be not None at a time.
    - With kind=’area’: a list of two numbers ([start, end]) that specify the range of array x that corresponds to the lower and upper limits of the integration, respectively. If the start or end values are None, they are set equal to the first and last value of x, respectively.

  • xval ((int, float), optional) – The value of x whose corresponding y array value is used as the normalization coefficient. If the value of xval is not exactly equal to a value of the x array, the x array value immediately before xval will be used. Note that xrange and xval are mutually exclusive, so only one can be not None at a time. This option has effect only if kind=’intensity’ is used.

Returns:

The normalized DataFrame.

Return type:

pd.DataFrame
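For example, kind='l2' divides each column by its Euclidean norm; a pandas-only sketch of that case:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [3.0, 4.0], "b": [1.0, 1.0]})

# L2 normalization: divide each column by the square root of the sum of its
# squared elements, so every column ends up with unit Euclidean norm.
l2 = df / np.sqrt((df ** 2).sum())
```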

smooth(params: list, kind='savgol') pandas.DataFrame

Apply a smoothing filter to the columns of the DataFrame.

Parameters:
  • params (list) – The list of parameters to be used for the filtering method. The parameters depend on the selection of kind.
    - ‘savgol’: requires two parameters: window_length (int), polyorder (int). window_length is the length of the filter window; it must be less than or equal to the length of the DataFrame’s column to which the filter is applied. polyorder is the order of the polynomial used to fit the samples; it must be less than window_length.
    - ‘movav’: requires one parameter: window_length (int). window_length is the length of the filter window; it must be less than or equal to the size of y. Essentially, the moving average filter is a Savitzky-Golay filter of polyorder=0.
    - ‘gauss’: requires one parameter: sigma (scalar). sigma is the standard deviation of the Gaussian kernel.

  • kind ({'savgol', 'movav', 'median', 'gauss'}, optional) – The type of smoothing filter to be applied to the data.
    - ‘savgol’: applies a Savitzky-Golay filter.
    - ‘movav’: applies a moving average filter.
    - ‘median’: applies a median filter.
    - ‘gauss’: applies a 1-D Gaussian filter.

Raises:

ValueError – If kind is not one of the accepted values, or if the length of the params list does not match the selected kind.

Returns:

The DataFrame with smoothed columns.

Return type:

pd.DataFrame
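For kind='savgol' with params=[window_length, polyorder], the operation corresponds to applying scipy.signal.savgol_filter column-wise; a sketch:

```python
import numpy as np
import pandas as pd
from scipy.signal import savgol_filter

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 300)
df = pd.DataFrame({"a": np.sin(x) + 0.05 * rng.standard_normal(300)}, index=x)

# Savitzky-Golay smoothing of each column: fit a polynomial of order
# `polyorder` in a sliding window of `window_length` samples.
window_length, polyorder = 21, 3
smoothed = pd.DataFrame(
    {c: savgol_filter(df[c].to_numpy(), window_length, polyorder) for c in df.columns},
    index=df.index,
)
```

The smoothed columns should track the noise-free signal more closely than the raw data.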

pyfasma.dffuncs.merge_df_list(df_list: list, columns=None, how='inner', interpolate=False, xnew=None, kind='cubic') pandas.DataFrame

Merge the DataFrames in the list. Merging is based on the DataFrames’ indices.

Parameters:
  • df_list (list) – List containing pandas DataFrames to be merged.

  • columns (list, optional) – List of names for the columns of the pandas DataFrames. If None (default), the column names will not be changed.

  • how ({'left', 'right', 'outer', 'inner', 'cross'}, optional) – Type of merge to be performed. - left: use only keys from left frame, similar to a SQL left outer join; preserve key order. - right: use only keys from right frame, similar to a SQL right outer join; preserve key order. - outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically. - inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys. - cross: creates the cartesian product from both frames, preserves the order of the left keys.

  • interpolate (bool, optional) – If True, interpolate the columns of the DataFrame at the points of the array xnew. The values of the columns of the DataFrame correspond to the values of the DataFrame’s index and the values of xnew become the new index. Both the index of the DataFrame and xnew must have numerical values in ascending order.

  • xnew (np.ndarray, optional) – Array of new x points that the DataFrame’s columns are interpolated at. Must be in ascending order. This parameter only has effect if interpolate=True.

  • kind ({'linear', 'cubic'}, optional) – The type of interpolation to be applied. If ‘linear’, a linear function is used to estimate the interpolated values of y. If ‘cubic’ (default), a 3rd-degree polynomial is used to estimate the interpolated values of y. ‘linear’ is faster but may show small (usually negligible) deviations from the original points; ‘cubic’ is more accurate but more computationally expensive. This parameter only has effect if interpolate=True.

Raises:

ValueError – If not all items in df_list are of pd.DataFrame type. If xnew is None when interpolate=True. If xnew is not of type np.ndarray or list when interpolate=True.

Returns:

A DataFrame of the merged DataFrames in the list.

Return type:

pd.DataFrame
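Without interpolation, index-based merging of a list reduces to repeated pd.DataFrame.merge calls on the indices; a sketch of the core behaviour:

```python
import functools
import pandas as pd

df1 = pd.DataFrame({"s1": [1.0, 2.0, 3.0]}, index=[400, 401, 402])
df2 = pd.DataFrame({"s2": [4.0, 5.0]}, index=[401, 402])

# Successive index-based merges; how='inner' keeps only the index
# values shared by all DataFrames in the list.
merged = functools.reduce(
    lambda left, right: left.merge(
        right, left_index=True, right_index=True, how="inner"
    ),
    [df1, df2],
)
```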