mydatapreprocessing.feature_engineering package

Extract new features from available data.

You can add new derived columns. This newly generated data can help machine learning models achieve better results.

In add_derived_columns you can add first and second differences, products of columns, rolling means and rolling standard deviations.

In add_frequency_columns you can add the maxima of fast Fourier transform results computed on a running window.

mydatapreprocessing.feature_engineering.add_derived_columns(data: pd.DataFrame, differences: bool = True, second_differences: bool = True, multiplications: bool = True, rolling_means: int | None = 10, rolling_stds: int | None = 10, mean_distances: bool = True) → pd.DataFrame

Create many columns with new information about the dataset.

Add data such as differences, rolling means or distances from the average. Computed columns are appended to the original data. All columns are processed, so a lot of redundant data is created. It is necessary to do some feature extraction afterwards to remove non-correlated columns.

Note

The output has fewer rows than the input, because a rolling window needs a full window of preceding values before it can produce its first result.

Parameters:
  • data (pd.DataFrame) – Data that we want to extract more information from.
  • differences (bool, optional) – Compute the difference between sample n and sample n-1. Defaults to True.
  • second_differences (bool, optional) – Compute second difference. Defaults to True.
  • multiplications (bool, optional) – Multiply each column by the other columns. Defaults to True.
  • rolling_means (int | None, optional) – Rolling mean with defined window. Defaults to 10.
  • rolling_stds (int | None, optional) – Rolling std with defined window. Defaults to 10.
  • mean_distances (bool, optional) – Distance from average. Defaults to True.
Returns:

Data with additional columns that can carry more information than the original data. The number of rows can be slightly smaller. The data has the same type as the input.

Return type:

pd.DataFrame

Example

>>> import pandas as pd
>>> import mydatapreprocessing as mdp
>>> from mydatapreprocessing.feature_engineering import add_derived_columns
>>> data = pd.DataFrame(
...     [mdp.datasets.sin(n=30), mdp.datasets.ramp(n=30)]
... ).T
>>> extended = add_derived_columns(data, differences=True, rolling_means=10)
>>> extended.columns
Index([                      0,                       1,
              '0 - Difference',        '1 - Difference',
       '0 - Second difference', '1 - Second difference',
        'Multiplicated (0, 1)',      '0 - Rolling mean',
            '1 - Rolling mean',       '0 - Rolling std',
             '1 - Rolling std',     '0 - Mean distance',
           '1 - Mean distance'],
      dtype='object')
>>> len(extended)
21
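
The length of 21 in the example above is consistent with a rolling window of 10 applied to 30 samples: the first 9 rows have no complete window. The following is a minimal sketch in plain pandas, not the library's implementation. The stand-in data and the definition of "mean distance" as value minus column mean are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 30 samples of a sine wave and a ramp.
n = 30
data = pd.DataFrame({
    0: np.sin(np.linspace(0, 2 * np.pi, n)),
    1: np.arange(n, dtype=float),
})

window = 10
derived = pd.DataFrame({
    "0 - Difference": data[0].diff(),
    "0 - Second difference": data[0].diff().diff(),
    "0 - Rolling mean": data[0].rolling(window).mean(),
    "0 - Rolling std": data[0].rolling(window).std(),
    # Assumption: distance from average means value minus column mean.
    "0 - Mean distance": data[0] - data[0].mean(),
})

# Rows without a full window contain NaN; dropping them shortens the output.
extended = pd.concat([data, derived], axis=1).dropna()
print(len(extended))  # 30 - (window - 1) = 21
```

The rolling columns determine the number of lost rows here, because the window of 10 removes more leading rows than the first or second difference does.
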
mydatapreprocessing.feature_engineering.add_frequency_columns(data: pd.DataFrame | np.ndarray, window: int) → pd.DataFrame

Use a Fourier transform on a running window and add its maximum and standard deviation as new data columns.

Parameters:
  • data (pd.DataFrame | np.ndarray) – Data we want to use.
  • window (int) – Length of the running window.
Returns:

Data with new columns that contain the results of the running frequency analysis.

Return type:

pd.DataFrame

Example

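>>> import pandas as pd
>>> import mydatapreprocessing as mdp
>>> from mydatapreprocessing.feature_engineering import add_frequency_columns
>>> data = pd.DataFrame(
...     [mdp.datasets.sin(n=100), mdp.datasets.ramp(n=100)]
... ).T
>>> extended = add_frequency_columns(data, window=32)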
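
To illustrate the idea of a running-window frequency analysis, here is a minimal NumPy sketch. It is not the library's implementation: the helper name `rolling_fft_features` is hypothetical, and the choice of the magnitude spectrum from `np.fft.rfft` is an assumption.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling_fft_features(signal, window):
    """Hypothetical helper: max and std of the FFT magnitude per window."""
    # Shape (len(signal) - window + 1, window): one row per window position.
    windows = sliding_window_view(np.asarray(signal, dtype=float), window)
    # Magnitude spectrum of each window (assumption: real-input FFT).
    spectrum = np.abs(np.fft.rfft(windows, axis=1))
    return spectrum.max(axis=1), spectrum.std(axis=1)


signal = np.sin(np.linspace(0, 20 * np.pi, 100))
fft_max, fft_std = rolling_fft_features(signal, window=32)
print(fft_max.shape)  # (69,) because 100 - 32 + 1 windows fit
```
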
mydatapreprocessing.feature_engineering.keep_correlated_data(data: DataFrameOrArrayGeneric, threshold: float = 0.5) → DataFrameOrArrayGeneric

Remove columns that are not sufficiently correlated with the predicted column.

The predicted column is assumed to be column 0.

Parameters:
  • data (DataFrameOrArrayGeneric) – Time series data.
  • threshold (float, optional) – After the correlation matrix is evaluated, all columns whose correlation is below the threshold are deleted. Defaults to 0.5.
Returns:

Data without the columns that are not correlated with the predicted column. If the input is a numpy array, the output is also an array; if the input is a DataFrame, the output is a DataFrame.

Return type:

DataFrameOrArrayGeneric
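
The filtering described above can be sketched with plain pandas. This is an illustration under stated assumptions, not the library's implementation: the helper name `keep_correlated_sketch` is hypothetical, and comparing the absolute value of the Pearson correlation against the threshold is an assumption.

```python
import numpy as np
import pandas as pd


def keep_correlated_sketch(data, threshold=0.5):
    """Hypothetical helper: keep columns correlated with the first column."""
    # Correlation of every column with the first (predicted) column.
    corr_with_target = data.corr().iloc[:, 0].abs()
    # Assumption: strong negative correlation also counts as correlated.
    mask = (corr_with_target >= threshold).to_numpy()
    return data.loc[:, mask]


x = np.arange(10, dtype=float)
frame = pd.DataFrame({
    "target": x,
    "scaled": 2 * x + 1,                      # perfectly correlated
    "alternating": np.tile([1.0, -1.0], 5),   # weakly correlated
})

filtered = keep_correlated_sketch(frame, threshold=0.5)
print(list(filtered.columns))  # ['target', 'scaled']
```
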