mydatapreprocessing.preprocessing package

Subpackage for data preprocessing.

Preprocessing covers, for example, standardization, data smoothing, outlier removal, and binning.
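To make two of these terms concrete, here is a minimal sketch of outlier removal followed by standardization in plain NumPy. The helpers `remove_outliers` and `standardize` below are hypothetical illustrations written for this example; they are not the package's own functions and may differ from them in signature and behavior.

```python
import numpy as np


def remove_outliers(values: np.ndarray, threshold: float = 3.0) -> np.ndarray:
    """Drop values farther than `threshold` standard deviations from the mean."""
    mask = np.abs(values - values.mean()) <= threshold * values.std()
    return values[mask]


def standardize(values: np.ndarray) -> np.ndarray:
    """Rescale values to zero mean and unit standard deviation."""
    return (values - values.mean()) / values.std()


data = np.array([1.0, 2.0, 3.0, 4.0, 500.0])

# With a threshold of one standard deviation, only the extreme value is dropped.
cleaned = remove_outliers(data, threshold=1.0)

# The remaining values are then rescaled.
scaled = standardize(cleaned)
```

Running outlier removal first keeps a single extreme value from distorting the mean and standard deviation used for scaling.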

There are many small functions that you can use separately, but there is also a main function, preprocess_data, that calls all of them for you based on the input parameters. For inverse preprocessing, use preprocess_data_inverse.

Functions are available for pd.DataFrame as well as numpy arrays. The output is usually of the same type as the input. Functions can be used in place, or a copy can be created.
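The following sketch illustrates what "same type as the input" and "in place or copy" can look like in practice. The `center` helper and its `inplace` flag are assumptions made up for this illustration; the package's functions may expose this choice differently.

```python
import numpy as np
import pandas as pd


def center(data, inplace=False):
    """Subtract column means and return the same container type as the input."""
    # Work on the caller's object only when explicitly asked to.
    target = data if inplace else data.copy()
    means = target.mean(axis=0)
    if isinstance(target, pd.DataFrame):
        for column in target.columns:
            target[column] = target[column] - means[column]
    else:
        # Assumes a float array, so the in-place subtraction needs no casting.
        target -= means
    return target


array = np.array([[1.0, 10.0], [3.0, 30.0]])
array_result = center(array)  # a new array; `array` itself is unchanged

frame = pd.DataFrame({"a": [1.0, 3.0], "b": [10.0, 30.0]})
frame_result = center(frame, inplace=True)  # `frame` itself is modified
```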

Note

Many functions rely on a main column to work correctly. This column is expected to be at index 0, i.e. the first column. If you use consolidation, set the first_column param; otherwise, call move_on_first_column manually.
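As an illustration of the expected layout, the main column can be moved to index 0 with plain pandas. The `move_column_to_front` helper below is a hypothetical stand-in written for this sketch; it is not the package's move_on_first_column, whose exact signature is not shown here.

```python
import pandas as pd


def move_column_to_front(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a new DataFrame with `column` first and the others in their original order."""
    remaining = [name for name in df.columns if name != column]
    return df[[column] + remaining]


df = pd.DataFrame(
    {"temperature": [20, 21], "humidity": [50, 55], "target": [1, 2]}
)

# "target" plays the role of the main (predicted) column here.
reordered = move_column_to_front(df, "target")
```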

mydatapreprocessing.preprocessing.preprocess_data(data: DataFrameOrArrayGeneric, config: PreprocessingConfig = <mydatapreprocessing.preprocessing.preprocessing_config.preprocessing_config_internal.PreprocessingConfig object>) → tuple[DataFrameOrArrayGeneric, InversePreprocessingConfig]

Main preprocessing function that calls other functions based on configuration.

It is mainly intended for preparing data as input for machine learning models.

Parameters:
  • data (DataFrameOrArrayGeneric) – Input data that we want to preprocess.
  • config (PreprocessingConfig) – Configuration of the data preprocessing. It is documented in the preprocessing_config subpackage. You can import and edit default_preprocessing_config. Intellisense and static type analysis should work. Defaults to default_preprocessing_config.
Returns:

If the input is a numpy array, the output is also an array; if the input is a DataFrame, the output is a DataFrame.

Return type:

PreprocessedData

Example

>>> import numpy as np
>>> import pandas as pd
>>> import mydatapreprocessing.preprocessing as mdpp
>>> df = pd.DataFrame(
...     np.array([range(5), range(20, 25), np.random.randn(5)]).astype("float32").T,
... )
>>> df.iloc[2, 0] = 500
...
>>> config = mdpp.preprocessing_config.default_preprocessing_config.do.copy()
>>> config.do.update({"remove_outliers": 1, "difference_transform": True, "standardize": "standardize"})

The predicted column is normally moved to index 0, but for testing purposes a different one is used here.

>>> processed_df, inverse_preprocessing_config_df = mdpp.preprocess_data(df, config)
>>> processed_df
          0         1         2
1 -0.707107 -0.707107  0.377062
3  1.414214  1.414214  0.991879
4 -0.707107 -0.707107 -1.368941

Inverse preprocessing is done for just one column: the one at index 0. You can use first_column in consolidation to move the column to that index.

>>> inverse_preprocessing_config_df.difference_transform = df.iloc[0, 0]
>>> inverse_processed_df = mdpp.preprocess_data_inverse(
...     processed_df.iloc[:, 0].values, inverse_preprocessing_config_df
... )
>>> np.allclose(inverse_processed_df, np.array([1, 3, 4]))
True

mydatapreprocessing.preprocessing.preprocess_data_inverse(data: pd.DataFrame | np.ndarray, config: InversePreprocessingConfig) → np.ndarray

Undo all data preprocessing to get real data.

This does not invert all the columns, only the defined one, and only that predicted column is returned. The steps are applied in the reverse order of preprocessing. The output is a numpy array.
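Why the order has to be reversed can be seen in a small NumPy sketch. It assumes a pipeline of differencing followed by standardization, and it is an illustration of the principle only, not the package's internal implementation.

```python
import numpy as np

original = np.array([1.0, 3.0, 4.0, 8.0])

# Forward pass: difference first, then standardize.
first_value = original[0]  # needed later to undo the differencing
differenced = np.diff(original)
mean, std = differenced.mean(), differenced.std()
standardized = (differenced - mean) / std

# Inverse pass: undo the last step first, then the earlier one.
restored_differences = standardized * std + mean
restored = first_value + np.cumsum(restored_differences)

# Differencing drops the first row, so only the remaining values come back.
matches_original = np.allclose(restored, original[1:])
```

The first value of the original series is the one piece of information that differencing loses, which is why the inverse step needs it to be supplied separately.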

Parameters:
  • data (pd.DataFrame | np.ndarray) – One-dimensional (single-column) preprocessed data. Do not use ndim > 1.
  • config (InversePreprocessingConfig) – Data necessary for the inverse transformation. It does not need to be configured manually, because it is returned from preprocess_data.
Returns:

Inverse preprocessed data.

Return type:

np.ndarray

Example

>>> import pandas as pd
>>> import numpy as np
>>> import mydatapreprocessing.preprocessing as mdpp
...
>>> df = pd.DataFrame(
...     np.array([range(5), range(20, 25), np.random.randn(5)]).astype("float32").T,
... )
>>> preprocessed, inverse_config = mdpp.preprocess_data(df.values)
>>> preprocessed
array([[-1.4142135 , -1.4142135 ,  0.1004863 ],
       [-0.70710677, -0.70710677,  0.36739323],
       [ 0.        ,  0.        , -1.1725829 ],
       [ 0.70710677,  0.70710677,  1.6235067 ],
       [ 1.4142135 ,  1.4142135 , -0.9188035 ]], dtype=float32)
>>> mdpp.preprocess_data_inverse(preprocessed[:, 0], inverse_config)
array([0., 1., 2., 3., 4.], dtype=float32)