mydatapreprocessing.consolidation package

Consolidate data. Consolidation means that the output is standardized, so you know it will work in your algorithms even when the data are not known beforehand. It includes, for example, shape verification, string embedding, setting a datetime index, resampling, and NaN cleaning.

You can consolidate data with consolidate_data and prepare it, for example, for machine learning models.

There are many small functions in consolidation_functions that you can use separately, but there is also a main pipeline function, consolidate_data, that calls all of these functions for you based on the config.

Functions usually work with a DataFrame, because consolidation is the first phase of data preparation and column names are still important here.

Many functions have an ‘inplace’ parameter. When it is set, the function changes your original data. The syntax differs a bit from what you might expect, because the function returns the data anyway, so use for example df = consolidation_function(df, inplace=True)
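The calling convention described above can be illustrated with a minimal sketch. The helper fill_missing_with_zero below is hypothetical and is not part of mydatapreprocessing; it only mimics the pattern of a function that returns the data regardless of ‘inplace’, written with plain pandas.

```python
import pandas as pd


def fill_missing_with_zero(df: pd.DataFrame, inplace: bool = False) -> pd.DataFrame:
    """Hypothetical helper (not a mydatapreprocessing function)."""
    # Work on the caller's object only when inplace=True; otherwise copy first.
    target = df if inplace else df.copy()
    # Column-wise assignment modifies `target` itself.
    for column in target.columns:
        target[column] = target[column].fillna(0)
    # The data is returned in both modes, so `df = f(df, inplace=True)` works.
    return target


original = pd.DataFrame({"a": [1.0, None, 3.0]})

# Default mode: the caller's DataFrame is left untouched.
copied = fill_missing_with_zero(original)
nans_after_copy = int(original["a"].isna().sum())

# In-place mode: the same object is modified and also returned.
same = fill_missing_with_zero(original, inplace=True)
```

Assigning the return value in both modes keeps calling code uniform, which is presumably why the package recommends the df = consolidation_function(df, inplace=True) form.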

mydatapreprocessing.consolidation.consolidate_data(data: DataFrameOrArrayGeneric, config: mydatapreprocessing.consolidation.consolidation_config.consolidation_config_internal.ConsolidationConfig = <mydatapreprocessing.consolidation.consolidation_config.consolidation_config_internal.ConsolidationConfig object>) → pandas.core.frame.DataFrame[source]

Transform input data in various formats and shapes into data in a defined shape.

This can be beneficial, for example, in machine learning. If you have data in a format other than DataFrame, use load_data first.

Note

This function returns only numeric data with the default config. All string columns will be removed (use embedding if you need them).
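To make the effect of the note concrete, here is a rough sketch in plain pandas of what "numeric only" means and how string information can be kept by encoding it first. It does not call consolidate_data and is not how the package implements these steps; the column names and the choice of one-hot encoding are assumptions made for illustration only.

```python
import pandas as pd

df = pd.DataFrame({"value": [1.5, 2.5, 3.5], "label": ["a", "b", "a"]})

# Keeping numeric columns only drops the string column "label".
numeric_only = df.select_dtypes(include="number")

# One simple way to preserve the information: one-hot encode the strings
# so that every resulting column is numeric.
embedded = pd.get_dummies(df, columns=["label"], dtype=float)
```

Which embedding the package applies is controlled by its config, so check the consolidation_config documentation before relying on a specific encoding.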

Parameters:
  • data (DataFrameOrArrayGeneric) – Input data.
  • config (ConsolidationConfig) – Configuration of the data consolidation. It’s documented in the consolidation_config subpackage. You can import and edit default_consolidation_config. Intellisense and static type analysis should work. Defaults to default_consolidation_config.
Raises:

KeyError, TypeError – May be raised if the parameters are wrong, e.g. if the predicted column name is not found in the DataFrame.

Returns:

Data in standardized form.

Return type:

pd.DataFrame

Example

>>> import mydatapreprocessing.consolidation as mdpc
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(2)
...
>>> df = pd.DataFrame(
...    np.array([range(4), range(20, 24), np.random.randn(4)]).T,
...    columns=["Column", "First", "Random"],
... )
...
>>> df.iloc[2, 0] = np.nan
>>> df
   Column  First    Random
0     0.0   20.0 -0.416758
1     1.0   21.0 -0.056267
2     NaN   22.0 -2.136196
3     3.0   23.0  1.640271
>>> config = mdpc.consolidation_config.default_consolidation_config.do.copy()
>>> config.first_column = "First"
>>> config.datetime.datetime_column = None
>>> config.remove_missing_values.remove_all_column_with_nans_threshold = 0.6
...
>>> mdpc.consolidate_data(df, config)
   First  Column    Random
0   20.0     0.0 -0.416758
1   21.0     1.0 -0.056267
2   22.0     2.0 -2.136196
3   23.0     3.0  1.640271