mydatapreprocessing.consolidation.consolidation_functions package

Functions that are usually used from ‘consolidation_pipeline’. Of course, you can also use them separately.

mydatapreprocessing.consolidation.consolidation_functions.categorical_embedding(data: pandas.core.frame.DataFrame, embedding: typing_extensions.Literal['label', 'one-hot'] = 'label', unique_threshold: Union[float, int, numpy.number] = 0.6, inplace=False) → pandas.core.frame.DataFrame[source]

Transform string categories such as ‘US’, ‘FR’ into numeric values.

This is necessary, for example, in machine learning models.

Parameters:
  • data (pd.DataFrame) – Data with string (pandas Object dtype) columns.
  • embedding ("label", "one-hot", optional) – Categorical encoding that creates numbers from strings. ‘label’ gives each category (unique string) a concrete number, so the result has the same number of columns. ‘one-hot’ creates a new column for every category. Only columns where strings repeat often enough (see unique_threshold) are used. Defaults to “label”.
  • unique_threshold (Numeric, optional) – Remove string columns that have too many categories (ids, hashes etc.). E.g. 0.9 means that in a column of length 100, at most 10 categories may occur for the column not to be deleted (90% of values must be repeating, non-unique). Min is 0, max is 1. Defaults to 0.6.
  • inplace (bool, optional) – If True, the original data are edited. If False, a copy is created. Defaults to False.
Returns:

DataFrame where string columns transformed to numeric.

Return type:

pd.DataFrame

Raises:

TypeError – If there is unhashable object in values for example.

Example

>>> df = pd.DataFrame(["One", "Two", "One", "Three", "One"])
>>> categorical_embedding(df, embedding="label", unique_threshold=0.1)
   0
0  0
1  2
2  0
3  1
4  0
>>> categorical_embedding(df, embedding="one-hot", unique_threshold=0.1)
   One  Three  Two
0    1      0    0
1    0      0    1
2    1      0    0
3    0      1    0
4    1      0    0
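
Both encodings above can be approximated with plain pandas. The sketch below is illustrative only, not the library's actual implementation; note that pd.factorize numbers categories by first appearance, whereas the ‘label’ output above suggests the library numbers them in sorted order.

```python
import pandas as pd

df = pd.DataFrame({"country": ["US", "FR", "US", "DE", "US"]})

# Label encoding: pd.factorize maps each unique string to an integer,
# numbered by first appearance ("US" -> 0, "FR" -> 1, "DE" -> 2).
codes, uniques = pd.factorize(df["country"])
print(codes.tolist())  # [0, 1, 0, 2, 0]

# One-hot encoding: pd.get_dummies creates one indicator column per
# category, with columns in sorted order.
one_hot = pd.get_dummies(df["country"]).astype(int)
print(list(one_hot.columns))  # ['DE', 'FR', 'US']
```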
mydatapreprocessing.consolidation.consolidation_functions.cast_str_to_numeric(df: pandas.core.frame.DataFrame, on_error: typing_extensions.Literal['ignore', 'raise'] = 'ignore') → pandas.core.frame.DataFrame[source]

Convert string values in the DataFrame to numeric where possible.

Parameters:
  • df (pd.DataFrame) – Data.
  • on_error (Literal["ignore", "raise"]) – What to do when a conversion error occurs. Defaults to ‘ignore’.
Returns:

Data with possibly converted types.

Return type:

pd.DataFrame
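
A minimal pandas-only sketch of what such a conversion can look like. The helper name and the per-column try/except strategy are assumptions for illustration, not the library's actual code:

```python
import pandas as pd

def cast_str_to_numeric_sketch(df: pd.DataFrame, on_error: str = "ignore") -> pd.DataFrame:
    """Try to convert each column to a numeric dtype, column by column."""
    result = df.copy()
    for col in result.columns:
        try:
            result[col] = pd.to_numeric(result[col])
        except (ValueError, TypeError):
            if on_error == "raise":
                raise
            # on_error == "ignore": leave the column unchanged
    return result

df = pd.DataFrame({"a": ["1", "2"], "b": ["x", "y"]})
converted = cast_str_to_numeric_sketch(df)
print(converted.dtypes)  # "a" becomes int64, "b" stays object
```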

mydatapreprocessing.consolidation.consolidation_functions.check_shape_and_transform(data: DataFrameOrArrayGeneric, inplace=False) → DataFrameOrArrayGeneric[source]

Check whether input data has expected shape.

Some functions expect data of shape (n_samples, n_features). If this is not the case, the data are transposed and the change is logged.

Parameters:
  • data (DataFrameOrArrayGeneric) – Input data.
  • inplace (bool, optional) – If True, the original data are edited. If False, a copy is created. Defaults to False.
Returns:

Data with verified shape.

Return type:

DataFrameOrArrayGeneric

Example

>>> import numpy as np
>>> data = np.array([range(10), range(10)])
>>> data.shape
(2, 10)
>>> data = check_shape_and_transform(data)
>>> data.shape
(10, 2)
mydatapreprocessing.consolidation.consolidation_functions.infer_frequency(df: pandas.core.frame.DataFrame, on_error: typing_extensions.Literal[None, 'warn', 'raise'] = 'warn', inplace=False) → pandas.core.frame.DataFrame[source]

When the DataFrame has a datetime index, try to infer its frequency.

Parameters:
  • df (pd.DataFrame) – Input data.
  • on_error (Literal[None, "warn", "raise"]) – What to do when the frequency cannot be inferred. Defaults to “warn”.
  • inplace (bool, optional) – If True, the original data are edited. If False, a copy is created. Defaults to False.
Raises:

ValueError – If the frequency cannot be inferred and on_error is ‘raise’.

Returns:

Data with datetime index.

Return type:

pd.DataFrame

Example

>>> df = pd.DataFrame([[1], [2], [3]], index=["08/04/2022", "09/04/2022", "10/04/2022"])
>>> df.index = pd.to_datetime(df.index)
>>> df = infer_frequency(df)
>>> df.index.freq
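
Under the hood this plausibly relies on pandas' own frequency inference; a pure-pandas sketch of the same idea:

```python
import pandas as pd

# pd.infer_freq guesses a frequency string from a DatetimeIndex.
index = pd.to_datetime(["2022-04-08", "2022-04-09", "2022-04-10"])
freq = pd.infer_freq(index)
print(freq)  # 'D' (daily)

# The inferred frequency can then be attached to the index.
df = pd.DataFrame({"value": [1, 2, 3]}, index=index)
df.index.freq = freq
```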
mydatapreprocessing.consolidation.consolidation_functions.move_on_first_column(df: pandas.core.frame.DataFrame, name_or_index: Union[str, int, pandas.core.indexes.base.Index, numpy.integer]) → pandas.core.frame.DataFrame[source]

Move the defined column to index 0.

A use case is, for example, making the column clearly visible in a generated table.

Parameters:
  • df (pd.DataFrame) – Input data.
  • name_or_index (PandasIndex) – Index or name of the column that will be moved.
Raises:

KeyError – Defined column not found in data.

Returns:

DataFrame with defined column at index 0.

Return type:

pd.DataFrame

Example

>>> move_on_first_column(pd.DataFrame([[1, 2, 3]], columns=["One", "Two", "Three"]), "Two").columns
Index(['Two', 'One', 'Three']...
mydatapreprocessing.consolidation.consolidation_functions.remove_nans(data: DataFrameOrArrayGeneric, remove_all_column_with_nans_threshold: None | Numeric = None, remove_nans_type: None | Literal['interpolate', 'mean', 'neighbor', 'remove'] | Any = None, inplace: bool = False) → DataFrameOrArrayGeneric[source]

Remove NaN (Not a Number) values.

Columns with too many NaN values are dropped. Then, in the remaining columns, rows with NaNs are removed or the NaNs are interpolated.

Parameters:
  • data (DataFrameOrArrayGeneric) – Data in shape (n_samples, n_features).
  • remove_all_column_with_nans_threshold (None | Numeric, optional) – From 0 to 1. Require this fraction of non-NaN numeric values in a column for it not to be deleted. E.g. a value of 0.9 in a column of 10 values means 90% must be numeric, so at most 1 np.nan may be present, otherwise the column is deleted. Defaults to None.
  • remove_nans_type (None | Literal["interpolate", "mean", "neighbor", "remove"] | Any, optional) – Remove or replace the remaining NaN values. To replace them with a concrete value, pass the value directly. Defaults to None.
  • inplace (bool, optional) – If True, the original data are edited. If False, a copy is created. Defaults to False.

Example

>>> import numpy as np
...
>>> array = np.array([[1, 2, np.nan], [2, np.nan, np.nan], [3, 4, np.nan]])
>>> array
array([[ 1.,  2., nan],
       [ 2., nan, nan],
       [ 3.,  4., nan]])
>>> cleaned_df = remove_nans(
...     array,
...     remove_all_column_with_nans_threshold=0.5,
...     remove_nans_type="interpolate"
... )
>>> cleaned_df
array([[1., 2.],
       [2., 3.],
       [3., 4.]])
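
To replace the remaining NaNs with a concrete value instead of interpolating, the value can reportedly be passed directly as remove_nans_type. A pandas-only sketch of the equivalent two-step logic (threshold drop, then fill), assuming the threshold is the required fraction of non-NaN values:

```python
import numpy as np
import pandas as pd

array = np.array([[1, 2, np.nan], [2, np.nan, np.nan], [3, 4, np.nan]])
df = pd.DataFrame(array)

# Keep only columns where at least 50% of the values are non-NaN.
keep = df.notna().mean() >= 0.5
df = df.loc[:, keep]

# Replace the remaining NaNs with a concrete value (here 0), analogous
# to calling remove_nans(array, remove_nans_type=0).
df = df.fillna(0)
print(df.to_numpy())  # [[1. 2.] [2. 0.] [3. 4.]]
```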
mydatapreprocessing.consolidation.consolidation_functions.resample(df: pd.DataFrame, freq: Literal['S', 'min', 'H', 'M', 'Y'] | str, resample_function: Literal['sum', 'mean'])[source]

Change the sampling frequency.

Parameters:
  • df (pd.DataFrame) – Input data.
  • freq (Literal["S", "min", "H", "M", "Y"] | str) – Frequency of resampled data. For possible options check pandas ‘Offset aliases’.
  • resample_function (Literal['sum', 'mean'], optional) – Whether to sum the resampled values or to average them. Defaults to ‘sum’.
Returns:

Resampled data.

Return type:

pd.DataFrame

Example

>>> from datetime import datetime, timedelta
...
>>> df = pd.DataFrame(
...     {
...         "date": [
...             datetime(2022, 1, 1),
...             datetime(2022, 1, 2),
...             datetime(2022, 2, 1),
...             datetime(2022, 4, 1)
...         ],
...         "col_1": [1] * 4,
...         "col_2": [2] * 4,
...     }
... )
>>> df
        date  col_1  col_2
0 2022-01-01      1      2
1 2022-01-02      1      2
2 2022-02-01      1      2
3 2022-04-01      1      2
>>> df = df.set_index("date")
>>> df = resample(df, "M", "sum")
>>> df
            col_1  col_2
date
2022-01-31      2      4
2022-02-28      1      2
2022-03-31      0      0
2022-04-30      1      2
>>> df.index.freq
<MonthEnd>
mydatapreprocessing.consolidation.consolidation_functions.set_datetime_index(df: pandas.core.frame.DataFrame, name_or_index: Union[str, int, pandas.core.indexes.base.Index, numpy.integer], on_error: typing_extensions.Literal['ignore', 'raise'] = 'ignore', inplace: bool = False) → pandas.core.frame.DataFrame[source]

Set the defined column as index and convert it to datetime.

Parameters:
  • df (pd.DataFrame) – Input data.
  • name_or_index (PandasIndex) – Name or index of the datetime column that will be set as index.
  • on_error (Literal["ignore", "raise"]) – What to do if converting to datetime fails. Defaults to “ignore”.
  • inplace (bool, optional) – If True, the original data are edited. If False, a copy is created. Defaults to False.
Raises:

ValueError – If defined column failed to convert to datetime.

Returns:

Data with datetime index.

Return type:

pd.DataFrame

Example

>>> from datetime import datetime
...
>>> df = pd.DataFrame(
...     {
...         "col_1": [1] * 3,
...         "col_2": [2] * 3,
...         "date": [
...             datetime(2022, 1, 1),
...             datetime(2022, 2, 1),
...             datetime(2022, 3, 1),
...         ],
...     }
... )
>>> df = set_datetime_index(df, 'date', inplace=True)
>>> isinstance(df.index, pd.DatetimeIndex)
True