mydatapreprocessing.consolidation.consolidation_functions package¶
Functions that are usually used via ‘consolidation_pipeline’. Of course, you can also use them separately.
-
mydatapreprocessing.consolidation.consolidation_functions.
categorical_embedding
(data: pandas.core.frame.DataFrame, embedding: typing_extensions.Literal['label', 'one-hot'] = 'label', unique_threshold: Union[float, int, numpy.number] = 0.6, inplace=False) → pandas.core.frame.DataFrame[source]¶ Transform string categories such as ‘US’, ‘FR’ into numeric values.
This is necessary, for example, in machine learning models.
Parameters: - data (pd.DataFrame) – Data with string (pandas Object dtype) columns.
- embedding ("label", "one-hot", optional) – Categorical encoding method that creates numbers from strings. ‘label’ assigns each category (unique string) a concrete number; the result has the same number of columns. ‘one-hot’ creates a new column for every category. Only columns where strings repeat sufficiently (see unique_threshold) are used. Defaults to “label”.
- unique_threshold (Numeric, optional) – Removes string columns that have too many categories (IDs, hashes etc.). E.g. 0.9 defines that in a column of length 100, the maximum number of categories for the column not to be deleted is 10 (90% non-unique, repeating values). Min is 0, max is 1. Defaults to 0.6.
- inplace (bool, optional) – If True, the original data are edited in place. If False, a copy is created. Defaults to False.
Returns: DataFrame with string columns transformed to numeric.
Return type: pd.DataFrame
Raises: TypeError
– If, for example, there is an unhashable object in the values.
Example
>>> df = pd.DataFrame(["One", "Two", "One", "Three", "One"])
>>> categorical_embedding(df, embedding="label", unique_threshold=0.1)
   0
0  0
1  2
2  0
3  1
4  0
>>> categorical_embedding(df, embedding="one-hot", unique_threshold=0.1)
   One  Three  Two
0    1      0    0
1    0      0    1
2    1      0    0
3    0      1    0
4    1      0    0
-
mydatapreprocessing.consolidation.consolidation_functions.
cast_str_to_numeric
(df: pandas.core.frame.DataFrame, on_error: typing_extensions.Literal['ignore', 'raise'] = 'ignore') → pandas.core.frame.DataFrame[source]¶ Convert string values in a DataFrame to numeric types.
Parameters: - df (pd.DataFrame) – Data
- on_error (Literal["ignore", "raise"]) – What to do when an error is encountered. Defaults to ‘ignore’.
Returns: Data with possibly converted types.
Return type: pd.DataFrame
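The original entry has no example, so here is a minimal sketch of the presumed behavior built on plain pandas (`pd.to_numeric` applied per column); this is an assumption about the mechanism, not the library's actual implementation:

```python
import pandas as pd

def cast_column(col: pd.Series) -> pd.Series:
    # Try to cast one column to a numeric dtype; on failure return it
    # unchanged, mimicking the on_error="ignore" behavior.
    try:
        return pd.to_numeric(col)
    except (ValueError, TypeError):
        return col

df = pd.DataFrame({"a": ["1", "2", "3"], "b": ["x", "y", "z"]})
converted = df.apply(cast_column)

print(converted["a"].dtype)  # numeric dtype: int64
print(converted["b"].dtype)  # object: "x", "y", "z" cannot be cast
```

`cast_str_to_numeric(df, on_error="ignore")` is expected to behave similarly; with `on_error="raise"`, the conversion error would presumably propagate instead of being swallowed.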
-
mydatapreprocessing.consolidation.consolidation_functions.
check_shape_and_transform
(data: DataFrameOrArrayGeneric, inplace=False) → DataFrameOrArrayGeneric[source]¶ Check whether input data has expected shape.
Some functions work with a defined shape of data, (n_samples, n_features). If the data are not in this shape, the function transposes them and logs that it happened.
Parameters: - data (DataFrameOrArrayGeneric) – Input data.
- inplace (bool, optional) – If True, the original data are edited in place. If False, a copy is created. Defaults to False.
Returns: Data with verified shape.
Return type: DataFrameOrArrayGeneric
Example
>>> import numpy as np
>>> data = np.array([range(10), range(10)])
>>> data.shape
(2, 10)
>>> data = check_shape_and_transform(data)
>>> data.shape
(10, 2)
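The transpose logic behind the example above can be sketched as follows. The heuristic (fewer rows than columns means the axes are swapped) is an assumption about the implementation; the real function also handles DataFrames and logs when a transpose occurs:

```python
import numpy as np

def ensure_samples_first(data: np.ndarray) -> np.ndarray:
    # Sketch: if a 2D array has fewer rows than columns, assume the
    # axes are swapped and transpose to (n_samples, n_features).
    if data.ndim == 2 and data.shape[0] < data.shape[1]:
        data = data.T
    return data

arr = np.array([range(10), range(10)])
print(arr.shape)                        # (2, 10)
print(ensure_samples_first(arr).shape)  # (10, 2)
```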
-
mydatapreprocessing.consolidation.consolidation_functions.
infer_frequency
(df: pandas.core.frame.DataFrame, on_error: typing_extensions.Literal[None, 'warn', 'raise'] = 'warn', inplace=False) → pandas.core.frame.DataFrame[source]¶ When a DataFrame has a datetime index, try to infer its frequency.
Parameters: - df (pd.DataFrame) – Input data.
- on_error (Literal[None, "warn", "raise"]) – Defines what to do when the frequency cannot be inferred. Defaults to “warn”.
- inplace (bool, optional) – If True, the original data are edited in place. If False, a copy is created. Defaults to False.
Raises: ValueError
– If the defined column failed to convert to datetime.
Returns: Data with datetime index.
Return type: pd.DataFrame
Example
>>> df = pd.DataFrame([[1], [2], [3]], index=["08/04/2022", "09/04/2022", "10/04/2022"])
>>> df.index = pd.to_datetime(df.index)
>>> df = infer_frequency(df)
>>> df.index.freq
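Under the hood this presumably relies on pandas' own frequency inference; a standalone sketch of that mechanism (`pd.infer_freq` is plain pandas, not this library's API):

```python
import pandas as pd

# Three consecutive days; pandas recognizes the daily pattern.
idx = pd.to_datetime(["2022-04-08", "2022-04-09", "2022-04-10"])
freq = pd.infer_freq(idx)
print(freq)  # "D" for daily data
```

`infer_frequency` would then set `df.index.freq` from the inferred value, warning or raising per `on_error` when inference returns None.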
-
mydatapreprocessing.consolidation.consolidation_functions.
move_on_first_column
(df: pandas.core.frame.DataFrame, name_or_index: Union[str, int, pandas.core.indexes.base.Index, numpy.integer]) → pandas.core.frame.DataFrame[source]¶ Move the defined column to index 0.
A use case for this is, for example, making the column clearly visible in a generated table.
Parameters: - df (pd.DataFrame) – Input data.
- name_or_index (PandasIndex) – Index or name of the column that will be moved.
Raises: KeyError
– If the defined column is not found in the data.
Returns: DataFrame with the defined column at index 0.
Return type: pd.DataFrame
Example
>>> move_on_first_column(pd.DataFrame([[1, 2, 3]], columns=["One", "Two", "Three"]), "Two").columns
Index(['Two', 'One', 'Three']...
-
mydatapreprocessing.consolidation.consolidation_functions.
remove_nans
(data: DataFrameOrArrayGeneric, remove_all_column_with_nans_threshold: None | Numeric = None, remove_nans_type: None | Literal['interpolate', 'mean', 'neighbor', 'remove'] | Any = None, inplace: bool = False) → DataFrameOrArrayGeneric[source]¶ Remove NaN (Not a Number) values.
Columns with too many NaN values are dropped. Then, in the rest of the columns, rows with NaNs are removed, or the NaNs are interpolated.
Parameters: - data (DataFrameOrArrayGeneric) – Data in shape (n_samples, n_features).
- remove_all_column_with_nans_threshold (None | Numeric, optional) – From 0 to 1. Requires that fraction of non-NaN numeric values in a column for the column not to be deleted. E.g. if the value is 0.9 for a column with 10 values, 90% must be numeric, which implies at most 1 np.nan can be present, otherwise the column will be deleted. Defaults to 0.85.
- remove_nans_type (None | Literal["interpolate", "mean", "neighbor", "remove"] | Any, optional) – Remove or replace the remaining NaN values. If you want to use a concrete replacement value, just pass the value directly. Defaults to ‘interpolate’.
- inplace (bool, optional) – If True, the original data are edited in place. If False, a copy is created. Defaults to False.
Example
>>> import numpy as np
>>> array = np.array([[1, 2, np.nan], [2, np.nan, np.nan], [3, 4, np.nan]])
>>> array
array([[ 1.,  2., nan],
       [ 2., nan, nan],
       [ 3.,  4., nan]])
>>> cleaned_df = remove_nans(
...     array,
...     remove_all_column_with_nans_threshold=0.5,
...     remove_nans_type="interpolate"
... )
>>> cleaned_df
array([[1., 2.],
       [2., 3.],
       [3., 4.]])
-
mydatapreprocessing.consolidation.consolidation_functions.
resample
(df: pd.DataFrame, freq: Literal['S', 'min', 'H', 'M', 'Y'] | str, resample_function: Literal['sum', 'mean'])[source]¶ Change the sampling frequency.
Parameters: - df (pd.DataFrame) – Input data.
- freq (Literal["S", "min", "H", "M", "Y"] | str) – Frequency of resampled data. For possible options check pandas ‘Offset aliases’.
- resample_function (Literal['sum', 'mean'], optional) – Whether to sum the resampled columns or to use their average. Defaults to ‘sum’.
Returns: Resampled data.
Return type: pd.DataFrame
Example
>>> from datetime import datetime, timedelta
>>> df = pd.DataFrame(
...     {
...         "date": [
...             datetime(2022, 1, 1),
...             datetime(2022, 1, 2),
...             datetime(2022, 2, 1),
...             datetime(2022, 4, 1)
...         ],
...         "col_1": [1] * 4,
...         "col_2": [2] * 4,
...     }
... )
>>> df
        date  col_1  col_2
0 2022-01-01      1      2
1 2022-01-02      1      2
2 2022-02-01      1      2
3 2022-04-01      1      2
>>> df = df.set_index("date")
>>> df = resample(df, "M", "sum")
>>> df
            col_1  col_2
date
2022-01-31      2      4
2022-02-28      1      2
2022-03-31      0      0
2022-04-30      1      2
>>> df.index.freq
<MonthEnd>
-
mydatapreprocessing.consolidation.consolidation_functions.
set_datetime_index
(df: pandas.core.frame.DataFrame, name_or_index: Union[str, int, pandas.core.indexes.base.Index, numpy.integer], on_error: typing_extensions.Literal['ignore', 'raise'] = 'ignore', inplace: bool = False) → pandas.core.frame.DataFrame[source]¶ Set the defined column as the index and convert it to datetime.
Parameters: - df (pd.DataFrame) – Input data.
- name_or_index (PandasIndex) – Name or index of the datetime column that will be set as the index.
- on_error (Literal["ignore", "raise"]) – What happens if converting to datetime fails. Defaults to “ignore”.
- inplace (bool, optional) – If True, the original data are edited in place. If False, a copy is created. Defaults to False.
Raises: ValueError
– If the defined column failed to convert to datetime.
Returns: Data with datetime index.
Return type: pd.DataFrame
Example
>>> from datetime import datetime
>>> df = pd.DataFrame(
...     {
...         "col_1": [1] * 3,
...         "col_2": [2] * 3,
...         "date": [
...             datetime(2022, 1, 1),
...             datetime(2022, 2, 1),
...             datetime(2022, 3, 1),
...         ],
...     }
... )
>>> df = set_datetime_index(df, 'date', inplace=True)
>>> isinstance(df.index, pd.DatetimeIndex)
True