mydatapreprocessing.preprocessing.preprocessing_functions package

Various functions for data preprocessing. They are usually called from the preprocessing subpackage through the preprocessing pipeline function, but they can also be used separately.

mydatapreprocessing.preprocessing.preprocessing_functions.binning(data: DataFrameOrArrayGeneric, bins: int, binning_type: Literal['cut', 'qcut'] = 'cut') → DataFrameOrArrayGeneric[source]

Discretize values into a defined number of bins.

The returned data has the same shape as the input. Each value is replaced by the middle (average) value of the bin interval it falls into.

Parameters:
  • data (DataFrameOrArrayGeneric) – Data for preprocessing. ndim = 2 (n_samples, n_features).
  • bins (int) – Number of bins, which is also the number of unique values in the output.
  • binning_type (Literal["cut", "qcut"], optional) – “cut” for bin intervals of equal size (with a different number of members in each bin) or “qcut” for an equal number of members in each bin (with bin intervals of varying size). It uses the pandas cut or qcut function. Defaults to “cut”.
Returns:

Discretized data of the same type as the input: a numpy array input gives an array output, a DataFrame input gives a DataFrame output.

Return type:

DataFrameOrArrayGeneric

Example

>>> binning(np.array(range(5)), bins=3, binning_type="cut")
array([[0.6645],
       [0.6645],
       [2.    ],
       [3.3335],
       [3.3335]])
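
The values in the example can be reproduced with plain pandas. The following is a minimal sketch, assuming that the "cut" mode applies pandas.cut and then replaces each value by the midpoint of its interval. It is an illustration, not the library's source code, and it works on a one-dimensional array for brevity.

```python
import numpy as np
import pandas as pd

values = np.arange(5)

# Equal-width binning: pandas returns one Interval per input value.
intervals = pd.cut(values, bins=3)

# Replace every value by the midpoint of the interval it fell into.
midpoints = np.array([interval.mid for interval in intervals])
print(midpoints)
```

Because pandas extends the lowest bin edge slightly so that the minimum is included, the first midpoint is a little below the exact third of the range.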
mydatapreprocessing.preprocessing.preprocessing_functions.do_difference(data: DataFrameOrArrayGeneric) → DataFrameOrArrayGeneric[source]

Transform data into differences between neighboring values.

Parameters: data (DataFrameOrArrayGeneric) – Data.
Returns: Differenced data in the same format as the input.
Return type: DataFrameOrArrayGeneric

Examples

>>> data = np.array([1, 3, 5, 2])
>>> print(do_difference(data))
[ 2  2 -3]
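
For a one-dimensional numpy array, the result above matches what numpy.diff produces. This is a small sketch of the operation itself, under the assumption that the function behaves like a first-order difference; it does not call the library.

```python
import numpy as np

data = np.array([1, 3, 5, 2])

# Each output element is data[i + 1] - data[i], so the result is one element shorter.
differenced = np.diff(data)
print(differenced)  # [ 2  2 -3]
```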
mydatapreprocessing.preprocessing.preprocessing_functions.fitted_power_transform(data: np.ndarray, fitted_stdev: float, mean: float | None = None, fragments: int = 10, iterations: int = 5) → np.ndarray[source]

Transform data so that its standard deviation and mean are close to the requested values.

It uses the Box-Cox power transform from the SciPy library.

Parameters:
  • data (np.ndarray) – Array of data that should be transformed (one column => ndim = 1).
  • fitted_stdev (float) – Desired standard deviation of the transformed data.
  • mean (float | None, optional) – Desired mean of the transformed data. Defaults to None.
  • fragments (int, optional) – How many lambdas will be used in one iteration. Defaults to 10.
  • iterations (int, optional) – How many iterations will be used to find best transform. Defaults to 5.
Returns:

Transformed data with the requested standard deviation and mean.

Return type:

np.ndarray
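
The roles of fragments and iterations can be illustrated with a grid search over the Box-Cox lambda. The code below is a rough sketch under several assumptions: the helper names box_cox and fit_power_transform_sketch, the search bounds, and the narrowing strategy are all hypothetical, and the Box-Cox formula is written out by hand instead of calling SciPy. The library's actual search may differ.

```python
import numpy as np


def box_cox(x, lmbda):
    """Box-Cox transform of strictly positive data."""
    if lmbda == 0:
        return np.log(x)
    return (x**lmbda - 1.0) / lmbda


def fit_power_transform_sketch(data, fitted_stdev, mean=None, fragments=10, iterations=5):
    """Grid-search a lambda whose transform has a standard deviation near fitted_stdev."""
    low, high = 0.0, 3.0  # hypothetical search bounds, not taken from the library
    best_lambda = 1.0
    for _ in range(iterations):
        # Try `fragments` evenly spaced lambdas in the current interval.
        lambdas = np.linspace(low, high, fragments)
        errors = [abs(box_cox(data, lmbda).std() - fitted_stdev) for lmbda in lambdas]
        best_lambda = lambdas[int(np.argmin(errors))]
        # Narrow the search interval around the best candidate.
        step = (high - low) / (fragments - 1)
        low, high = max(best_lambda - step, 0.0), best_lambda + step
    transformed = box_cox(data, best_lambda)
    if mean is not None:
        # Shifting changes the mean but not the standard deviation.
        transformed = transformed - transformed.mean() + mean
    return transformed


data = np.arange(1.0, 11.0)
result = fit_power_transform_sketch(data, fitted_stdev=2.0, mean=5.0)
print(result.std(), result.mean())
```

More fragments give a finer grid per pass, and more iterations narrow the interval further, so both trade run time for a closer match to the requested standard deviation.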

mydatapreprocessing.preprocessing.preprocessing_functions.inverse_difference(data: numpy.ndarray, last_undiff_value: Union[float, int, numpy.number]) → numpy.ndarray[source]

Invert the do_difference transform.

Parameters:
  • data (np.ndarray) – One dimensional differenced data from do_difference function.
  • last_undiff_value (Numeric) – Last known value of the undifferenced series. It is the starting value used to compute the rest.
Returns:

Restored (undifferenced) data, not the series of differences.

Return type:

np.ndarray

Examples

>>> data = np.array([1, 1, 1, 1])
>>> print(inverse_difference(data, 1))
[2 3 4 5]
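
The example is consistent with a cumulative sum of the differences offset by the starting value. The following sketch shows a round trip on that assumption, using numpy.diff and numpy.cumsum in place of the library functions.

```python
import numpy as np

original = np.array([1, 3, 5, 2])
differenced = np.diff(original)  # [ 2  2 -3]

# Cumulative sum of the differences, offset by the value that preceded them.
restored = np.cumsum(differenced) + original[0]
print(restored)  # [3 5 2]
```

Note that the starting value itself is not part of the result, so the restored series has the same length as the differenced one.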
mydatapreprocessing.preprocessing.preprocessing_functions.remove_the_outliers(data: DataFrameOrArrayGeneric, threshold: Union[float, int, numpy.number] = 3) → DataFrameOrArrayGeneric[source]

Deprecated function. Historically, remove_outliers was the name of a pipeline parameter in the same module, so the function needed a different name. Use remove_outliers instead where possible. This function will be removed in the next major version.

mydatapreprocessing.preprocessing.preprocessing_functions.remove_outliers(data: DataFrameOrArrayGeneric, threshold: Union[float, int, numpy.number] = 3) → DataFrameOrArrayGeneric[source]

Remove values far from the mean, which are probably errors.

If the data has more columns, only rows with an outlier in the predicted column are deleted. The predicted column (the column that is searched for outliers) is expected to be at index 0.

Parameters:
  • data (DataFrameOrArrayGeneric) – Time series data. Must have ndim = 2, so if univariate, reshape.
  • threshold (Numeric, optional) – Number of standard deviations from the mean beyond which a value is treated as an outlier and removed. Defaults to 3.
Returns:

Cleaned data.

Return type:

DataFrameOrArrayGeneric

Examples

>>> data = np.array(
...     [
...         [1, 7],
...         [66, 3],
...         [5, 5],
...         [2, 3],
...         [2, 3],
...         [3, 9],
...     ]
... )
>>> remove_outliers(data, threshold=2)
array([[1, 7],
       [5, 5],
       [2, 3],
       [2, 3],
       [3, 9]])
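
The rule described above can be sketched with plain numpy. This is an illustration under the assumption that a row is dropped when the value in column 0 lies more than threshold standard deviations from that column's mean; whether the library uses the population or the sample standard deviation is not stated here, and the sketch uses numpy's default (population).

```python
import numpy as np

data = np.array([[1, 7], [66, 3], [5, 5], [2, 3], [2, 3], [3, 9]])
threshold = 2

column = data[:, 0]
# Distance of each value from the column mean, in units of standard deviation.
z_scores = np.abs(column - column.mean()) / column.std()

# Keep only the rows whose first column is within the threshold.
cleaned = data[z_scores <= threshold]
print(cleaned)
```

Only the row containing 66 exceeds the threshold, so it is the only row removed, which matches the example output.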
mydatapreprocessing.preprocessing.preprocessing_functions.smooth(data: DataFrameOrArrayGeneric, window=101, polynomial_order=2, inplace: bool = False) → DataFrameOrArrayGeneric[source]

Smooth data (reduce noise) with a Savitzky-Golay filter. For more information about the filter, see the SciPy documentation.

Parameters:
  • data (DataFrameOrArrayGeneric) – Input data.
  • window (int, optional) – Length of sliding window. Must be odd. Defaults to 101.
  • polynomial_order (int, optional) – Order of the polynomial fitted within each window. Defaults to 2.
  • inplace (bool, optional) – If True, then original data are edited. If False, copy is created. Defaults to False.
Returns:

Cleaned data with less noise.

Return type:

DataFrameOrArrayGeneric
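
To show what the filter does, here is a short sketch that calls scipy.signal.savgol_filter directly with the same default window and polynomial order. The noisy sine signal is made up for the illustration, and it is an assumption that the library passes these parameters through unchanged.

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 1000)
clean = np.sin(t)
noisy = clean + rng.normal(scale=0.3, size=t.size)

# polyorder must be smaller than window_length.
smoothed = savgol_filter(noisy, window_length=101, polyorder=2)

# Mean absolute error against the clean signal, before and after smoothing.
print(np.abs(noisy - clean).mean(), np.abs(smoothed - clean).mean())
```

A window of 101 samples needs at least that many data points, so shorter series require a smaller window.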

mydatapreprocessing.preprocessing.preprocessing_functions.standardize_one_way(data: DataFrameOrArrayGeneric, minimum: float, maximum: float, axis: Literal[0, 1] = 0, inplace: bool = False) → DataFrameOrArrayGeneric[source]

Own implementation of standardization. No inverse transformation is available.

It exists so that built applications do not have to bundle sklearn.

Parameters:
  • data (DataFrameOrArrayGeneric) – Data.
  • minimum (float) – Minimum in transformed axis.
  • maximum (float) – Maximum in transformed axis.
  • axis (Literal[0, 1], optional) – 0 to columns, 1 to rows. Defaults to 0.
  • inplace (bool, optional) – If True, then original data are edited. If False, copy is created. Defaults to False.
Returns:

Standardized data of the same type as the input: a numpy array input gives an array output, a DataFrame input gives a DataFrame output.

Return type:

DataFrameOrArrayGeneric
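
The parameters suggest a min-max rescaling into the range from minimum to maximum. The sketch below shows that idea with numpy, assuming linear scaling along the chosen axis; the helper name min_max_scale_sketch is hypothetical and the library's exact formula may differ.

```python
import numpy as np


def min_max_scale_sketch(data, minimum, maximum, axis=0):
    """Linearly rescale data so each slice along `axis` spans [minimum, maximum]."""
    data = np.asarray(data, dtype=float)
    low = data.min(axis=axis, keepdims=True)
    high = data.max(axis=axis, keepdims=True)
    # Note: a constant slice (high == low) would divide by zero here.
    return (data - low) / (high - low) * (maximum - minimum) + minimum


data = np.array([[1, 10], [2, 20], [3, 30]])
scaled = min_max_scale_sketch(data, minimum=0, maximum=1, axis=0)
print(scaled)
```

With axis=0 each column is scaled independently, so both columns of this input map to 0, 0.5, and 1.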

mydatapreprocessing.preprocessing.preprocessing_functions.standardize(data: DataFrameOrArrayGeneric, used_scaler: Literal[('standardize', '01', '-11', 'robust')] = 'standardize') → tuple[DataFrameOrArrayGeneric, 'ScalerType'][source]

Standardize or normalize data.

Several standardization methods are available. The predicted column is expected to be at index 0.

Parameters:
  • data (DataFrameOrArrayGeneric) – Time series data.
  • used_scaler (Literal['standardize', '01', '-11', 'robust'], optional) – ‘01’ and ‘-11’ normalize the data to the ranges 0 to 1 and -1 to 1 respectively. ‘robust’ uses RobustScaler and ‘standardize’ uses StandardScaler, which gives mean 0 and standard deviation 1. Defaults to ‘standardize’.
Returns:

Standardized data and scaler for inverse transformation.

Return type:

tuple[DataFrameOrArrayGeneric, ScalerType]
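
The returned scaler is what makes the inverse transformation possible. As a minimal numpy sketch of the ‘standardize’ case, assuming column-wise scaling with the population standard deviation (as StandardScaler does), the forward and inverse steps look like this. The variables mean and std stand in for the state a fitted scaler would hold.

```python
import numpy as np

data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# State a fitted scaler would remember for the inverse transformation.
mean = data.mean(axis=0)
std = data.std(axis=0)

# Forward: every column gets mean 0 and standard deviation 1.
scaled = (data - mean) / std

# Inverse: undo the scaling with the stored statistics.
restored = scaled * std + mean
print(scaled)
print(restored)
```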