mydatapreprocessing.misc package

Miscellaneous functions that do not fit into other modules.

It contains, for example, a train / test split, a rolling-window generator, a function that cleans a DataFrame so that it prints well as a table, and a function that inserts gaps into time series data where samples are missing, so that two distant points are not joined in a plot.

mydatapreprocessing.misc.add_none_to_gaps(df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame

If there are empty windows in a sampled signal, add one row of None values at the start of each empty window.

The reason is correct plotting: points within a continuous segment stay connected, but no line is drawn across a gap.

Parameters: df (pd.DataFrame) – DataFrame with time index.
Returns: DataFrame with None row inserted in time gaps.
Return type: pd.DataFrame
Raises: NotImplementedError – String values are not supported, use only numeric columns.

Note

The DataFrame is converted to float64 dtype so that np.nan can be used.

Example

>>> data = pd.DataFrame([[0, 1]] * 7, index=[0.1, 0.2, 0.3, 1.0, 1.1, 1.2, 2.0])
>>> data
     0  1
0.1  0  1
0.2  0  1
0.3  0  1
1.0  0  1
1.1  0  1
1.2  0  1
2.0  0  1
>>> df_gaps = add_none_to_gaps(data)
>>> df_gaps
       0    1
0.1  0.0  1.0
0.2  0.0  1.0
0.3  0.0  1.0
0.4  NaN  NaN
1.0  0.0  1.0
1.1  0.0  1.0
1.2  0.0  1.0
1.3  NaN  NaN
2.0  0.0  1.0
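The gap insertion described above can be sketched with plain pandas. This is a minimal illustration under stated assumptions, not the library's implementation: insert_gap_rows is a hypothetical helper, the smallest step in the index is assumed to be the regular sampling period, and an integer index is used so that step comparisons are exact.

```python
import numpy as np
import pandas as pd


def insert_gap_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add one NaN row right after each gap start."""
    index = df.index.to_numpy()
    steps = np.diff(index)
    # Assumption: the smallest step between samples is the sampling period.
    period = steps.min()
    # Samples whose distance to the next sample exceeds the period start a gap.
    gap_starts = index[:-1][steps > period]
    # One all-NaN row per gap, placed one period after the last sample before it.
    gap_rows = pd.DataFrame(np.nan, index=gap_starts + period, columns=df.columns)
    # Cast to float64 so NaN fits, then merge and restore the time order.
    return pd.concat([df.astype("float64"), gap_rows]).sort_index()


data = pd.DataFrame([[0, 1]] * 7, index=[1, 2, 3, 10, 11, 12, 20])
with_gaps = insert_gap_rows(data)
print(with_gaps)
```

With a float index such as the one in the example above, the exact comparison steps > period would need a tolerance (for instance np.isclose), because differences like 0.3 - 0.2 are not exactly 0.1 in floating point.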
mydatapreprocessing.misc.edit_table_to_printable(df: pandas.core.frame.DataFrame, line_length_limit: int = 16, round_decimals: int = 3, number_length_limit: Union[float, int, numpy.number] = 1000000000.0) → pandas.core.frame.DataFrame

Edit a DataFrame so that it can be used in tabulate (or elsewhere).

Parameters:
  • df (pd.DataFrame) – Input data with numeric or text columns.
  • line_length_limit (int, optional) – Add line breaks if a line is too long. Defaults to 16.
  • round_decimals (int, optional) – Round numeric columns to the given number of decimals. Defaults to 3.
  • number_length_limit (Numeric, optional) – If a number is very big or very small, convert it to scientific notation. Defaults to 10e8.
Returns:

A shorter, more readable DataFrame suitable for printing (usually in a table).

Return type:

pd.DataFrame

Note

DataFrame column names may change, because '\n' can be added to them.

Example

>>> df = pd.DataFrame([[151646516516, 1.5648646, "Lorem ipsum something else"], [1, 2, "3"]])
>>> for_table = edit_table_to_printable(df).values[0]
>>> for_table[0]
'1.516e+11'
>>> for_table[1]
1.565
>>> for_table[2]
'Lorem ipsum\nsomething else'
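The three formatting rules listed in the parameters can be sketched per cell with the standard library only. This is an assumption about how such rules could be applied, not the library's code: format_cell is a hypothetical helper, it handles only the "very big" case of number_length_limit, and whether the real threshold comparison is inclusive is unknown.

```python
import textwrap


def format_cell(value, line_length_limit=16, round_decimals=3, number_length_limit=1e9):
    """Hypothetical helper: make a single table cell shorter and more readable."""
    if isinstance(value, str):
        # Break long text on whitespace so no line exceeds the limit.
        return "\n".join(textwrap.wrap(value, line_length_limit))
    if isinstance(value, (int, float)):
        if abs(value) >= number_length_limit:
            # Very large magnitude: switch to scientific notation (returns a string).
            return f"{value:.{round_decimals}e}"
        # Ordinary number: just round it.
        return round(value, round_decimals)
    # Anything else is left untouched.
    return value


row = [151646516516, 1.5648646, "Lorem ipsum something else"]
formatted = [format_cell(cell) for cell in row]
print(formatted)
```

Note that cells converted to scientific notation become strings, which matches the quoted '1.516e+11' in the example above.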
mydatapreprocessing.misc.rolling_windows(data: numpy.ndarray, window: int) → numpy.ndarray

Generate matrix of rolling windows.

It uses NumPy stride tricks, so it returns a view. The benefit is that it is much more memory efficient, but beware: if you modify the new array, the changes also appear in the original data.

Example

>>> rolling_windows(np.array([1, 2, 3, 4, 5]), window=2)
array([[1, 2],
       [2, 3],
       [3, 4],
       [4, 5]])

It also works on arrays with more than one dimension:

>>> rolling_windows(np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]), window=2)
array([[[ 1,  2],
        [ 2,  3],
        [ 3,  4],
        [ 4,  5]],
       [[ 6,  7],
        [ 7,  8],
        [ 8,  9],
        [ 9, 10]]])
Parameters:
  • data (np.ndarray) – Array data input.
  • window (int) – Number of values in created window.
Returns:

Array of defined windows

Return type:

np.ndarray
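
The view behaviour described for rolling_windows can be illustrated with NumPy's own sliding_window_view (available in NumPy 1.20 and later). This is used here only as a stand-in; whether rolling_windows is built on the same call is an assumption. Unlike the function documented above, sliding_window_view returns a read-only view by default, so the shared memory is shown by modifying the original array instead.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

data = np.array([1, 2, 3, 4, 5])

# Windows of length 2; no values are copied, only strides are reinterpreted.
windows = sliding_window_view(data, window_shape=2)
shares_memory = np.shares_memory(data, windows)

# An explicit copy is independent of the original data.
snapshot = windows.copy()

# Changing the original array is visible through the view, not through the copy.
data[1] = 20

print(shares_memory)
print(windows)
print(snapshot)
```

If the windows need to be modified without touching the source data, taking a copy as above is the safe option, at the cost of the extra memory.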

mydatapreprocessing.misc.split(data, predicts=7)

Divide a data set into a train set and a test set.

The predicted column is assumed to be at index 0. The function is intended mostly for time series predictions, so the test set contains only the predicted column, which can be used directly for error criterion evaluation. This makes it different from the usual train / test split.

Parameters:
  • data (pd.DataFrame | np.ndarray) – Time series data. ndim has to be 2, reshape if necessary.
  • predicts (int, optional) – Number of predicted values. Defaults to 7.
Returns:

Train set and test set. If the input is a NumPy array, the output is also arrays; if the input is a DataFrame, the output uses pandas types.

Return type:

tuple[pd.DataFrame | np.ndarray, pd.Series | np.ndarray]

Example

>>> data = np.array([[1], [2], [3], [4]])
>>> train, test = split(data, predicts=2)
>>> train
array([[1],
       [2]])
>>> test
array([3, 4])
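
For data with more than one column, the difference from a usual train / test split is easier to see. The following is a minimal NumPy sketch of the behaviour described above, assuming the predicted column is column 0; split_sketch is a hypothetical name and not the library function.

```python
import numpy as np


def split_sketch(data: np.ndarray, predicts: int = 7):
    """Hypothetical helper: time series split with a one-column test set."""
    # Train set: everything except the last `predicts` rows, all columns.
    train = data[:-predicts]
    # Test set: only the predicted column (column 0) of the last `predicts` rows.
    test = data[-predicts:, 0]
    return train, test


data = np.array([[1, 10], [2, 20], [3, 30], [4, 40]])
train, test = split_sketch(data, predicts=2)
print(train)
print(test)
```

The train set keeps both columns, while the test set is one-dimensional and holds only the values to be compared with the predictions.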