mydatapreprocessing

License: MIT

Load data from a web link or a local file (json, csv, excel, parquet, h5…), consolidate it (resample data, clean NaN values, do string embedding), derive new features via column derivation, and do preprocessing such as standardization or smoothing. To see how the functions work, check their docstrings. Working examples with printed results are also in tests - visual.py.

Installation

Python >=3.6 (Python 2 is not supported).

Install with:

pip install mydatapreprocessing

Some libraries are needed only for certain data inputs, so not every user will use them. To be sure you have all libraries, download requirements_advanced.txt and install the advanced requirements with pip install -r requirements_advanced.txt.

Examples:

>>> import mydatapreprocessing as mdp

Load data

You can use

  • python formats (numpy.ndarray, pd.DataFrame, list, tuple, dict)
  • local files
  • web urls

You can load several sources at once by passing them in a list.

The syntax is always the same.

>>> data = mdp.load_data.load_data(
...     "https://www.ncdc.noaa.gov/cag/global/time-series/globe/land_ocean/ytd/12/1880-2016.json",
...     request_datatype_suffix=".json",
...     data_orientation="index",
...     predicted_table="data",
... )
>>> # data2 = mdp.load_data.load_data(["PATH_TO_FILE.csv", "PATH_TO_FILE2.csv"])

Consolidation

If you want to use the data for machine learning models, you will probably want to remove NaN values, convert string columns to numeric where possible, do encoding or keep only numeric data, and resample.

>>> data = mdp.preprocessing.data_consolidation(
...     data, predicted_column=0, remove_nans_threshold=0.9, remove_nans_or_replace="interpolate"
... )
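To make the NaN handling more concrete, here is a minimal pandas sketch of the general idea: drop columns that are mostly NaN, then interpolate the remaining gaps. It does not reproduce the internals of data_consolidation. The helper name drop_and_interpolate and the exact meaning of its threshold are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd


def drop_and_interpolate(df, max_nan_ratio=0.9):
    """Hypothetical helper: drop columns whose NaN share exceeds
    max_nan_ratio, then linearly interpolate the remaining gaps."""
    nan_ratio = df.isna().mean()  # share of NaN values per column
    kept = df.loc[:, nan_ratio <= max_nan_ratio]
    return kept.interpolate(limit_direction="both")


raw = pd.DataFrame(
    {
        "temperature": [1.0, np.nan, 3.0, 4.0],
        "mostly_empty": [np.nan, np.nan, np.nan, np.nan],
    }
)
clean = drop_and_interpolate(raw)
```

In this sketch the all-NaN column is dropped and the single gap in the remaining column is filled from its neighbours.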

Feature engineering

Functions in feature_engineering and preprocessing expect data in the form (n_samples, n_features). n_samples is usually much bigger than n_features, so the data are transformed to this orientation in data_consolidation if necessary.

>>> data = mdp.feature_engineering.add_derived_columns(data, differences=True, rolling_means=32)
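As a rough illustration of what derived columns such as differences and rolling means look like, the following sketch builds them directly with pandas. It is not the implementation of add_derived_columns; the column names and the window size of 2 are assumptions chosen to keep the example small.

```python
import pandas as pd

series = pd.DataFrame({"value": [1.0, 2.0, 4.0, 7.0, 11.0]})

derived = series.copy()
# First difference: change between consecutive samples
derived["value_diff"] = series["value"].diff()
# Rolling mean over a window of 2 samples
derived["value_rolling_mean"] = series["value"].rolling(window=2).mean()
# The first row has no predecessor, so it contains NaN and is dropped
derived = derived.dropna().reset_index(drop=True)
```

Note that derived columns shorten the data: every differencing or rolling step loses the leading rows that lack enough history.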

Preprocessing

preprocess_data returns the preprocessed data, but also the last undifferenced value and the scaler for inverse transformation. If you do not need the last two, unpack them into _:

>>> data_preprocessed, _, _ = mdp.preprocessing.preprocess_data(
...     data,
...     remove_outliers=3,
...     smoothit=None,
...     correlation_threshold=False,
...     data_transform=False,
...     standardizeit="standardize",
... )
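The reason extra values are returned alongside the data is that scaling can only be undone if its parameters are kept. The NumPy sketch below shows this round trip for plain standardization. The function names standardize and inverse_standardize are hypothetical and are not part of mydatapreprocessing, which returns a scaler instead.

```python
import numpy as np


def standardize(values):
    """Scale columns to zero mean and unit variance.

    Also returns the mean and standard deviation needed to undo the scaling.
    """
    mean = values.mean(axis=0)
    std = values.std(axis=0)
    return (values - mean) / std, mean, std


def inverse_standardize(scaled, mean, std):
    """Map standardized values back to the original scale."""
    return scaled * std + mean


data = np.array([[10.0], [12.0], [14.0], [16.0]])
scaled, mean, std = standardize(data)
restored = inverse_standardize(scaled, mean, std)
```

Model predictions made on standardized data can be mapped back to original units in the same way.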

Creating inputs

>>> seqs, Y, x_input, test_inputs = mdp.create_model_inputs.make_sequences(
...     data_preprocessed.values, predicts=7, repeatit=3, n_steps_in=6, n_steps_out=1, constant=1
... )

Submodules

create_model_inputs

This module creates inputs for machine learning models (for example scikit-learn or TensorFlow) from time series data. The usual input types are (X, y, x_input). X stands for the vector of inputs, y for the vector of outputs, and x_input is the input for the new predictions we want to create.

It contains make_sequences, which creates sequences from time samples, create_inputs, which tells make_sequences which sequences to create for which models, and create_tests_outputs, which creates outputs for defined inputs so that an error criterion such as RMSE can be computed.

Functions are documented in their docstrings.
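For readers new to the (X, y, x_input) convention, the following sketch shows one common way to cut a time series into input and output windows. It is a simplified stand-in, not the make_sequences implementation; the helper sliding_windows and its behaviour are assumptions for illustration.

```python
import numpy as np


def sliding_windows(series, n_steps_in, n_steps_out=1):
    """Hypothetical helper: split a 1D series into input windows X
    and the values y that directly follow each window."""
    X, y = [], []
    last_start = len(series) - n_steps_in - n_steps_out
    for start in range(last_start + 1):
        stop = start + n_steps_in
        X.append(series[start:stop])
        y.append(series[stop:stop + n_steps_out])
    return np.array(X), np.array(y)


series = np.arange(10)
X, y = sliding_windows(series, n_steps_in=6, n_steps_out=1)
# The most recent window has no known target yet; it is the input
# for the new prediction we want to create.
x_input = series[-6:]
```

Each row of X is paired with the row of y at the same position, while x_input is kept aside for forecasting beyond the end of the data.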

database

The module includes two functions: database_load and database_deploy. The first downloads data from a database; you need to set up connection credentials first. database_deploy then deploys data to the database server.

So far it works only with MS SQL Server.

feature_engineering

You can add new derived columns. The newly generated data can help machine learning models achieve better results.

In add_derived_columns you can add first and second derivatives (differences), multiplications of columns, rolling means and rolling standard deviations.

In add_frequency_columns you can add the maxima of fast Fourier transform results computed on a running window.
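One possible reading of "FFT maxima on a running window" is sketched below with NumPy: for every window position, take the FFT and record which frequency bin is strongest. This is only an interpretation; the helper dominant_frequency_bins is hypothetical and add_frequency_columns may compute something different.

```python
import numpy as np


def dominant_frequency_bins(signal, window):
    """Hypothetical helper: index of the strongest non-constant FFT bin
    in each running window of the signal."""
    bins = []
    for start in range(len(signal) - window + 1):
        spectrum = np.abs(np.fft.rfft(signal[start:start + window]))
        # Skip bin 0, which only holds the window mean
        bins.append(int(np.argmax(spectrum[1:]) + 1))
    return np.array(bins)


t = np.arange(64)
signal = np.sin(2 * np.pi * 4 * t / 32)  # 4 full cycles per 32 samples
result = dominant_frequency_bins(signal, window=32)
```

Because the test signal has a constant frequency, every window reports the same dominant bin; on real data the value changes as the signal's periodicity changes.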

generate_data

Definitions of test data. The data can be used, for example, to validate machine learning time series prediction results.

The only ‘real’ data is an ECG heart signal returned by the function get_ecg().

load_data

This module helps you load data in various formats from a local path as well as from a web URL.

Supported path formats are:

  • csv
  • xlsx and xls
  • json
  • parquet
  • h5

You can pass several files (or URLs) at once and your data will be automatically concatenated.
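What concatenating several sources amounts to can be sketched with plain pandas. The example uses in-memory text as a stand-in for two CSV files with the same columns; it does not call load_data, and the column names are made up for illustration.

```python
import io

import pandas as pd

# In-memory stand-ins for two CSV files with identical columns
first_csv = io.StringIO("date,value\n2020-01-01,1\n2020-01-02,2\n")
second_csv = io.StringIO("date,value\n2020-01-03,3\n")

frames = [
    pd.read_csv(source, parse_dates=["date"])
    for source in (first_csv, second_csv)
]
# Stack the rows of all sources into one table with a fresh index
combined = pd.concat(frames, ignore_index=True)
```

This only works cleanly when the sources share the same columns, which is worth checking before loading many files at once.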

The main function is load_data, where you can find working examples.

There is also the function get_file_paths, which opens a dialog window in your operating system and lets you choose files conveniently. Its tuple output can then be passed to load_data.

misc

Miscellaneous functions that do not fit into other modules. Here you can find, for example, functions for train / test split, a function for rolling windows, a function that cleans a dataframe for printing as a table, and a function that adds gaps to time series data where no data exist, so that two distant points are not joined in a plot.
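The idea of adding gaps can be sketched with pandas: reindexing onto a regular time grid turns missing timestamps into NaN, and most plotting libraries break the line at NaN values. This is an assumption about the general technique, not the code used in the misc module, and the daily frequency is chosen only for the example.

```python
import pandas as pd

index = pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-05"])
series = pd.Series([1.0, 2.0, 5.0], index=index)

# Regular daily grid covering the whole observed range
full_index = pd.date_range(index.min(), index.max(), freq="D")
# Timestamps without data become NaN, so a line plot is interrupted there
with_gaps = series.reindex(full_index)
```

Without this step, a line plot would connect 2021-01-02 directly to 2021-01-05 and hide the fact that two days are missing.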

preprocessing

Module for data preprocessing.

You can consolidate data with data_consolidation and optimize it, for example, for machine learning models.

Then you can preprocess the data to achieve even better results.

There are many small functions that you can use separately, but the main function preprocess_data calls all of them for you based on input parameters. For inverse preprocessing, use preprocess_data_inverse.