mydatapreprocessing.load_data package

This module helps you load data from a path as well as from a web URL in various formats.

Supported path formats are:

  • csv
  • xlsx and xls
  • json
  • parquet
  • h5

You can pass multiple files (or URLs) at once and your data will be concatenated automatically.

The main function is load_data, where you can find working examples.

There is also the function get_file_paths, which opens a dialog window in your operating system and lets you choose your files in a convenient way. You can then pass this tuple output into load_data.
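A minimal sketch of that workflow (assuming get_file_paths can be called without arguments; the dialog is interactive, so the selected paths depend on the user):

>>> from mydatapreprocessing.load_data import get_file_paths, load_data
>>> paths = get_file_paths()  # opens the system dialog and returns the selected paths
>>> data = load_data(paths)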

mydatapreprocessing.load_data.load_data(data: DataFormat, header: Literal['infer'] | int | None = 'infer', csv_style: dict | Literal['infer'] = 'infer', field: str = '', sheet: str | int = 0, max_imported_length: None | int = None, request_datatype_suffix: str | None = '', data_orientation: Literal['columns', 'index'] = 'columns', ssl_verification: None | bool | str = None) → pd.DataFrame

Load data from a path, a URL, or another Python format (numpy array, list, dict) into a DataFrame.

Available formats are csv, excel xlsx, parquet, json, and h5. Multiple files can be loaded at once - just put them in a list, e.g. [df1, df2, df3] or [‘my_file1.csv’, ‘my_file2.csv’]. The structure of the files does not have to be the same. If you have files in a folder rather than in a list, you can use the get_file_paths function to open a system dialog window, select the files, and get the list of paths.

Parameters:
  • data (Any) – Path to file, URL, or Python data. For examples check the Examples section.
  • header (Literal['infer'] | int | None, optional) – Row index used as column names. If ‘infer’, it is chosen automatically. Defaults to ‘infer’.
  • csv_style (dict | Literal["infer"], optional) – Defines the CSV separator and decimal. English locales usually use {'sep': ",", 'decimal': "."}; some European countries use {'sep': ";", 'decimal': ","}. If ‘infer’, one of those two styles is chosen automatically. Defaults to ‘infer’. A sketch follows this parameter list.
  • field (str, optional) – If using json, defines which field to use. For example “field.subfield.data”, as if the json were a dict, means the data['field']['subfield']['data'] key values. If using SQL, it means which table. If an empty string, the root level is used. Defaults to ‘’.
  • sheet (str | int, optional) – If using an xls or xlsx excel file, defines which sheet will be used. If using h5, defines which key (channel group) to use. Defaults to 0.
  • max_imported_length (None | int, optional) – Max length of imported samples (before resampling). If 0, the full length is used. Defaults to None.
  • request_datatype_suffix (str | None, optional) – For example ‘json’. If using a URL with no extension, defines which datatype the GET request returns. Defaults to “”.
  • data_orientation (Literal["columns", "index"], optional) – ‘columns’ or ‘index’. If using json or a dictionary, describes how the data are oriented. Defaults to “columns”.
  • ssl_verification (None | bool | str, optional) – If loading data from the web, requests is used, and sometimes an SSL verification error can occur. This skips verification by adding the verify param to the requests call. It’s the verify param of the requests get function. Defaults to None.
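For illustration, a minimal sketch combining a few of these keywords (assuming the bundled test CSV uses English-locale separators):

>>> df = load_data(
...     "tests/test_files/tested_data.csv",
...     csv_style={'sep': ",", 'decimal': "."},  # explicit instead of 'infer'
...     max_imported_length=100,  # import at most the first 100 samples
... )
>>> df.size > 0
True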
Raises:

FileNotFoundError, TypeError, ValueError, ModuleNotFoundError – If the file or URL does not exist, or if a necessary dependency library is not found.

Returns:

Loaded data in pd.DataFrame format.

Return type:

pd.DataFrame

Examples

Some Python formats, if the data are already in Python.

Numpy array or pandas DataFrame

>>> import numpy as np
>>> array_or_DataFrame = np.random.randn(10, 2)

List of records

>>> records = [{'col_1': 3, 'col_2': 'a'}, {'col_1': 0, 'col_2': 'd'}] # List of records

Dict with columns or rows (index) - it is necessary to set data_orientation!

>>> dict_data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
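A hedged sketch of loading that dict with an explicit orientation (‘columns’ here, since the keys are column names):

>>> df_from_dict = load_data(dict_data, data_orientation='columns')
>>> list(df_from_dict.columns)
['col_1', 'col_2']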

External files

Local file.

>>> local_file = r"tests/test_files/tested_data.csv"  # The same with .parquet, .h5, .json or .xlsx.

Web URL. If the URL has a suffix, it’s not necessary to define the type, but when it’s missing, you have to specify it, e.g. request_datatype_suffix="json". Sometimes you also need to define further information, like data_orientation="index" (whether to use index or columns) or, for json, field="data", which defines in which field the data are stored (it can be nested with dots, like ‘dataset_1.data.summer’).

>>> url = "https://raw.githubusercontent.com/Malachov/mydatapreprocessing/master/tests/test_files/list.json"
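When the URL has no extension, a hedged sketch might look like this (the endpoint below is hypothetical and only illustrates the keywords described above):

>>> suffixless_data = load_data(
...     "https://example.com/api/dataset",  # hypothetical URL with no file extension
...     request_datatype_suffix="json",  # declare what datatype the GET request returns
...     field="data",  # nested field where the records are stored
...     data_orientation="index",
... )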

You can pass multiple files in a list and the data will be concatenated. It can be a list of paths or a list of Python objects. For example:

>>> multiple_dict_data = [{'col_1': 3, 'col_2': 'a'}, {'col_1': 0, 'col_2': 'd'}]  # List of records
>>> multiple_arrays_or_dfs = [np.random.randn(20, 3), np.random.randn(25, 3)]  # DataFrame same way
>>> multiple_urls = ["https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-min-temperatures.csv", "https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-min-temperatures.csv"]
>>> multiple_local_files = ["tests/test_files/tested_data.csv", "tests/test_files/tested_data.csv"]

You can also use some test data. Use one of ‘test_ecg’, ‘test_ramp’, ‘test_sin’, ‘test_random’.
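A minimal sketch, assuming these names are passed directly as the data argument:

>>> sin_data = load_data('test_sin')
>>> sin_data.size > 0
True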

>>> data_loaded = []
>>> for i in [
...     array_or_DataFrame, local_file, records, dict_data, url,
...     multiple_dict_data, multiple_arrays_or_dfs, multiple_urls, multiple_local_files
... ]:
...     data_loaded.append(load_data(i).size > 0)
>>> all(data_loaded)
True

Note

On Windows, when using paths, it’s necessary to use a raw string (‘r’ in front of the string) because of escape symbols.
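For example (a hypothetical Windows path):

>>> windows_path = r"C:\Users\me\Desktop\data.csv"  # the 'r' prefix keeps backslashes from being treated as escapes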