
Library/framework for data loading, consolidation, feature engineering, and preprocessing.


mydatapreprocessing


Load data from a web link or a local file (JSON, CSV, Excel, parquet, h5...), consolidate it (resample, clean NaN values, do string embedding), derive new features via column derivations, and do preprocessing like standardization or smoothing. To see how the functions work, check their docstrings; working examples with printed results are also in tests/visual.py.

Links

Repo on GitHub

Official readthedocs documentation

Installation

Python >=3.6 (Python 2 is not supported).

Install simply with

pip install mydatapreprocessing

Some dependencies are not needed by every user (for example, those used only for specific data inputs). If you want to be sure to have all of them, install with extras:

pip install mydatapreprocessing[datatypes]

Available extras are ["all", "datasets", "datatypes"]
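For example, to install all optional dependencies at once, use the "all" extra listed above:

pip install mydatapreprocessing[all]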

Examples

You can use the live Jupyter demo on Binder.

import mydatapreprocessing as mdp
import pandas as pd
import numpy as np

Load data

You can use:

  • Python objects (numpy.ndarray, pd.DataFrame, list, tuple, dict)
  • local files
  • web urls

Supported file formats are:

  • csv
  • xlsx and xls
  • json
  • parquet
  • h5

You can load multiple files at once by passing them in a list.

The syntax is always the same.

data = mdp.load_data.load_data(
    "https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-min-temperatures.csv",
)
# data2 = mdp.load_data.load_data(["PATH_TO_FILE.csv", "PATH_TO_FILE2.csv"])
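As listed above, load_data also accepts in-memory Python objects. A minimal sketch (the DataFrame and its column name "Temp" are illustrative only):

df_input = pd.DataFrame({"Temp": range(10)})  # illustrative in-memory data
data_from_python = mdp.load_data.load_data(df_input)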

Consolidation

If you want to use the data in machine learning models, you will probably want to remove NaN values, convert string columns to numeric where possible, do encoding or keep only numeric data, and resample.

Consolidation works with pandas DataFrames, as column names matter here.

There are many functions, but the main function, consolidate_data, pipelines the others.

consolidation_config = mdp.consolidation.consolidation_config.default_consolidation_config.do.copy()
consolidation_config.datetime.datetime_column = 'Date'  # column to parse as the datetime index
consolidation_config.resample.resample = 'M'  # resample to monthly frequency
consolidation_config.resample.resample_function = "mean"
consolidation_config.dtype = 'float32'

consolidated = mdp.consolidation.consolidate_data(data, consolidation_config)
print(consolidated.head())

Feature engineering

Functions in feature_engineering and preprocessing expect data in the shape (n_samples, n_features). n_samples is usually much bigger, so data are transposed in consolidate_data if necessary.

In configs, you can use the shorter dict update syntax, as all value names are unique.
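A minimal sketch of that shorter syntax, reusing the consolidation config from above and assuming the same .do.update pattern shown in the Preprocessing example below:

consolidation_config.do.update({"datetime_column": "Date", "resample": "M"})  # flat keys, since value names are unique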

Feature engineering

Create new columns that can be for example used as another machine learning model input.

import mydatapreprocessing.feature_engineering as mdpf
import mydatapreprocessing as mdp

data = pd.DataFrame(
    [mdp.datasets.sin(n=30), mdp.datasets.ramp(n=30)]
).T

extended = mdpf.add_derived_columns(data, differences=True, rolling_means=10)
print(extended.columns)
# Rolling means need a full window, so the output has fewer rows than the input.
print(f"\nIt has fewer rows than the input: {len(extended)}")


Preprocessing

Preprocessing can be used on a pandas DataFrame as well as on a numpy array. Column names are not important, as the data are just a matrix with a defined dtype.

There are many functions, but the main function, preprocess_data, pipelines the others. Preprocessed data can be converted back with preprocess_data_inverse.

from mydatapreprocessing import preprocessing as mdpp

df = pd.DataFrame(np.array([range(5), range(20, 25), np.random.randn(5)]).astype("float32").T)
df.iloc[2, 0] = 500  # inject an artificial outlier

config = mdpp.preprocessing_config.default_preprocessing_config.do.copy()
config.do.update({"remove_outliers": None, "difference_transform": True, "standardize": "standardize"})
data_preprocessed, inverse_config = mdpp.preprocess_data(df.values, config)
inverse_config.difference_transform = df.iloc[0, 0]  # first value is needed to invert the difference transform
data_preprocessed_inverse = mdpp.preprocess_data_inverse(
    data_preprocessed[:, 0], inverse_config
)
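As a rough sanity check (assuming the inverse fully reverts the standardization and difference transform), the reconstruction should approximately match the original first column, up to float32 rounding:

# Compare the reconstruction with the original first column.
print(data_preprocessed_inverse)
print(df.values[:, 0])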
