
Library/framework for making predictions.


mydatapreprocessing


Load data from a web link or a local file (json, csv, excel file, parquet, h5...), consolidate it and preprocess it (resampling, standardization, string embedding, derivation of new columns, feature extraction etc.) based on configuration.

The library contains three modules.

Preprocessing

The first module, preprocessing, loads data, consolidates it and preprocesses it. It contains functions such as load_data, data_consolidation, preprocess_data, preprocess_data_inverse, add_frequency_columns, rolling_windows, add_derived_columns etc.
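To give a feel for what derived columns such as differences and rolling means look like, here is a minimal NumPy sketch. It does not call or reproduce the library's implementation; `derive_columns` and its `window` parameter are hypothetical names used only for illustration.

```python
import numpy as np


def derive_columns(series, window=3):
    """Return first differences and rolling means of a 1-D series.

    Hypothetical helper for illustration; not part of mydatapreprocessing.
    """
    series = np.asarray(series, dtype=float)
    # First differences x[t] - x[t-1]; the result is one element shorter.
    differences = np.diff(series)
    # Rolling mean over `window` consecutive values; "valid" keeps only
    # positions where the whole window fits inside the series.
    kernel = np.ones(window) / window
    rolling_means = np.convolve(series, kernel, mode="valid")
    return differences, rolling_means


differences, rolling_means = derive_columns([1, 2, 4, 7, 11], window=3)
print(differences)
print(rolling_means)
```

Both derived series are shorter than the input, so in practice they have to be aligned with the original rows (for example by dropping the first rows) before being added as new columns.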

Example

import numpy as np
import pandas as pd

import mydatapreprocessing.preprocessing as mdpp

data = "https://blockchain.info/unconfirmed-transactions?format=json"

# Load data from file or URL
data_loaded = mdpp.load_data(data, request_datatype_suffix=".json", predicted_table='txs')


# Some examples of other inputs to the load_data function

# myarray_or_dataframe # NumPy array or pandas.DataFrame
# r"/home/user/my.json" # Local file; .parquet, .h5, .json and .xlsx work the same way. On Windows, use a raw string (prefix 'r') so that backslashes in the path are not treated as escape characters
# "https://yoururl/your.csv" # Web URL (with suffix). The same works with json.
# "https://blockchain.info/unconfirmed-transactions?format=json" # In this case you also have to specify 'request_datatype_suffix': "json", 'data_orientation': "index", 'predicted_table': 'txs',
# {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']} # Dict with columns or rows (index) - it is necessary to set data_orientation!


# You can pass several files in a list and the data will be concatenated. It can be a list of paths or a list of Python objects. Example:

# [{'col_1': 3, 'col_2': 'a'}, {'col_1': 0, 'col_2': 'd'}]  # List of records
# [np.random.randn(20, 3), np.random.randn(25, 3)]  # List of arrays; dataframes work the same way
# ["https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-min-temperatures.csv", "https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-min-temperatures.csv"]  # List of URLs
# ["path/to/my1.csv", "path/to/my1.csv"]


# Transform various data into a defined format - pandas dataframe - convert to numeric if possible, keep
# only numeric data and resample if configured. It returns array, dataframe
data_consolidated = mdpp.data_consolidation(
    data_loaded, predicted_column="weight", data_orientation="index", remove_nans_threshold=0.9, remove_nans_or_replace='interpolate')

# You can add extra information to the data that can help (beware, it can slow down the machine learning model)
to_be_extended = np.array([[0, 2] * 64, [0, 0, 0, 5] * 32]).T
extended = mdpp.add_frequency_columns(to_be_extended, window=8)


to_be_extended2 = pd.DataFrame([range(30), range(30, 60)]).T
extended2 = mdpp.add_derived_columns(to_be_extended2, differences=True, second_differences=True, multiplications=True,
                                    rolling_means=True, rolling_stds=True, mean_distances=True, window=10)

# Feature extraction is under development  :[

# Preprocess data. It returns the preprocessed data, but also the last undifferenced value and the scaler
# for the inverse transformation, so unpack the extra values with _
data_preprocessed, _, _ = mdpp.preprocess_data(data_consolidated, remove_outliers=True, smoothit=False,
                                              correlation_threshold=False, data_transform=False, standardizeit='standardize')
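The comment above mentions a last undifferenced value and a scaler that are needed for the inverse transformation. The following NumPy sketch illustrates why both are required, assuming the series is differenced and then standardized. It is not the library's code: `forward` and `inverse` are hypothetical helpers, and the scaler is represented here simply as a (mean, std) tuple.

```python
import numpy as np


def forward(series):
    """Difference, then standardize; also return what the inverse needs."""
    series = np.asarray(series, dtype=float)
    last_undifferenced = series[-1]   # anchor for undoing the differencing
    diffs = np.diff(series)
    mean, std = diffs.mean(), diffs.std()  # assumes std is not zero
    scaled = (diffs - mean) / std
    return scaled, last_undifferenced, (mean, std)


def inverse(scaled_values, last_undifferenced, scaler):
    """Map values predicted in the transformed space back to the original scale."""
    mean, std = scaler
    diffs = np.asarray(scaled_values, dtype=float) * std + mean
    # Cumulative sum of differences, starting from the last known real value.
    return last_undifferenced + np.cumsum(diffs)


scaled, last_value, scaler = forward([10, 12, 15, 19])

# Pretend a model predicted two future steps in the transformed space.
restored = inverse([0.0, 0.0], last_value, scaler)
print(restored)
```

A prediction of 0 in the standardized space corresponds to the mean difference, so each restored step grows by that mean starting from the last observed value.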

Inputs

The second module is inputs. It takes tabular time series data and converts it into a format (input vectors X, output vectors y and the input for the predicted value x_input) that can be fed into machine learning models, for example in sklearn or tensorflow. It contains the functions make_sequences, create_inputs and create_tests_outputs.

Example for n_steps_in = 3 and n_steps_out = 1

From [[1], [2], [3], [4], [5], [6]]

Inputs: [[1, 2, 3], [2, 3, 4], [3, 4, 5]], Outputs: [[4], [5], [6]]
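The windowing in this small example can be sketched in a few lines of NumPy. This is an illustration of the idea only, not the library's make_sequences; `make_windows` is a hypothetical name.

```python
import numpy as np


def make_windows(series, n_steps_in, n_steps_out=1):
    """Split a univariate series into overlapping input/output windows."""
    series = np.asarray(series).ravel()
    # Number of positions where both the input and output window fit.
    n_samples = len(series) - n_steps_in - n_steps_out + 1
    X = np.array([series[i:i + n_steps_in] for i in range(n_samples)])
    y = np.array([series[i + n_steps_in:i + n_steps_in + n_steps_out]
                  for i in range(n_samples)])
    return X, y


X, y = make_windows([[1], [2], [3], [4], [5], [6]], n_steps_in=3, n_steps_out=1)
print(X)
print(y)
```

Each row of X holds three consecutive values and the matching row of y holds the value that follows them.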

Also multivariate data can be used.

import numpy as np

import mydatapreprocessing as mdp

data = np.array([[1, 2, 3, 4, 5, 6, 7, 8], [9, 10, 11, 12, 13, 14, 15, 16], [17, 18, 19, 20, 21, 22, 23, 24]]).T
# n_steps_in=3 and n_steps_out=2 match the results listed below
X, y, x_input, _ = mdp.inputs.make_sequences(data, n_steps_in=3, n_steps_out=2)

# From an array like this:

# data = array([[1, 9, 17],
#               [2, 10, 18],
#               [3, 11, 19],
#               [4, 12, 20],
#               [5, 13, 21],
#               [6, 14, 22],
#               [7, 15, 23],
#               [8, 16, 24]])

# the call produces these results (the data are serialized).

# X = array([[1, 2, 3, 9, 10, 11, 17, 18, 19],
#            [2, 3, 4, 10, 11, 12, 18, 19, 20],
#            [3, 4, 5, 11, 12, 13, 19, 20, 21],
#            [4, 5, 6, 12, 13, 14, 20, 21, 22]])

# y = array([[4, 5],
#            [5, 6],
#            [6, 7],
#            [7, 8]])

# x_input = array([[ 6,  7,  8, 14, 15, 16, 22, 23, 24]])
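The arrays listed above correspond to three input steps and two output steps, with each input window flattened column by column and the outputs taken from the first column. A NumPy sketch that reproduces them under those assumptions follows. It is not the library's implementation; `make_serialized_windows` and its `predicted_column` parameter are hypothetical.

```python
import numpy as np


def make_serialized_windows(data, n_steps_in, n_steps_out, predicted_column=0):
    """Build serialized multivariate inputs and outputs for one target column."""
    data = np.asarray(data)
    n_samples = data.shape[0] - n_steps_in - n_steps_out + 1
    # Transposing before ravel() places all steps of column 0 first,
    # then all steps of column 1, and so on.
    X = np.array([data[i:i + n_steps_in].T.ravel() for i in range(n_samples)])
    y = np.array([data[i + n_steps_in:i + n_steps_in + n_steps_out, predicted_column]
                  for i in range(n_samples)])
    # The most recent window, used as input for the next prediction.
    x_input = data[-n_steps_in:].T.ravel().reshape(1, -1)
    return X, y, x_input


data = np.arange(1, 25).reshape(3, 8).T  # columns 1..8, 9..16, 17..24
X, y, x_input = make_serialized_windows(data, n_steps_in=3, n_steps_out=2)
print(X)
print(y)
print(x_input)
```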

Generatedata

The third module is generatedata. It generates basic data such as sine, ramp and random signals. In the future, it will also import some real datasets for evaluating model KPIs.

Example

import mydatapreprocessing as mdp

data = mdp.generatedata.gen_sin(1000)
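For readers who want comparable test signals without the library, here is a minimal NumPy sketch of sine, ramp and random generators. The function names and parameters (`sketch_sine`, `sketch_ramp`, `sketch_random`, `periods`, `slope`, `seed`) are hypothetical and are not taken from generatedata.

```python
import numpy as np


def sketch_sine(n, periods=1.0):
    """n samples of a sine wave covering `periods` full cycles."""
    t = np.linspace(0.0, 2.0 * np.pi * periods, n)
    return np.sin(t)


def sketch_ramp(n, slope=1.0):
    """n samples of a linearly increasing signal starting at 0."""
    return slope * np.arange(n, dtype=float)


def sketch_random(n, seed=0):
    """n samples of standard normal noise, reproducible through `seed`."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(n)


sine = sketch_sine(1000, periods=2)
ramp = sketch_ramp(5)
noise = sketch_random(1000, seed=0)
print(sine.shape, ramp, noise.shape)
```

Fixing the seed keeps the random signal identical between runs, which is useful when comparing models on the same data.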
