Framework to manipulate dataframes fluidly in a pipeline.

# Data Pipe ML
Pipeline API to manipulate dataframes for machine learning.

Data Pipe is a framework that wraps pandas DataFrames to provide a more fluent way to manipulate data.

Basic concepts:
- Every operation is performed in place. The Data Pipe object keeps exactly one reference to a pandas DataFrame, which each operation updates.
- Every operation returns a reference to self, which allows methods to be chained fluently.
- Every method call is recorded internally, which makes the preparation pipeline easier to reproduce and understand. The exception is the "load" method.
- Calls to methods that Data Pipe does not implement are forwarded to the internal DataFrame object. This gives quick access to attributes such as shape and head, but those calls are not recorded and do not return Data Pipe objects. If you need a function that Data Pipe does not implement, use the Update method so you can keep manipulating the Data Pipe.
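The concepts above can be illustrated with a minimal sketch. This is not Data Pipe's actual implementation: the class `MiniPipe`, its `_record` helper, and the `history` attribute are hypothetical names, used only to show the pattern of a single internal DataFrame, methods that return `self`, a call log, and a `__getattr__` fallback to the wrapped DataFrame.

```python
import pandas as pd


class MiniPipe:
    """Hypothetical fluent wrapper; NOT the real DataPipe class."""

    def __init__(self, df):
        self._df = df        # the single DataFrame reference
        self.history = []    # (method name, args, kwargs) for each recorded call

    def _record(self, name, *args, **kwargs):
        self.history.append((name, args, kwargs))

    def drop(self, *columns):
        self._record("drop", *columns)
        self._df = self._df.drop(columns=list(columns))
        return self          # returning self enables chaining

    def drop_duplicates(self):
        self._record("drop_duplicates")
        self._df = self._df.drop_duplicates()
        return self

    def __getattr__(self, name):
        # Reached only when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        # Fall back to the DataFrame: not recorded, does not return a MiniPipe.
        return getattr(self._df, name)


df = pd.DataFrame({"id": [1, 1, 2], "tags": ["a", "a", "b"], "amount": [10, 10, 30]})
pipe = MiniPipe(df).drop("tags").drop_duplicates()

print(pipe.shape)    # delegated to the internal DataFrame
print(pipe.history)  # only the calls made through MiniPipe methods
```

In this sketch, `pipe.shape` works because of the fallback, but it leaves no trace in `history`, which mirrors the caveat about unrecorded calls.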

## Example

### Full pipeline with time split
```
>>> from datapipeml import DataPipe

>>> X, y = DataPipe.load("data/kiva_loans_sample.csv.gz")\
...     .anonymize("id")\
...     .set_index("id")\
...     .drop("tags")\
...     .drop_sparse()\
...     .drop_duplicates()\
...     .fill_null()\
...     .remove_outliers()\
...     .normalize()\
...     .set_one_hot()\
...     .split_train_test(by="date")

Anonymizing id
No sparse columns to drop
Found 0 duplicated rows
Fillings columns ['funded_amount', 'loan_amount', 'partner_id', 'term_in_months', 'lender_count']
Removing outliers from ['funded_amount', 'loan_amount', 'partner_id', 'term_in_months', 'lender_count']
Normalizing ['funded_amount', 'loan_amount', 'partner_id', 'term_in_months', 'lender_count']
Encoding columns ['activity', 'sector', 'country_code', 'country', 'currency', 'repayment_interval']

>>> X.keep_numerics()
>>> y.keep_numerics()

Dropping columns {'region', 'posted_time', 'date', 'funded_time', 'borrower_genders', 'disbursed_time', 'use'}
Dropping columns {'region', 'posted_time', 'date', 'funded_time', 'borrower_genders', 'disbursed_time', 'use'}

>>> print(X.summary())
___________________________________________________________|
Method Name        |Args               |Kwargs             |
___________________________________________________________|
anonymize          |('id',)            |{}                 |
set_index          |('id',)            |{}                 |
drop               |('tags',)          |{}                 |
drop_sparse        |()                 |{}                 |
drop_duplicates    |()                 |{}                 |
fill_null          |()                 |{}                 |
remove_outliers    |()                 |{}                 |
normalize          |()                 |{}                 |
set_one_hot        |()                 |{}                 |
split_train_test   |()                 |{'by': 'date'}     |
keep_numerics      |()                 |{}                 |
___________________________________________________________|
```
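The pipeline above ends with `split_train_test(by="date")`. How Data Pipe picks the cut point is not documented here. A common approach to a time-based split, sketched below in plain pandas, is to sort by the time column and hold out the most recent rows. The helper name `time_split`, the `test_fraction` parameter, and its 20% default are assumptions made for illustration.

```python
import pandas as pd


def time_split(df, by, test_fraction=0.2):
    """Hypothetical helper: oldest rows go to train, newest rows go to test."""
    ordered = df.sort_values(by, kind="stable")
    n_test = int(round(len(ordered) * test_fraction))
    cut = len(ordered) - n_test
    return ordered.iloc[:cut], ordered.iloc[cut:]


loans = pd.DataFrame({
    "date": pd.to_datetime(
        ["2017-03-01", "2017-01-01", "2017-05-01", "2017-02-01", "2017-04-01"]
    ),
    "loan_amount": [300, 100, 500, 200, 400],
})

train, test = time_split(loans, by="date")
print(train["loan_amount"].tolist())
print(test["loan_amount"].tolist())
```

Splitting on time rather than at random keeps every test row later than every training row, which avoids leaking future information into training.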

### Create target column and stratified folds
```
>>> folds = DataPipe.load("data/kiva_loans_sample.csv.gz")\
...     .set_index("id")\
...     .drop_duplicates()\
...     .fill_null()\
...     .remove_outliers()\
...     .normalize()\
...     .set_one_hot()\
...     .create_column("high_loan", lambda x: 1 if x["loan_amount"] > 2000 else 0)\
...     .keep_numerics()\
...     .create_folds(stratify_by="high_loan")

Found 0 duplicated rows
Fillings columns ['funded_amount', 'loan_amount', 'partner_id', 'term_in_months', 'lender_count']
Removing outliers from ['funded_amount', 'loan_amount', 'partner_id', 'term_in_months', 'lender_count']
Normalizing ['funded_amount', 'loan_amount', 'partner_id', 'term_in_months', 'lender_count']
One-hot encoding columns ['activity', 'sector', 'country_code', 'country', 'currency', 'borrower_genders', 'repayment_interval']
Creating column high_loan
Dropping columns {'tags', 'funded_time', 'disbursed_time', 'region', 'use', 'posted_time', 'date'}
```
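For readers who want to see the idea behind the last two steps, here is a rough plain-pandas sketch. A row-wise lambda like the one passed to `create_column` corresponds to `DataFrame.apply(..., axis=1)`. The fold assignment is one simple way to stratify, not necessarily what `create_folds` does: the helper `assign_stratified_folds` and its `n_folds` and `seed` parameters are hypothetical.

```python
import pandas as pd


def assign_stratified_folds(df, stratify_by, n_folds=3, seed=0):
    """Hypothetical helper: return a fold number (0..n_folds-1) for each row.

    Rows are shuffled, then the rows of each class are dealt round-robin
    across folds, so every fold receives a similar class mix.
    """
    shuffled = df.sample(frac=1.0, random_state=seed)
    folds = shuffled.groupby(stratify_by).cumcount() % n_folds
    return folds.reindex(df.index)  # restore the original row order


loans = pd.DataFrame(
    {"loan_amount": [500, 2500, 800, 3000, 1200, 4000, 300, 2200, 900]}
)

# Row-wise target column, using the same lambda as the example above.
loans["high_loan"] = loans.apply(
    lambda x: 1 if x["loan_amount"] > 2000 else 0, axis=1
)
loans["fold"] = assign_stratified_folds(loans, stratify_by="high_loan")

# Rows per fold and class; counts within a class differ by at most one.
print(pd.crosstab(loans["fold"], loans["high_loan"]))
```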
