PdpCLI is a pandas DataFrame processing CLI tool which enables you to build a pandas pipeline from a configuration file.
PdpCLI
Introduction
PdpCLI is a pandas DataFrame processing CLI tool that lets you build a pandas pipeline, powered by pdpipe, from a configuration file. You can also extend pipeline stages and data readers / writers with your own Python scripts.
Features
- Process pandas DataFrames from the CLI without writing Python scripts
- Support multiple configuration file formats: YAML, JSON, Jsonnet
- Read / write data files in the following formats: CSV, TSV, JSONL, XLSX
- Extensible pipeline and data readers / writers
Tutorial
Basic Usage
- Write a pipeline config file `config.yml` like the one below. The `type` fields under `pipeline` correspond to the snake-cased class names of the `PdpipelineStages`. The other fields, such as `stages` and `columns`, specify the parameters of the `__init__` methods of the corresponding classes. Internally, this configuration file is converted to Python objects by `colt`.
pipeline:
  type: pipeline
  stages:
    drop_columns:
      type: col_drop
      columns: foo
    encode:
      type: one_hot_encode
      columns: sex
    tokenize:
      type: tokenize_text
      columns: profile
    vectorize:
      type: tfidf_vectorize_token_lists
      column: profile
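The naming convention used by the `type` fields (the snake-cased form of a stage class name) can be illustrated with a small helper. This is only a sketch of the convention: `camel_to_snake` is a hypothetical function written for this example, not part of PdpCLI or pdpipe, and the class names in the loop are inferred from the `type` values in the config rather than read from PdpCLI itself.

```python
import re


def camel_to_snake(class_name: str) -> str:
    """Convert a CamelCase class name to snake_case (hypothetical helper)."""
    # Insert an underscore before every capital letter except the first,
    # then lowercase the whole string.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()


# Class names inferred from the `type` values in config.yml (an assumption).
for name in ["ColDrop", "OneHotEncode", "TokenizeText", "TfidfVectorizeTokenLists"]:
    print(f"{name} -> {camel_to_snake(name)}")
```

Under this convention, a `type` of `col_drop` would name a class called `ColDrop`, and the remaining keys of that entry would be passed to its `__init__` method.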
- Build a pipeline by training on `train.csv`. The following command generates a pickled pipeline file `pipeline.pkl` after training.
$ pdp build config.yml pipeline.pkl --input-file train.csv
- Apply the fitted pipeline to `test.csv` and write the processed file `processed_test.jsonl` with the following command. PdpCLI automatically detects the output file format from the file name. In this example, the processed DataFrame is exported in JSONL format.
$ pdp apply pipeline.pkl test.csv --output-file processed_test.jsonl
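Detecting the output format from the file name, as in the apply step above, could look roughly like the following. This is a minimal sketch, not PdpCLI's actual code: `FORMATS` and `detect_format` are hypothetical names, and the table only covers the four formats listed under Features.

```python
from pathlib import Path

# Hypothetical suffix-to-format table for the formats listed under Features.
FORMATS = {".csv": "csv", ".tsv": "tsv", ".jsonl": "jsonl", ".xlsx": "xlsx"}


def detect_format(file_name: str) -> str:
    """Return the data format implied by the file extension."""
    suffix = Path(file_name).suffix.lower()
    if suffix not in FORMATS:
        raise ValueError(f"unsupported file extension: {suffix!r}")
    return FORMATS[suffix]


print(detect_format("processed_test.jsonl"))  # jsonl
```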
- You can also directly run the pipeline from a config file if you don't need to fit the pipeline.
$ pdp apply config.yml test.csv --output-file processed_test.jsonl
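The difference between building first and applying directly comes down to whether a stage has to learn something from the training data. The toy example below sketches the build-then-apply flow with plain Python and `pickle`. It is an illustration only: `fit_one_hot` and `apply_one_hot` are hypothetical stand-ins for a fittable stage, rows are plain dicts instead of a DataFrame, and none of this is PdpCLI's or pdpipe's implementation.

```python
import pickle


def fit_one_hot(rows, column):
    """Learn the categories of `column` from training rows (toy stand-in for fitting)."""
    return {"column": column, "categories": sorted({row[column] for row in rows})}


def apply_one_hot(state, rows):
    """Replace the fitted column with one 0/1 field per learned category."""
    column = state["column"]
    out = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        for category in state["categories"]:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        out.append(new_row)
    return out


train = [{"sex": "female"}, {"sex": "male"}]
test = [{"sex": "male"}]

# "build": fit on the training rows and serialize the learned state.
blob = pickle.dumps(fit_one_hot(train, "sex"))

# "apply": restore the learned state and transform unseen rows.
state = pickle.loads(blob)
print(apply_one_hot(state, test))  # [{'sex_female': 0, 'sex_male': 1}]
```

Because the categories are learned from the training rows, the test row still gets a `sex_female` field even though that value never appears in it. A stage with no learned state, such as dropping a column, has nothing to fit and can be run straight from the config file.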