PdpCLI
Introduction
PdpCLI is a pandas DataFrame processing CLI tool which enables you to build a pandas pipeline, powered by pdpipe, from a configuration file. You can also extend the pipeline stages and data readers / writers with your own Python scripts.
Features
- Process pandas DataFrames from the CLI without writing Python scripts
- Support multiple configuration file formats: YAML, JSON, Jsonnet
- Read / write data files in the following formats: CSV, TSV, JSONL, XLSX
- Extensible pipeline and data readers / writers
Installation
Installing the library is simple using pip.
$ pip install pdpcli
Tutorial
Basic Usage
- Write a pipeline config file `config.yml` like the one below. The `type` fields under `pipeline` correspond to the snake-cased class names of the `PdPipelineStage` classes. The other fields, such as `stage` and `columns`, specify the parameters of the `__init__` methods of the corresponding classes. Internally, this configuration file is converted to Python objects by `colt`.
pipeline:
type: pipeline
stages:
drop_columns:
type: col_drop
columns: foo
encode:
type: one_hot_encode
columns: sex
tokenize:
type: tokenize_text
columns: profile
vectorize:
type: tfidf_vectorize_token_lists
column: profile
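Since JSON and Jsonnet configuration files are also supported (see Features above), the same pipeline can equivalently be written as JSON, for example:

```json
{
  "pipeline": {
    "type": "pipeline",
    "stages": {
      "drop_columns": {"type": "col_drop", "columns": "foo"},
      "encode": {"type": "one_hot_encode", "columns": "sex"},
      "tokenize": {"type": "tokenize_text", "columns": "profile"},
      "vectorize": {"type": "tfidf_vectorize_token_lists", "column": "profile"}
    }
  }
}
```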
- Build a pipeline by training on `train.csv`. The following command generates a pickled pipeline file `pipeline.pkl` after training.
$ pdp build config.yml pipeline.pkl --input-file train.csv
- Apply the fitted pipeline to `test.csv` and write the processed output to `processed_test.jsonl` with the following command. PdpCLI automatically detects the output file format from the file name; in this example, the processed DataFrame will be exported in the JSONL format.
$ pdp apply pipeline.pkl test.csv --output-file processed_test.jsonl
- You can also directly run the pipeline from a config file if you don't need to fit the pipeline.
$ pdp apply config.yml test.csv --output-file processed_test.jsonl
- It is possible to change parameters via the command line:
$ pdp apply config.yml test.csv pipeline.stages.drop_columns.columns=age
Data Reader / Writer
Plugins
By using plugins, you can extend PdpCLI. The plugin feature enables you to use your own pipeline stages, data reader / writer and commands.
Add a new stage
- Write your plugin script `mypdp.py` like the following. `PrintStage` just prints the DataFrame to stdout.
import pdpcli
@pdpcli.PdPipelineStage.register("print")
class PrintStage(pdpcli.PdPipelineStage):
def _prec(self, df):
return True
def _transform(self, df, verbose):
print(df.to_string(index=False))
return df
- Update `config.yml` to use your plugin.
pipeline:
type: pipeline
stages:
drop_columns:
...
print:
type: print
encode:
...
- Execute the command with `--module mypdp`, and you can see the DataFrame after `drop_columns`.
$ pdp apply config.yml test.csv --module mypdp
Add a new command
You can also add new commands, not only stages.
- Add the following script to `mypdp.py`. This `greet` command prints out a greeting message with your name.
@pdpcli.Subcommand.register(
name="greet",
description="say hello",
help="say hello",
)
class GreetCommand(pdpcli.Subcommand):
requires_plugins = False
def set_arguments(self):
self.parser.add_argument("--name", default="world")
def run(self, args):
print(f"Hello, {args.name}!")
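The `register` decorator used above follows a common registry pattern: decorated subclasses are collected into a mapping keyed by name, so the CLI can look them up at dispatch time. A generic sketch of that pattern (illustrative names only, not PdpCLI's actual implementation):

```python
class Registrable:
    """Base class whose subclasses can be registered under a string name."""

    _registry = {}

    @classmethod
    def register(cls, name, **metadata):
        def decorator(subclass):
            cls._registry[name] = subclass
            return subclass
        return decorator

    @classmethod
    def by_name(cls, name):
        return cls._registry[name]


@Registrable.register("greet")
class Greet(Registrable):
    def run(self, name="world"):
        return f"Hello, {name}!"
```

A dispatcher can then resolve `Registrable.by_name("greet")`, instantiate it, and call `run()` for the matching subcommand.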
- To register this command, you need to create the `.pdpcli_plugins` file. Due to the module import order, the `--module` option is unavailable for command registration.
$ echo "mypdp" > .pdpcli_plugins
- Run the following command and you will get a message like the one below. By using the `.pdpcli_plugins` file, you don't need to pass the `--module` option for each execution.
$ pdp greet --name altescy
Hello, altescy!
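Conceptually, a plugin file like `.pdpcli_plugins` just lists module names to import before the CLI parses its arguments. A sketch of how such a file could be processed (an assumption about the mechanism, not PdpCLI's actual loader):

```python
import importlib
import os


def load_plugins(path=".pdpcli_plugins"):
    """Import every module named (one per line) in the plugin file, if present."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        names = [line.strip() for line in f if line.strip()]
    # Importing each module triggers its @register decorators as a side effect
    return [importlib.import_module(name) for name in names]
```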