PdpCLI is a pandas DataFrame processing CLI tool which enables you to build a pandas pipeline from a configuration file.
PdpCLI
Introduction
PdpCLI is a pandas DataFrame processing CLI tool which enables you to build a pandas pipeline powered by pdpipe from a configuration file. You can also extend pipeline stages and data readers / writers by using your own python scripts.
Features
- Process pandas DataFrames from the CLI without writing Python scripts
- Support multiple configuration file formats: YAML, JSON, Jsonnet
- Read / write data files in the following formats: CSV, TSV, JSONL
- Import / export data with multiple protocols: S3 / Database (MySQL, Postgres, SQLite, ...) / HTTP(S)
- Extensible pipeline and data readers / writers
Installation
Installing the library is simple using pip.
$ pip install "pdpcli[all]"
Tutorial
Basic Usage
- Write a pipeline config file config.yml like below. The type fields under pipeline correspond to the snake-cased class names of the PdpipelineStages. The other fields such as stage and columns are the parameters of the __init__ methods of the corresponding classes. Internally, this configuration file is converted to Python objects by colt.
pipeline:
  type: pipeline
  stages:
    drop_columns:
      type: col_drop
      columns:
        - name
        - job

    encode:
      type: one_hot_encode
      columns: sex

    tokenize:
      type: tokenize_text
      columns: content

    vectorize:
      type: tfidf_vectorize_token_lists
      column: content
      max_features: 10
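The naming rule for type can be illustrated with a short sketch. The class names below are assumptions inferred from the type values in the config, and to_snake_case is a hypothetical helper written for illustration, not the function that PdpCLI or colt uses internally.

```python
import re


def to_snake_case(class_name: str) -> str:
    """Convert a CamelCase class name to snake_case (illustrative helper)."""
    # Put an underscore between a lowercase letter or digit and the capital after it.
    with_underscores = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", class_name)
    return with_underscores.lower()


# Class names assumed from the `type` values used in config.yml.
for name in ["ColDrop", "OneHotEncode", "TokenizeText", "TfidfVectorizeTokenLists"]:
    print(name, "->", to_snake_case(name))
```

Reading the mapping in this direction makes it easier to guess the type value for a stage class you already know by name.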
- Build a pipeline by training on train.csv. The following command generates a pickled pipeline file pipeline.pkl after training. If you specify a URL as the file path, the file is automatically downloaded and cached.
$ pdp build config.yml pipeline.pkl --input-file https://github.com/altescy/pdpcli/raw/main/tests/fixture/data/train.csv
- Apply the fitted pipeline to test.csv and write the processed output to processed_test.jsonl with the following command. PdpCLI automatically detects the output file format based on the file name. In this example, the processed DataFrame is exported in the JSON Lines format.
$ pdp apply pipeline.pkl https://github.com/altescy/pdpcli/raw/main/tests/fixture/data/test.csv --output-file processed_test.jsonl
- You can also directly run the pipeline from a config file if you don't need to fit the pipeline.
$ pdp apply config.yml test.csv --output-file processed_test.jsonl
- It is possible to override or add parameters via command line:
$ pdp apply config.yml test.csv pipeline.stages.drop_columns.column=name
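To make the dotted-key override syntax concrete, here is a minimal sketch of how a key=value argument could be merged into a nested config. The apply_override function and the max_features override are hypothetical and only illustrate the idea; PdpCLI's own parsing may differ, for example in how it converts value types.

```python
from typing import Any, Dict


def apply_override(config: Dict[str, Any], override: str) -> None:
    """Set a nested value in place from a "dotted.key=value" string."""
    dotted_key, value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        # Create intermediate mappings so new parameters can be added too.
        node = node.setdefault(key, {})
    # The value is kept as a string; no type conversion is attempted here.
    node[leaf] = value


config = {
    "pipeline": {
        "type": "pipeline",
        "stages": {
            "vectorize": {
                "type": "tfidf_vectorize_token_lists",
                "column": "content",
                "max_features": 10,
            }
        },
    }
}
apply_override(config, "pipeline.stages.vectorize.max_features=20")
print(config["pipeline"]["stages"]["vectorize"])
```

An existing key is overwritten and a missing key is created, which matches the "override or add" wording above.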
Data Reader / Writer
PdpCLI automatically detects a suitable data reader / writer based on the file name.
If you need a different data reader / writer, add reader or writer configs to config.yml.
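File-name based detection can be pictured as a lookup on the file extension. The sketch below is an assumption about the general approach, not PdpCLI's actual code; the detect_format function and the FORMAT_BY_SUFFIX table are hypothetical names, and the formats come from the feature list.

```python
from pathlib import Path

# Formats taken from the feature list; the mapping itself is an assumption.
FORMAT_BY_SUFFIX = {".csv": "csv", ".tsv": "tsv", ".jsonl": "jsonl"}


def detect_format(file_name: str) -> str:
    """Guess a data format from the file extension (illustrative only)."""
    suffix = Path(file_name).suffix.lower()
    try:
        return FORMAT_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"cannot detect format of {file_name!r}") from None


print(detect_format("processed_test.jsonl"))  # jsonl
print(detect_format("test.csv"))  # csv
```

A file name without a known extension, such as query.sql, is the kind of case where an explicit reader or writer config is needed.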
The following config is an example that uses the SQL data reader.
The SQL reader fetches records from the specified database and converts them into a pandas DataFrame.
reader:
  type: sql
  dsn: postgres://${env:POSTGRES_USER}:${env:POSTGRES_PASSWORD}@your.postgres.server/your_database
The config file is interpreted by OmegaConf, so ${env:...} expressions are interpolated with environment variables.
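The effect of this interpolation can be approximated in plain Python. The sketch below uses only the standard library and a hypothetical interpolate_env function; it mimics the visible result for a DSN like the one above and is not how OmegaConf is implemented.

```python
import os
import re

# Matches ${env:NAME} where NAME looks like an environment variable name.
ENV_PATTERN = re.compile(r"\$\{env:([A-Za-z_][A-Za-z0-9_]*)\}")


def interpolate_env(text: str) -> str:
    """Replace each ${env:NAME} with the value of the environment variable NAME."""

    def lookup(match):
        name = match.group(1)
        if name not in os.environ:
            raise KeyError(f"environment variable {name} is not set")
        return os.environ[name]

    return ENV_PATTERN.sub(lookup, text)


# Example values only; in practice these come from the shell environment.
os.environ["POSTGRES_USER"] = "user"
os.environ["POSTGRES_PASSWORD"] = "password"
dsn = "postgres://${env:POSTGRES_USER}:${env:POSTGRES_PASSWORD}@your.postgres.server/your_database"
print(interpolate_env(dsn))
```

Keeping credentials in environment variables means the config file itself can be committed without secrets.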
Prepare your SQL file query.sql to fetch data from the database:
select * from your_table limit 1000
You can execute the pipeline with SQL data reader via:
$ POSTGRES_USER=user POSTGRES_PASSWORD=password pdp apply config.yml query.sql
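Conceptually, the SQL reader runs the query and turns the result rows into tabular records. The sketch below shows that flow with an in-memory SQLite database from the standard library instead of Postgres; the table contents are made-up sample data, and the code stops at a list of records rather than calling into PdpCLI.

```python
import sqlite3

# In-memory database with made-up sample rows, standing in for a real server.
connection = sqlite3.connect(":memory:")
connection.row_factory = sqlite3.Row  # lets each row be converted to a dict
connection.execute("create table your_table (name text, age integer)")
connection.executemany(
    "insert into your_table values (?, ?)",
    [("alice", 30), ("bob", 25)],
)

query = "select * from your_table limit 1000"  # same text as query.sql
records = [dict(row) for row in connection.execute(query)]
connection.close()
print(records)
```

Passing such a list of dicts to pandas.DataFrame yields one column per key, which is the shape a pipeline expects as input.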
Plugins
By using plugins, you can extend PdpCLI. The plugin feature enables you to use your own pipeline stages, data readers / writers and commands.
Add a new stage
- Write your plugin script mypdp.py like below. Stage.register("<stage-name>") registers your pipeline stage, and you can specify this stage by writing type: <stage-name> in your config file.
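import pdpcli


@pdpcli.Stage.register("print")
class PrintStage(pdpcli.Stage):
    def _prec(self, df):
        # Precondition check: this stage accepts any DataFrame.
        return True

    def _transform(self, df, verbose):
        # Print the DataFrame, then pass it through unchanged.
        print(df.to_string(index=False))
        return df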
- Update
config.yml
to use your plugin.
pipeline:
  type: pipeline
  stages:
    drop_columns:
      ...

    print:
      type: print

    encode:
      ...
- Execute the command with --module mypdp and you can see the processed DataFrame after running drop_columns.
$ pdp apply config.yml test.csv --module mypdp
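The register decorator follows a common registry pattern: a name given in the config is looked up in a table that maps names to classes. The sketch below is a generic, self-contained illustration; Registrable, by_name, and the transform method are hypothetical names and do not describe the internals of pdpcli or colt.

```python
from typing import Callable, Dict, Type


class Registrable:
    """Base class that keeps a name -> subclass registry (hypothetical)."""

    _registry: Dict[str, Type["Registrable"]] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[type], type]:
        def decorator(subclass: type) -> type:
            # Remember the class under the given name and return it unchanged.
            cls._registry[name] = subclass
            return subclass

        return decorator

    @classmethod
    def by_name(cls, name: str) -> Type["Registrable"]:
        return cls._registry[name]


@Registrable.register("print")
class PrintStage(Registrable):
    def transform(self, rows):
        print(rows)
        return rows


# A config entry such as `type: print` can now be resolved to a class.
stage_class = Registrable.by_name("print")
stage = stage_class()
stage.transform([{"sex": "female"}])
```

Because registration happens when the module is imported, the plugin module has to be loaded before the config is resolved, which is the role of the --module option.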
Add a new command
You can add new commands as well as stages.
- Add the following script to mypdp.py. This greet command prints out a greeting message with your name.
@pdpcli.Subcommand.register(
    name="greet",
    description="say hello",
    help="say hello",
)
class GreetCommand(pdpcli.Subcommand):
    requires_plugins = False

    def set_arguments(self):
        self.parser.add_argument("--name", default="world")

    def run(self, args):
        print(f"Hello, {args.name}!")
- To register this command, you need to create a .pdpcli_plugins file that lists module names, one per line. Due to the module import order, the --module option is unavailable for command registration.
$ echo "mypdp" > .pdpcli_plugins
- Run the following command to get the message below. With the .pdpcli_plugins file, you do not need to add the --module option to the command line for each execution.
$ pdp greet --name altescy
Hello, altescy!
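A plugins file with one module name per line can be processed with a few lines of standard-library code. The sketch below is an assumption about the general mechanism, with a hypothetical load_plugins function; it uses standard-library modules as stand-ins for mypdp and does not reproduce PdpCLI's own loader.

```python
import importlib
import tempfile
from pathlib import Path
from typing import List


def load_plugins(plugins_file: Path) -> List[str]:
    """Import every module named in the file, one module name per line."""
    loaded: List[str] = []
    if not plugins_file.exists():
        return loaded
    for line in plugins_file.read_text().splitlines():
        module_name = line.strip()
        if not module_name:
            continue  # skip blank lines
        # Importing the module runs its register decorators as a side effect.
        importlib.import_module(module_name)
        loaded.append(module_name)
    return loaded


# Demonstration with standard-library modules standing in for "mypdp".
with tempfile.TemporaryDirectory() as directory:
    plugins_file = Path(directory) / ".pdpcli_plugins"
    plugins_file.write_text("json\n\ncsv\n")
    print(load_plugins(plugins_file))  # ['json', 'csv']
```

Loading plugins from a file before argument parsing is what allows new subcommands to exist by the time the command line is read.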