PdpCLI
Introduction
PdpCLI is a pandas DataFrame processing CLI tool which enables you to build a pandas pipeline powered by pdpipe from a configuration file. You can also extend pipeline stages and data readers / writers with your own Python scripts.
Features
- Process pandas DataFrames from the CLI without writing Python scripts
- Support multiple configuration file formats: YAML, JSON, Jsonnet
- Read / write data files in the following formats: CSV, TSV, JSON, JSONL, pickled DataFrame
- Import / export data with multiple protocols: S3 / Database (MySQL, Postgres, SQLite, ...) / HTTP(S)
- Extensible pipeline and data readers / writers
Installation
Installing the library is simple using pip.
$ pip install "pdpcli[all]"
Tutorial
Basic Usage
- Write a pipeline config file config.yml like below. The type fields under pipeline correspond to the snake-cased class names of the PdPipelineStages. Other fields such as stage and columns are the parameters of the __init__ methods of the corresponding classes. Internally, this configuration file is converted to Python objects by colt. A rough pdpipe equivalent of the resulting pipeline is sketched after these steps.
pipeline:
  type: pipeline
  stages:
    drop_columns:
      type: col_drop
      columns:
        - name
        - job
    encode:
      type: one_hot_encode
      columns: sex
    tokenize:
      type: tokenize_text
      columns: content
    vectorize:
      type: tfidf_vectorize_token_lists
      column: content
      max_features: 10
- Build a pipeline by training on train.csv. The following command generates a pickled pipeline file pipeline.pkl after training. If you specify a URL as the file path, the file will be automatically downloaded and cached.
$ pdp build config.yml pipeline.pkl --input-file https://github.com/altescy/pdpcli/raw/main/tests/fixture/data/train.csv
- Apply the fitted pipeline to test.csv and write the processed output to processed_test.jsonl with the following command. PdpCLI automatically detects the output file format based on the file name. In this example, the processed DataFrame will be exported in the JSON Lines format.
$ pdp apply pipeline.pkl https://github.com/altescy/pdpcli/raw/main/tests/fixture/data/test.csv --output-file processed_test.jsonl
- You can also run the pipeline directly from a config file without fitting the pipeline first.
$ pdp apply config.yml test.csv --output-file processed_test.jsonl
- It is possible to override or add parameters by adding command-line arguments:
$ pdp apply config.yml test.csv pipeline.stages.drop_columns.column=name
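For reference, here is a rough pdpipe sketch of what the steps above do: building the pipeline described in config.yml, fitting it on train.csv, pickling it, and applying it to test.csv. This is an illustration only, not PdpCLI's actual implementation (PdpCLI constructs the stages from the config via colt); the stage classes and parameters simply mirror the config above.
# Sketch only: roughly what `pdp build` / `pdp apply` do with the config above.
import pickle

import pandas as pd
import pdpipe as pdp

# The pipeline described by config.yml, written directly against pdpipe.
pipeline = pdp.PdPipeline([
    pdp.ColDrop(["name", "job"]),                               # drop_columns
    pdp.OneHotEncode("sex"),                                    # encode
    pdp.TokenizeText("content"),                                # tokenize
    pdp.TfidfVectorizeTokenLists("content", max_features=10),   # vectorize
])

# pdp build: fit the pipeline on the training data and pickle it.
train_df = pd.read_csv("train.csv")
pipeline.fit(train_df)
with open("pipeline.pkl", "wb") as f:
    pickle.dump(pipeline, f)

# pdp apply: transform new data with the fitted pipeline and export it.
test_df = pd.read_csv("test.csv")
processed = pipeline.apply(test_df)
processed.to_json("processed_test.jsonl", orient="records", lines=True)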
Data Reader / Writer
PdpCLI automatically detects a suitable data reader / writer based on a given file name.
If you need to use a different data reader / writer, add a reader or writer config to config.yml.
The following config is an example of using the SQL data reader. The SQL reader fetches records from the specified database and converts them into a pandas DataFrame.
reader:
  type: sql
  dsn: postgres://${env:POSTGRES_USER}:${env:POSTGRES_PASSWORD}@your.postgres.server/your_database
Config files are interpreted by OmegaConf, so ${env:...} is resolved from environment variables.
Prepare your SQL file query.sql to fetch data from the database:
select * from your_table limit 1000
You can execute the pipeline with SQL data reader via:
$ POSTGRES_USER=user POSTGRES_PASSWORD=password pdp apply config.yml query.sql
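Conceptually, this is similar to the following plain pandas / SQLAlchemy snippet. It is only a sketch of the idea, not PdpCLI's actual reader implementation; the DSN and query file mirror the example above.
import pandas as pd
from sqlalchemy import create_engine

# Sketch only: run the query from query.sql against the configured
# database and return the result as a DataFrame.
engine = create_engine("postgresql://user:password@your.postgres.server/your_database")
with open("query.sql") as f:
    df = pd.read_sql(f.read(), con=engine)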
Plugins
By using plugins, you can extend PdpCLI. This plugin feature enables you to use your own pipeline stages, data readers / writers and commands.
Add a new stage
- Write your plugin script mypdp.py like below. Stage.register("<stage-name>") registers your pipeline stages, and you can specify these stages by writing type: <stage-name> in your config file.
import pdpcli


@pdpcli.Stage.register("print")
class PrintStage(pdpcli.Stage):
    def _prec(self, df):
        # Precondition check: this stage can be applied to any DataFrame.
        return True

    def _transform(self, df, verbose):
        # Print the DataFrame and pass it through unchanged.
        print(df.to_string(index=False))
        return df
- Update config.yml to use your plugin.
pipeline:
  type: pipeline
  stages:
    drop_columns:
      ...
    print:
      type: print
    encode:
      ...
- Execute the command with --module mypdp, and you will see the processed DataFrame printed after the drop_columns stage has run.
$ pdp apply config.yml test.csv --module mypdp
Add a new command
You can also add new commands, not only stages.
- Add the following script to mypdp.py. This greet command prints out a greeting message with your name.
@pdpcli.Subcommand.register(
    name="greet",
    description="say hello",
    help="say hello",
)
class GreetCommand(pdpcli.Subcommand):
    requires_plugins = False

    def set_arguments(self):
        # Define command-line arguments for `pdp greet`.
        self.parser.add_argument("--name", default="world")

    def run(self, args):
        # Called when `pdp greet` is executed.
        print(f"Hello, {args.name}!")
- To register this command, you need to create a .pdpcli_plugins file that lists module names, one per line. Due to the module import order, the --module option cannot be used to register commands.
$ echo "mypdp" > .pdpcli_plugins
- Run the following command to get a message like the one below. With the .pdpcli_plugins file, you do not need to add the --module option to the command line for each execution.
$ pdp greet --name altescy
Hello, altescy!
File details
Details for the file pdpcli-0.4.1.tar.gz.
File metadata
- Download URL: pdpcli-0.4.1.tar.gz
- Upload date:
- Size: 16.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.1.10 CPython/3.8.2 Linux/5.4.0-1059-azure
File hashes
Algorithm | Hash digest
---|---
SHA256 | 8989c222d32a8dfee7012e5de77cd7e4a7d6b74fe49b78df12eca8e4234fb0ed
MD5 | 55e75e44d4aeb3e87b05bc3a87a09c6e
BLAKE2b-256 | af93114591825dbe17c545bc1a6cfc1999c42e618bde3dc0948097473b797a77
File details
Details for the file pdpcli-0.4.1-py3-none-any.whl.
File metadata
- Download URL: pdpcli-0.4.1-py3-none-any.whl
- Upload date:
- Size: 20.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.1.10 CPython/3.8.2 Linux/5.4.0-1059-azure
File hashes
Algorithm | Hash digest
---|---
SHA256 | a811bc2cb316144c0dac6313d10f990cbdfbfc6d3959ecc51bf783bee85fdf43
MD5 | 895dcef444dd54d7a48d0b53a9ba2562
BLAKE2b-256 | 62a6393db3cbc225f4128d3d2bda91588493ed21dd642b340ccaafcf6025f27f