Skip to main content

PdpCLI is a pandas DataFrame processing CLI tool which enables you to build a pandas pipeline from a configuration file.

Project description

PdpCLI

Actions Status Python version PyPI version License

Quick Links

Introduction

PdpCLI is a pandas DataFrame processing CLI tool which enables you to build a pandas pipeline powered by pdpipe from a configuration file. You can also extend pipeline stages and data readers / writers by using your own python scripts.

Features

  • Process pandas DataFrame from CLI without wrting Python scripts
  • Support multiple configuration file formats: YAML, JSON, Jsonnet
  • Read / write data files in the following formats: CSV, TSV, JSONL
  • Import / export data with multiple protocols: S3 / Databse (MySQL, Postgres, SQLite, ...) / HTTP(S)
  • Extensible pipeline and data readers / writers

Installation

Installing the library is simple using pip.

$ pip install "pdpcli[all]"

Tutorial

Basic Usage

  1. Write a pipeline config file config.yml like below. The type fields under pipeline correspond to the snake-cased class names of the PdpipelineStages. The other fields such as stage and columns are the parameters of the __init__ methods of the corresponging classes. Internally, this configuration file is converted to Python objects by colt.
pipeline:
  type: pipeline
  stages:
    drop_columns:
      type: col_drop
      columns:
        - name
        - job

    encode:
      type: one_hot_encode
      columns: sex

    tokenize:
      type: tokenize_text
      columns: content

    vectorize:
      type: tfidf_vectorize_token_lists
      column: content
      max_features: 10
  1. Build a pipeline by training on train.csv. The following command generage a pickled pipeline file pipeline.pkl after training. If you specify URL for file path, it will be automatically downloaded and cached.
$ pdp build config.yml pipeline.pkl --input-file https://github.com/altescy/pdpcli/raw/main/tests/fixture/data/train.csv
  1. Apply the fitted pipeline to test.csv and get output of the processed file processed_test.jsonl by the following command. PdpCLI automatically detects the output file format based on the file name. In this example, the processed DataFrame will be exported as the JSON-Lines format.
$ pdp apply pipeline.pkl https://github.com/altescy/pdpcli/raw/main/tests/fixture/data/test.csv --output-file processed_test.jsonl
  1. You can also directly run the pipeline from a config file if you don't need to fit the pipeline.
$ pdp apply config.yml test.csv --output-file processed_test.jsonl
  1. It is possible to override or add parameters via command line:
pdp apply config.yml test.csv pipeline.stages.drop_columns.column=name

Data Reader / Writer

PdpCLI automatically detects a suitable data reader / writer based on the file name. If you need to use the other data reader / writer, add reader or writer configs to config.yml. The following config is an exmaple to use SQL data reader. SQL reader fetches records from the specified database and converts them into a pandas DataFrame.

reader:
    type: sql
    dsn: postgres://${env:POSTGRES_USER}:${env:POSTGRES_PASSWORD}@your.posgres.server/your_database

The config file is interpreted by OmegaConf, so ${env:...}s are interpolated by environment variables.

Prepare yuor SQL file query.sql to fetch data from the database:

select * from your_table limit 1000

You can execute the pipeline with SQL data reader via:

$ POSTGRES_USER=user POSTGRES_PASSWORD=password pdp apply config.yml query.sql

Plugins

By using plugins, you can extend PdpCLI. The plugin feature enables you to use your own pipeline stages, data readers / writers and commands.

Add a new stage

  1. Write your plugin script mypdp.py like below. Stage.register("<stage-name>") registers your pipeline stages, and you can specify these stages by writing type: <stage-name> in your config file.
import pdpcli

@pdpcli.Stage.register("print")
class PrintStage(pdpcli.Stage):
    def _prec(self, df):
        return True

    def _transform(self, df, verbose):
        print(df.to_string(index=False))
        return df
  1. Update config.yml to use your plugin.
pipeline:
    type: pipeline
    stages:
        drop_columns:
        ...

        print:
            type: print

        encode:
        ...
  1. Execute command with --module mypdp and you can see the processed DataFrame after running drop_columns.
$ pdp apply config.yml test.csv --module mypdp

Add a new command

You can also add new coomands not only stages.

  1. Add the following script to mypdp.py. This greet command prints out a greeting message with your name.
@pdpcli.Subcommand.register(
    name="greet",
    description="say hello",
    help="say hello",
)
class GreetCommand(pdpcli.Subcommand):
    requires_plugins = False

    def set_arguments(self):
        self.parser.add_argument("--name", default="world")

    def run(self, args):
        print(f"Hello, {args.name}!")
  1. To register this command, you need to create the.pdpcli_plugins file which module names are listed in for each line. Due to the module import order, the --module option is unavailable for the command registration.
$ echo "mypdp" > .pdpcli_plugins
  1. Run the following command and get the message like below. By using the .pdpcli_plugins file, it is is not needed to add the --module option to a command line for each execution.
$ pdp greet --name altescy
Hello, altescy!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pdpcli-0.3.0.tar.gz (16.2 kB view details)

Uploaded Source

Built Distribution

pdpcli-0.3.0-py3-none-any.whl (20.2 kB view details)

Uploaded Python 3

File details

Details for the file pdpcli-0.3.0.tar.gz.

File metadata

  • Download URL: pdpcli-0.3.0.tar.gz
  • Upload date:
  • Size: 16.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.1.4 CPython/3.9.2 Linux/5.4.0-1039-azure

File hashes

Hashes for pdpcli-0.3.0.tar.gz
Algorithm Hash digest
SHA256 b486ec119343fe5b00752f9eaec705e5ca18ea6fb183b79b231baf3d38347088
MD5 1cdfb828a62f4fd89f7df26dba707370
BLAKE2b-256 0c8a9f8b61ca541cf2a66923705c7a5079d4e950d98e7b548d25a97a9d0b9065

See more details on using hashes here.

File details

Details for the file pdpcli-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: pdpcli-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 20.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.1.4 CPython/3.9.2 Linux/5.4.0-1039-azure

File hashes

Hashes for pdpcli-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 4f174ef324e041b3186726132d20cd57c4f0bbed08890455998f599c8a956ceb
MD5 ddcb0f652c9e6b69f1e61d7e41c59252
BLAKE2b-256 d9dc320bb4669952a2f99f2cd44283318e9f348e86b8416e1aaa8ccdcaf4b3aa

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page