PdpCLI is a pandas DataFrame processing CLI tool which enables you to build a pandas pipeline from a configuration file.

PdpCLI

Introduction

PdpCLI is a pandas DataFrame processing CLI tool that builds a pandas pipeline, powered by pdpipe, from a configuration file. You can also extend pipeline stages and data readers / writers with your own Python scripts.

Features

  • Process pandas DataFrames from the CLI without writing Python scripts
  • Support multiple configuration file formats: YAML, JSON, Jsonnet
  • Read / write data files in the following formats: CSV, TSV, JSONL, XLSX
  • Extensible pipeline and data readers / writers
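
As Basic Usage notes, PdpCLI picks the output format from the file name. The sketch below shows one way such a lookup could work for the formats listed above. It is an illustration only: the FORMATS table and the detect_format function are hypothetical names, not part of PdpCLI's API.

```python
from pathlib import Path

# Hypothetical extension-to-format table covering the formats listed above.
FORMATS = {
    ".csv": "csv",
    ".tsv": "tsv",
    ".jsonl": "jsonl",
    ".xlsx": "xlsx",
}


def detect_format(file_name: str) -> str:
    """Return the data format implied by the file extension (case-insensitive)."""
    suffix = Path(file_name).suffix.lower()
    try:
        return FORMATS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file extension: {suffix!r}") from None


if __name__ == "__main__":
    for name in ("train.csv", "processed_test.jsonl"):
        print(name, "->", detect_format(name))
```

Keeping the mapping in a plain dictionary makes it easy to see which extensions are handled and to fail early on anything else.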

Tutorial

Basic Usage

  1. Write a pipeline config file, config.yml, like the one below. The type fields under pipeline correspond to the snake-cased class names of pdpipe's PdPipelineStage classes. The other fields, such as stages and columns, specify the parameters of the __init__ methods of the corresponding classes. Internally, colt converts this configuration file into Python objects.
pipeline:
  type: pipeline
  stages:
    drop_columns:
      type: col_drop
      columns: foo

    encode:
      type: one_hot_encode
      columns: sex

    tokenize:
      type: tokenize_text
      columns: profile

    vectorize:
      type: tfidf_vectorize_token_lists
      column: profile
  2. Build a pipeline by training on train.csv. The following command generates a pickled pipeline file, pipeline.pkl, after training.
$ pdp build config.yml pipeline.pkl --input-file train.csv
  3. Apply the fitted pipeline to test.csv and write the processed file processed_test.jsonl with the following command. PdpCLI automatically detects the output file format from the file name; in this example, the processed DataFrame is exported in JSONL format.
$ pdp apply pipeline.pkl test.csv --output-file processed_test.jsonl
  4. You can also run the pipeline directly from a config file if you don't need to fit it.
$ pdp apply config.yml test.csv --output-file processed_test.jsonl
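
To make the mapping from type fields to classes concrete, here is a minimal sketch of the idea in plain Python. It is not how colt or pdpipe is implemented: the REGISTRY, register, and build names are hypothetical, and ColDrop, OneHotEncode, and Pipeline are small stand-in dataclasses rather than the real pdpipe stages. It only shows how a snake-cased type name can select a class and how the remaining fields can become __init__ arguments.

```python
import re
from dataclasses import dataclass
from typing import Any, Dict, Type

# Hypothetical registry: snake-cased class name -> class.
REGISTRY: Dict[str, Type] = {}


def snake_case(name: str) -> str:
    """Convert a CamelCase class name to snake_case."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def register(cls):
    """Class decorator that records cls under its snake-cased name."""
    REGISTRY[snake_case(cls.__name__)] = cls
    return cls


# Stand-in classes for illustration only; they are NOT the pdpipe stages.
@register
@dataclass
class ColDrop:
    columns: Any


@register
@dataclass
class OneHotEncode:
    columns: Any


@register
@dataclass
class Pipeline:
    stages: Dict[str, Any]


def build(config: Any) -> Any:
    """Recursively turn dicts that carry a "type" key into objects."""
    if isinstance(config, dict):
        if "type" in config:
            # Every field except "type" becomes a keyword argument.
            params = {k: build(v) for k, v in config.items() if k != "type"}
            return REGISTRY[config["type"]](**params)
        return {k: build(v) for k, v in config.items()}
    if isinstance(config, list):
        return [build(v) for v in config]
    return config


# A shortened version of the config above, written as a Python dict.
config = {
    "type": "pipeline",
    "stages": {
        "drop_columns": {"type": "col_drop", "columns": "foo"},
        "encode": {"type": "one_hot_encode", "columns": "sex"},
    },
}
pipeline = build(config)

if __name__ == "__main__":
    print(pipeline)
```

With this scheme, supporting a new type value only requires defining and registering another class; the config file itself stays declarative.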
