Tell pandas what to do – easy tabular data I/O playbooks

Run Panda Run

:panda_face: :panda_face: :panda_face: :panda_face: :panda_face: :panda_face: :panda_face:

A simple interface, written in python, for reproducible i/o workflows around tabular data: transformations of a pandas DataFrame specified via yaml "playbooks".

Besides any pandas function that can alter data, datapatch is included as an additional, easier way to patch data.

NOTICE

As of July 2023, this package only handles pandas transform logic and no longer does any data warehousing. See the archived version.

Quickstart

Install via pip

Specify your operations via yaml syntax:

read:
  uri: ./data.csv
  options:
    skiprows: 3

operations:
  - handler: DataFrame.rename
    options:
      columns:
        value: amount
  - handler: Series.map
    column: slug
    options:
      func: "lambda x: normality.slugify(x) if isinstance(x, str) else 'NO DATA'"

Store this as a file pandas.yml and apply it to a data source:

cat data.csv | runpandarun pandas.yml > data_transformed.csv

Or use it within your python scripts:

from runpandarun import Playbook

play = Playbook.from_yaml("./pandas.yml")
df = play.run()  # get the transformed dataframe

# change playbook parameters at runtime:
play.read.uri = "s3://my-bucket/data.csv"
df = play.run()
df.to_excel("./output.xlsx")

# the play can be applied directly to a data frame,
# this allows more granular control
df = get_my_data_from_somewhere_else()
df = play.run(df)

Installation

Requires at least python 3.10. Using a virtualenv is recommended.

Additional dependencies (pandas et al.) will be installed automatically:

pip install runpandarun

After this, you should be able to execute in your terminal:

runpandarun --help

Reference

The playbook can be obtained programmatically in different ways:

from runpandarun import Playbook

# via yaml file
play = Playbook.from_yaml('./path/to/config.yml')

# via yaml string
play = Playbook.from_string("""
operations:
- handler: DataFrame.sort_values
  options:
    by: my_sort_column
""")

# directly via the Playbook object (which is a pydantic object)
play = Playbook(operations=[{
    "handler": "DataFrane.sort_values",
    "options": {"by": "my_sort_column"}
}])

All options within the Playbook are optional: if you apply an empty play to a DataFrame, the data just remains untouched (and runpandarun won't break).
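
For instance, the pass-through behavior can be verified directly (a minimal sketch):

from runpandarun import Playbook
import pandas as pd

df = pd.DataFrame({"value": [1, 2, 3]})
play = Playbook()  # empty playbook: no read, operations or write section
assert play.run(df).equals(df)  # the data passes through unchanged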

The playbook has three sections:

  • read: instructions for reading in a source dataframe
  • operations: a list of functions with their options (kwargs) executed in the given order
  • write: instructions for saving a transformed dataframe to a target

Read and write

pandas can read from and write to many local and remote sources and targets.

More information about handlers and their options: Pandas IO tools

For example, you could transform a source from s3 to an sftp endpoint:

runpandarun pandas.yml -i s3://my_bucket/data.csv -o sftp://user@host/data.csv

You can override the uri arguments on the command line with -i / --in-uri and -o / --out-uri.

read:
  uri: s3://my-bucket/data.xls  # input uri, anything that pandas can read
  handler: read_excel           # default: guess by file extension, fallback: read_csv
  options:                      # options for the handler
    skiprows: 2

write:
  uri: ./data.xlsx              # output uri, anything that pandas can write to
  handler: to_excel             # default: guess by file extension, fallback: to_csv
  options:                      # options for the handler
    index: false
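
Since the Playbook is a pydantic object, these sections can also be adjusted at runtime, just like play.read.uri in the quickstart (a sketch; assuming the attributes mirror the yaml keys):

from runpandarun import Playbook

play = Playbook.from_yaml("./pandas.yml")
play.read.handler = "read_json"  # assumption: handler is settable like uri
play.read.uri = "./data.json"
df = play.run()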

Operations

The operations key of the yaml spec holds the transformations that should be applied to the data in order.

An operation can be any function of pd.DataFrame or pd.Series. Refer to the pandas documentation to see their possible options (passed as **kwargs).

For the handler, specify the function path without a pd or pandas prefix, just DataFrame.<func> or Series.<func>. When using a function that applies to a Series, tell :panda_face: which one to use via the column prop.

operations:
  - handler: DataFrame.rename
    options:
      columns:
        value: amount

This exactly corresponds to the following python call on the processed dataframe:

df.rename(columns={"value": "amount"})
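
Likewise, a Series handler with a column prop transforms that single column; the Series.map operation from the quickstart roughly corresponds to this call (a sketch):

import normality

df["slug"] = df["slug"].map(
    lambda x: normality.slugify(x) if isinstance(x, str) else "NO DATA"
)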

Env vars

For api keys or other secrets, you can put environment variables anywhere into the config. They are simply resolved via os.path.expandvars.

read:
  options:
    storage_options:
      header:
        "api-key": ${MY_API_KEY}

Example

A full playbook example that covers a few of the possible cases.

See the yaml files in ./tests/fixtures/ for more.

read:
  uri: https://api.example.org/data?format=csv
  options:
    storage_options:
      header:
        "api-key": ${API_KEY}
    skipfooter: 1

operations:
  - handler: DataFrame.rename
    options:
      columns:
        value: amount

  - handler: Series.str.lower
    column: state

  - handler: DataFrame.assign
    options:
      city_id: "lambda x: x['state'] + '-' + x['city'].map(normality.slugify)"

  - handler: DataFrame.set_index
    options:
      keys:
        - city_id

  - handler: DataFrame.sort_values
    options:
      by:
        - state
        - city

patch:
  city:
    options:
      - match: Zarizri
        value: Zar1zr1

write:
  uri: ftp://user:${FTP_PASSWORD}@host/data.csv
  options:
    index: false
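
Assuming this playbook is stored as play.yml, it could be executed with the required secrets exported in the environment (hypothetical values):

export API_KEY=my-api-key
export FTP_PASSWORD=my-ftp-password
runpandarun play.yml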

How to...

Rename columns

DataFrame.rename

operations:
  - handler: DataFrame.rename
    options:
      columns:
        value: amount
        "First name": first_name

Apply a modification to a column

Series.map

operations:
  - handler: Series.map
    column: my_column
    options:
      func: "lambda x: x.lower()"

Set an index

DataFrame.set_index

operations:
  - handler: DataFrame.set_index
    options:
      keys:
        - city_id

Sort values

DataFrame.sort_values

operations:
  - handler: DataFrame.sort_values
    options:
      by:
        - column1
        - column2
      ascending: false

De-duplicate

DataFrame.drop_duplicates

When using a subset of columns, combine this with sort_values to make sure the right records are kept:

operations:
  - handler: DataFrame.drop_duplicates
    options:
      subset:
        - column1
        - column2
      keep: last
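
Combined with a preceding sort, this roughly corresponds to the following pandas call, where the sort order decides which record keep: last retains (a sketch; the updated_at column is hypothetical):

df = df.sort_values(by="updated_at").drop_duplicates(
    subset=["column1", "column2"], keep="last"
)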

Compute a new column based on existing data

DataFrame.assign

operations:
  - handler: DataFrame.assign
    options:
      city_id: "lambda x: x['state'] + '-' + x['city'].map(normality.slugify)"

SQL

Pandas SQL io

read:
  uri: postgresql://user:password@host/database
  options:
    sql: "SELECT * FROM my_table WHERE category = 'A'"

Patch data

Besides the pandas functions available via operations, datapatch is included as an additional, easier way to patch data.

Simply add a patch config to the yaml. Refer to the datapatch readme for details.

Patching is applied after all the operations have run.

patch:
  countries:
    normalize: true
    lowercase: true
    options:
      - match: Frankreich
        value: France
      - match:
          - Northkorea
          - Nordkorea
          - Northern Korea
          - NKorea
          - DPRK
        value: North Korea
      - contains: Britain
        value: Great Britain

Safe eval

Ok wait, you are executing arbitrary python code in the yaml specs?

Not really: there is a strict allow list of modules that can be used. See runpandarun.util.safe_eval.

This includes, for example, the normality module used in the snippets above.

So this would, of course, NOT WORK (as tested here):

operations:
  - handler: DataFrame.apply
    options:
      func: "__import__('os').system('rm -rf /')"

Development

The package is managed via Poetry.

git clone https://github.com/investigativedata/runpandarun

Install requirements:

poetry install --with dev

Test:

make test

Funding

Since July 2023, this project is part of investigraph, and its development is funded by

Media Tech Lab Bayern batch #3
