Tell pandas what to do – easy tabular data I/O playbooks

Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language

Project description

Run Panda Run

:panda_face: :panda_face: :panda_face: :panda_face: :panda_face: :panda_face: :panda_face:

A simple interface, written in Python, for reproducible I/O workflows around tabular data. Transformations on a pandas DataFrame are specified via YAML "playbooks".

NOTICE

As of July 2023, this package only handles pandas transform logic and no longer does any data warehousing. See the archived version.

Quickstart

Install via pip

Specify your operations via yaml syntax:

read:
  uri: ./data.csv
  options:
    skiprows: 3

operations:
  - handler: DataFrame.rename
    options:
      columns:
        value: amount
  - handler: Series.map
    column: slug
    options:
      func: "lambda x: normality.slugify(x) if isinstance(x, str) else 'NO DATA'"

Store this as a file pandas.yml, and apply it to a data source:

cat data.csv | runpandarun pandas.yml > data_transformed.csv

Or, use within your python scripts:

from runpandarun import Playbook

play = Playbook.from_yaml("./pandas.yml")
df = play.run()  # get the transformed dataframe

# change playbook parameters at run time:
play.read.uri = "s3://my-bucket/data.csv"
df = play.run()
df.to_excel("./output.xlsx")

# the play can be applied directly to a data frame,
# this allows more granular control
df = get_my_data_from_somewhere_else()
df = play.run(df)
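
For orientation, the Quickstart playbook corresponds roughly to the plain pandas calls below. This is only a sketch of the equivalent logic, not how runpandarun is implemented. The in-memory CSV and the `slugify` helper are hypothetical stand-ins (the playbook itself uses `normality.slugify`), and the lambda assumes that only string values should be slugified.

```python
import io
import re

import pandas as pd


def slugify(value: str) -> str:
    """Hypothetical stand-in for normality.slugify: lowercase, dash-separated."""
    return re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")


# Hypothetical input: three lines of preamble before the header row,
# which is what `skiprows: 3` in the playbook skips.
raw = io.StringIO(
    "exported by some tool\n"
    "generated on 2023-07-19\n"
    "not part of the table\n"
    "slug,value\n"
    "Hello World,1\n"
    ",2\n"
)

# read: uri + options
df = pd.read_csv(raw, skiprows=3)

# operations[0]: DataFrame.rename
df = df.rename(columns={"value": "amount"})

# operations[1]: Series.map on the column "slug"
df["slug"] = df["slug"].map(
    lambda x: slugify(x) if isinstance(x, str) else "NO DATA"
)

print(df)
```

The empty `slug` cell is read as a missing value, so it takes the `'NO DATA'` branch.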

Installation

Requires at least Python 3.10. Using a virtualenv is recommended.

Additional dependencies (pandas et al.) will be installed automatically:

pip install runpandarun

After this, you should be able to execute in your terminal:

runpandarun --help

Reference

The playbook can be programmatically obtained in different ways:

from runpandarun import Playbook

# via yaml file
play = Playbook.from_yaml('./path/to/config.yml')

# via yaml string
play = Playbook.from_string("""
operations:
- handler: DataFrame.sort_values
  options:
    by: my_sort_column
""")

# directly via the Playbook object (which is a pydantic object)
play = Playbook(operations=[{
    "handler": "DataFrame.sort_values",
    "options": {"by": "my_sort_column"}
}])

All options within the Playbook are optional. If you apply an empty play to a DataFrame, it will just remain untouched (but runpandarun won't break).

The playbook has three sections:

read: instructions for reading in a source dataframe
operations: a list of functions with their options (kwargs) executed in the given order
write: instructions for saving a transformed dataframe to a target

Read and write

pandas can read and write from many local and remote sources and targets.

More information about handlers and their options: Pandas IO tools

For example, you could transform a source from S3 and write the result to an SFTP endpoint:

runpandarun pandas.yml -i s3://my_bucket/data.csv -o sftp://user@host/data.csv

You can override the uri values on the command line with -i / --in-uri and -o / --out-uri.

read:
  uri: s3://my-bucket/data.xls  # input uri, anything that pandas can read
  handler: read_excel           # default: guess by file extension, fallback: read_csv
  options:                      # options for the handler
    skiprows: 2

write:
  uri: ./data.xlsx              # output uri, anything that pandas can write to
  handler: write_excel          # default: guess by file extension, fallback: write_csv
  options:                      # options for the handler
    index: false
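
The "guess by file extension, fallback" behaviour mentioned in the comments above can be pictured with a small lookup. The mapping and the function name `guess_read_handler` are assumptions made for illustration. The package's real lookup may cover different extensions and work differently.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical mapping for illustration only.
READERS_BY_SUFFIX = {
    ".csv": "read_csv",
    ".xls": "read_excel",
    ".xlsx": "read_excel",
    ".json": "read_json",
    ".parquet": "read_parquet",
}


def guess_read_handler(uri: str, fallback: str = "read_csv") -> str:
    """Return the name of a pandas reader based on the uri's file extension."""
    # Only the path part is inspected, so query strings are ignored.
    suffix = PurePosixPath(urlparse(uri).path).suffix.lower()
    return READERS_BY_SUFFIX.get(suffix, fallback)


print(guess_read_handler("s3://my-bucket/data.xls"))
print(guess_read_handler("https://api.example.org/data?format=csv"))
```

A uri without a recognisable extension falls back to `read_csv`, which is why an explicit `handler` is useful for API endpoints and similar sources.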

Operations

The operations key of the yaml spec holds the transformations that should be applied to the data in order.

An operation can be any function from pd.DataFrame or pd.Series. Refer to their documentation to see the possible options (passed as **kwargs).

For the handler, specify the module path without a pd or pandas prefix, just DataFrame.<func> or Series.<func>. When using a function that applies to a Series, tell :panda_face: which one to use via the column prop.

operations:
  - handler: DataFrame.rename
    options:
      columns:
        value: amount

This is exactly equivalent to the following Python call on the processed dataframe:

df.rename(columns={"value": "amount"})
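
To make the mapping between a handler string and a pandas call more concrete, here is a minimal dispatcher sketch. The function `apply_operation` is hypothetical and is not the package's implementation. It only shows how `DataFrame.<func>` and `Series.<func>` (with a `column`) could be resolved with `getattr`.

```python
import pandas as pd


def apply_operation(df, handler, options=None, column=None):
    """Apply one playbook-style operation to a dataframe (illustrative only)."""
    kind, _, method = handler.partition(".")
    options = options or {}
    if kind == "DataFrame":
        # e.g. "DataFrame.rename" -> df.rename(**options)
        return getattr(df, method)(**options)
    if kind == "Series":
        target = df[column]
        # Walk dotted accessors such as "str.lower"
        for attr in method.split("."):
            target = getattr(target, attr)
        df = df.copy()
        df[column] = target(**options)
        return df
    raise ValueError(f"unsupported handler: {handler}")


df = pd.DataFrame({"value": [1, 2], "state": ["NY", "CA"]})
df = apply_operation(df, "DataFrame.rename", {"columns": {"value": "amount"}})
df = apply_operation(df, "Series.str.lower", column="state")
print(df)
```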

env vars

For API keys or other secrets, you can put environment variables anywhere in the config. They will simply be resolved via os.path.expandvars.

read:
  options:
    storage_options:
      header:
        "api-key": ${MY_API_KEY}

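A short demonstration of how `os.path.expandvars` treats such placeholders. The variable is set inside the script only for the sake of the example. Normally it would come from the shell environment.

```python
import os

# Set here only for the demonstration.
os.environ["MY_API_KEY"] = "s3cr3t"

expanded = os.path.expandvars("${MY_API_KEY}")
print(expanded)  # s3cr3t

# Variables that are not set are left unchanged, not replaced by an empty string.
os.environ.pop("NOT_DEFINED_ANYWHERE", None)
untouched = os.path.expandvars("${NOT_DEFINED_ANYWHERE}")
print(untouched)  # ${NOT_DEFINED_ANYWHERE}
```

Because unset variables pass through unchanged, a literal `${...}` reaching the data source is a hint that the variable was not exported.
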
Example

A full playbook example that covers a few of the possible cases.

See the yaml files in ./tests/fixtures/ for more.

read:
  uri: https://api.example.org/data?format=csv
  options:
    storage_options:
      header:
        "api-key": ${API_KEY}
    skipfooter: 1

operations:
  - handler: DataFrame.rename
    options:
      columns:
        value: amount

  - handler: Series.str.lower
    column: state

  - handler: DataFrame.assign
    options:
      city_id: "lambda x: x['state'] + '-' + x['city'].map(normality.slugify)"

  - handler: DataFrame.set_index
    options:
      keys:
        - city_id

  - handler: DataFrame.sort_values
    options:
      by:
        - state
        - city

write:
  uri: ftp://user:${FTP_PASSWORD}@host/data.csv
  options:
    index: false

How to...

Rename columns

DataFrame.rename

operations:
  - handler: DataFrame.rename
    options:
      columns:
        value: amount
        "First name": first_name

Apply modification to a column

Series.map

operations:
  - handler: Series.map
    column: my_column
    options:
      func: "lambda x: x.lower()"

Set an index

DataFrame.set_index

operations:
  - handler: DataFrame.set_index
    options:
      keys:
        - city_id

Sort values

DataFrame.sort_values

operations:
  - handler: DataFrame.sort_values
    options:
      by:
        - column1
        - column2
      ascending: false

De-duplicate

DataFrame.drop_duplicates

When using a subset of columns, use this in conjunction with sort_values to make sure the right records are kept.

operations:
  - handler: DataFrame.drop_duplicates
    options:
      subset:
        - column1
        - column2
      keep: last
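
Why sorting matters before de-duplicating on a subset can be seen in plain pandas. The `updated` and `amount` columns are hypothetical and only serve the example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "column1": ["a", "a", "b"],
        "column2": ["x", "x", "y"],
        "updated": ["2023-01-01", "2023-06-01", "2023-03-01"],
        "amount": [10, 20, 30],
    }
)

# Sort so the most recent record of each (column1, column2) pair comes last,
# then keep only that last record.
deduped = df.sort_values("updated").drop_duplicates(
    subset=["column1", "column2"], keep="last"
)
print(deduped)
```

Without the sort, `keep: last` would simply keep whichever duplicate happens to appear last in the input.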

Compute a new column based on existing data

DataFrame.assign

operations:
  - handler: DataFrame.assign
    options:
      city_id: "lambda x: x['state'] + '-' + x['city'].map(normality.slugify)"
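
In plain pandas, the same computation looks like the sketch below. The `slugify` function is a hypothetical stand-in for `normality.slugify`, and the sample rows are made up.

```python
import re

import pandas as pd


def slugify(value: str) -> str:
    """Hypothetical stand-in for normality.slugify: lowercase, dash-separated."""
    return re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")


df = pd.DataFrame({"state": ["ny", "ca"], "city": ["New York", "Los Angeles"]})

# The lambda receives the whole dataframe, so it can combine several columns.
df = df.assign(city_id=lambda x: x["state"] + "-" + x["city"].map(slugify))
print(df["city_id"].tolist())  # ['ny-new-york', 'ca-los-angeles']
```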

SQL

Pandas SQL io

read:
  uri: postgresql://user:password@host/database
  options:
    sql: "SELECT * FROM my_table WHERE category = 'A'"
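
To try the query logic without a PostgreSQL server, the sketch below runs the same SELECT against an in-memory SQLite database through `pd.read_sql`. The table contents are made up, and how runpandarun passes the uri on to pandas is not shown here.

```python
import sqlite3

import pandas as pd

# Hypothetical sample table in an in-memory SQLite database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE my_table (id INTEGER, category TEXT)")
con.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [(1, "A"), (2, "B"), (3, "A")],
)
con.commit()

# Same query as in the playbook options above.
df = pd.read_sql("SELECT * FROM my_table WHERE category = 'A'", con)
con.close()
print(df)
```

Note that pandas accepts a raw DBAPI connection only for SQLite. For a database uri such as the PostgreSQL one above, it relies on SQLAlchemy being installed.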

safe eval

Ok wait, you are executing arbitrary python code in the yaml specs?

Not really, there is a strict allow list of possible modules that can be used. See runpandarun.util.safe_eval

This includes:

any pandas or numpy modules
normality
fingerprints

So, this would, of course, NOT WORK (as tested here)

operations:
  - handler: DataFrame.apply
    options:
      func: "__import__('os').system('rm -rf /')"
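
The general idea of an allow list can be sketched with the standard library's `ast` module. This is not the code of `runpandarun.util.safe_eval`, and it is not a hardened sandbox. The allow list and the function name `checked_eval` are assumptions chosen to keep the example free of third-party imports.

```python
import ast
import math

# Hypothetical allow list for this sketch. According to the text above, the
# package's own list covers pandas, numpy, normality and fingerprints.
ALLOWED_NAMES = {"math": math, "str": str, "len": len, "isinstance": isinstance}


def checked_eval(expr: str):
    """Evaluate expr only if every name it uses is on the allow list."""
    tree = ast.parse(expr, mode="eval")
    # Names bound as lambda parameters may be used inside the expression.
    local_names = {
        arg.arg
        for node in ast.walk(tree)
        if isinstance(node, ast.arguments)
        for arg in node.args
    }
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if node.id not in ALLOWED_NAMES and node.id not in local_names:
                raise ValueError(f"name not allowed: {node.id}")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("_"):
            raise ValueError(f"attribute not allowed: {node.attr}")
    # Builtins are emptied so only the allowed names are reachable.
    scope = {"__builtins__": {}, **ALLOWED_NAMES}
    return eval(compile(tree, "<playbook>", "eval"), scope)


fn = checked_eval("lambda x: x.lower() if isinstance(x, str) else 'NO DATA'")
print(fn("ABC"))  # abc

try:
    checked_eval("__import__('os').system('echo unsafe')")
except ValueError as exc:
    print("rejected:", exc)
```

The malicious expression is rejected before anything is evaluated, because `__import__` is not on the allow list.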

development

Package is managed via Poetry

git clone https://github.com/investigativedata/runpandarun

Install requirements:

poetry install --with dev

Test:

make test

Funding

Since July 2023, this project is part of investigraph, and its development is funded by Media Tech Lab Bayern batch #3.
