Schematized pipeline operations on dataframes

dfbridge

A schematized dataframe formatter.

We often need to reformat a base dataframe into one that follows a schema, through some combination of renaming columns, applying functions to others, and doing groupby/transform operations. Written by hand, these steps introduce a lot of boilerplate; here they can be declared as a dictionary schema instead. The original dataframe is left unchanged, and every operation is computed from the original dataframe only.
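For comparison, the hand-written version of these three kinds of steps might look like the following in plain pandas. This is only an illustrative sketch: the input data is made up, and the column names are placeholders rather than anything dfbridge requires.

```python
import pandas as pd

# Hypothetical input data, invented purely for illustration.
base = pd.DataFrame(
    {
        "original_name": ["a", "b", "a"],
        "groupby_column": ["x", "x", "y"],
        "return_column": [1.0, 3.0, 5.0],
    }
)

# Build the output in a separate frame so that `base` is never modified.
out = pd.DataFrame(index=base.index)

# 1. Simple rename: copy a column under a new name.
out["final_name1"] = base["original_name"]

# 2. Row-wise function: each value is computed from a whole input row.
out["final_name2"] = base.apply(lambda row: row["return_column"] * 2, axis=1)

# 3. Groupby/transform: the per-group mean, broadcast back to every row.
out["final_name3"] = base.groupby("groupby_column")["return_column"].transform("mean")
```

Each output column needs its own line of pandas code, and that per-column code is what a dictionary schema is meant to replace.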

Let's say we want the output dataframe to have columns final_name1, final_name2, and final_name3: one a simple rename of a column in the input dataframe, one the result of some function applied to the input dataframe, and one the result of a groupby/transform operation. We can also remap values to other values in the process. Setting fill_missing to True lets the column be added and filled with pandas NA values.

The schema to do this looks like:

schema = {
    "final_name1": {
        "type": "rename",
        "from": "original_name",
        "fill_missing": True,
        "column_type": None,
        "remap_dict": {"orig_val": "new_val"}, # Remaps elements with original val to new val. Set to None or ignore to not use.
        "strict_remap": True, # If True, values not in the remap_dict are made pd.NA, else are passed through intact.
    },
    "final_name2": {
        "type": "apply",
        "func": function, # Expects the whole row of the original dataframe, so use row['col'] style arguments.
        "fill_missing": True,
        "column_type": None,
        "remap_dict": None, # Remaps elements with original val to new val. Set to None or ignore to not use.
    },
    "final_name3": {
        "type": "transform",
        "groupby": "groupby_column",
        "column": "return_column",
        "action": "mean", # (or anything that works in df.groupby().transform())
        "fill_missing": True,
        "column_type": None,
    },
}
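To make the meaning of each key concrete, here is a minimal sketch of how a schema of this shape could be interpreted with plain pandas. The helper `apply_schema` is hypothetical and is not dfbridge's actual API or implementation. The sketch also assumes that fill_missing takes effect when the referenced input column does not exist, which the description above does not spell out.

```python
import pandas as pd


def apply_schema(df, schema):
    """Hypothetical helper (NOT dfbridge's API): build a new frame from `schema`."""
    out = pd.DataFrame(index=df.index)  # the input frame is never modified
    for name, spec in schema.items():
        kind = spec["type"]
        try:
            if kind == "rename":
                col = df[spec["from"]]
            elif kind == "apply":
                col = df.apply(spec["func"], axis=1)  # func receives a whole row
            elif kind == "transform":
                grouped = df.groupby(spec["groupby"])[spec["column"]]
                col = grouped.transform(spec["action"])
            else:
                raise ValueError(f"unknown type: {kind!r}")
        except KeyError:
            # Assumption: a missing input column triggers fill_missing.
            if not spec.get("fill_missing", False):
                raise
            col = pd.Series(pd.NA, index=df.index, dtype="object")

        remap = spec.get("remap_dict")
        if remap:
            known = col.isin(list(remap))
            mapped = col.map(remap)
            # strict: unmapped values become NA; otherwise they pass through.
            fallback = pd.NA if spec.get("strict_remap", False) else col
            col = mapped.where(known, fallback)

        if spec.get("column_type") is not None:
            col = col.astype(spec["column_type"])
        out[name] = col
    return out


# Made-up data and a reduced schema, for illustration only.
base = pd.DataFrame(
    {
        "original_name": ["a", "b", "a"],
        "groupby_column": ["x", "x", "y"],
        "return_column": [1.0, 3.0, 5.0],
    }
)
example_schema = {
    "final_name1": {"type": "rename", "from": "original_name",
                    "remap_dict": {"a": "A"}, "strict_remap": True},
    "final_name2": {"type": "apply", "func": lambda row: row["return_column"] * 2},
    "final_name3": {"type": "transform", "groupby": "groupby_column",
                    "column": "return_column", "action": "mean"},
    "final_name4": {"type": "rename", "from": "not_a_column", "fill_missing": True},
}
result = apply_schema(base, example_schema)
```

In this sketch every column is computed from the input frame rather than from earlier output columns, matching the statement that all operations take place on the original dataframe.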


