Tool for preparing a dataset for publishing by dropping, renaming, scaling, and obfuscating columns defined in a recipe.

Data Release Preparation Tool

:warning: This tool is currently in beta and likely has many bugs. Please use the issue tracker to report any bugs or feature requests.

Description

Command-line tool for preparing a dataset for publishing by dropping, renaming, scaling, and obfuscating columns defined in a recipe.

After performing the operations defined in the recipe, the tool generates the transformed version of the dataset and a CSV report listing the actions performed.

Installation

The tool can be installed using pip:

pip install drpt

Usage

CLI

Usage: drpt [OPTIONS] RECIPE_FILE INPUT_FILE

Options:
  -d, --dry-run           Generate only the report without the release dataset
  -v, --verbose           Verbose [Not implemented]
  -n, --nrows TEXT        Number of rows to read from a CSV file. Doesn't work
                          with parquet files.
  -l, --limits-file PATH  Limits file
  -o, --output-dir PATH   Output directory. The default output directory is
                          the same as the location of the recipe_file.
  --version               Show the version and exit.
  --help                  Show this message and exit.
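
As an illustration of the usage line above, the following Python sketch assembles a typical invocation. The helper `build_drpt_command` and the file names `recipe.json`, `data.csv`, and `release` are hypothetical; only options listed in the help text are used, and the command is printed rather than executed.

```python
import shlex


def build_drpt_command(recipe_file, input_file, output_dir=None, dry_run=False):
    """Assemble a drpt argument list (hypothetical helper, not part of drpt)."""
    cmd = ["drpt"]
    if dry_run:
        # -d / --dry-run: generate only the report, not the release dataset
        cmd.append("--dry-run")
    if output_dir is not None:
        # -o / --output-dir: where the generated files are written
        cmd += ["--output-dir", output_dir]
    # Positional arguments come last: RECIPE_FILE, then INPUT_FILE
    cmd += [recipe_file, input_file]
    return cmd


cmd = build_drpt_command("recipe.json", "data.csv", output_dir="release", dry_run=True)
print(shlex.join(cmd))  # drpt --dry-run --output-dir release recipe.json data.csv
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once drpt is installed.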

Recipe Definition

Overview

The recipe is a JSON file that specifies which operations to perform on the dataset. For versioning purposes, the recipe also contains a version key, whose value is appended to the generated filenames and recorded in the report.

Default recipe:

{
  "version": "",
  "actions": {
    "drop": [],
    "drop-constant-columns": false,
    "obfuscate": [],
    "disable-scaling": false,
    "skip-scaling": [],
    "rename": []
  }
}
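
A recipe can also be produced programmatically. The sketch below builds one as a Python dictionary with the same keys as the default recipe and serializes it with the standard `json` module. The column names are illustrative only.

```python
import json

# Same keys as the default recipe above; the column names are made up.
recipe = {
    "version": "1.0",
    "actions": {
        "drop": ["test2"],
        "drop-constant-columns": True,
        "obfuscate": ["test3"],
        "disable-scaling": False,
        "skip-scaling": ["test4"],
        "rename": [{"test1": "test1_renamed"}],
    },
}

# Python booleans become JSON true/false when serialized.
recipe_text = json.dumps(recipe, indent=2)
print(recipe_text)
```

Writing `recipe_text` to a file gives a recipe that can be passed to the tool as RECIPE_FILE.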

The currently supported actions, performed in this order, are as follows:

  • drop: Column deletion
  • drop-constant-columns: Drops all columns that contain only one unique value
  • obfuscate: Column obfuscation, where the listed columns are treated as categorical variables and then integer coded
  • Scaling: By default, all numerical columns are Min/Max scaled
    • disable-scaling: Disables scaling for all columns
    • skip-scaling: Excludes the listed columns from scaling
  • rename: Column renaming

All column definitions above support regular expressions.
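
To make the regular expression support concrete, here is a minimal sketch of how column definitions might be resolved against column names. It assumes each pattern must match the whole column name (`re.fullmatch`); the tool's actual matching rule is not documented here, and `select_columns` is a hypothetical helper.

```python
import re


def select_columns(columns, patterns):
    """Return the columns whose full name matches any pattern (assumed rule)."""
    selected = []
    for name in columns:
        if any(re.fullmatch(pattern, name) for pattern in patterns):
            selected.append(name)
    return selected


columns = ["test1", "test2", "test8", "test9", "foo.bar.test2"]
dropped = select_columns(columns, ["test2", "test[8-9]"])
print(dropped)  # ['test2', 'test8', 'test9']
```

Under this assumption the plain name `test2` does not select `foo.bar.test2`, which agrees with the report in the example further down.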

Actions

drop

The drop action is defined as a list of column names to be dropped.

drop-constant-columns

This is a boolean action. When set to true, all columns that have only a single unique value are dropped.
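
The idea can be sketched in plain Python as follows. This is not the tool's implementation (which builds on Pandas); the table is represented as a dictionary of column lists and `constant_columns` is a hypothetical helper.

```python
def constant_columns(table):
    """Return the names of columns that hold a single unique value."""
    return [name for name, values in table.items() if len(set(values)) <= 1]


table = {
    "test4": [2, 2, 4, 4],
    "const": [1, 1, 1, 1],
}

to_drop = constant_columns(table)
remaining = {name: values for name, values in table.items() if name not in to_drop}
print(to_drop)          # ['const']
print(list(remaining))  # ['test4']
```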

obfuscate

The obfuscate action is defined as a list of column names to be obfuscated.
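
Integer coding can be illustrated with a short sketch. It assumes that codes are assigned in the sorted order of the distinct values, an assumption chosen because it reproduces the test3 column of the example further down; `integer_code` is a hypothetical helper, not the tool's API.

```python
def integer_code(values):
    """Replace each value with the index of its category (sorted order assumed)."""
    categories = sorted(set(values))
    codes = {category: index for index, category in enumerate(categories)}
    return [codes[value] for value in values]


print(integer_code(["one", "two", "three", "two"]))  # [0, 2, 1, 2]
```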

disable-scaling, skip-scaling

By default, the tool Min/Max scales all numerical columns. This behavior can be disabled for all columns by setting the disable-scaling action to true. To disable scaling for only some columns, list their names under the skip-scaling action.
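
Min/Max scaling maps each value x to (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1. A minimal sketch follows; `min_max_scale` is a hypothetical helper, and returning zeros for a constant column is an assumption made here to avoid division by zero, not documented behavior of the tool.

```python
def min_max_scale(values):
    """Scale values linearly so that min maps to 0.0 and max maps to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column has no range to scale over.
        return [0.0 for _ in values]
    return [(value - lo) / (hi - lo) for value in values]


# The test5 column from the example further down.
scaled = min_max_scale([0.234, 0.555, 0.1, 1])
print(scaled)
```

The first value comes out at roughly 0.149, in line with the test5 column of the generated CSV file in the example.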

rename

The rename action is defined as a list of objects, each of which maps an original name (or regular expression) to a target name. When the target uses matched groups from the regular expression, each group is referenced by its number preceded by a backslash. Because the recipe is JSON, the backslash itself must be escaped (\\1) [see example below].

{
  //...
  "rename": [{"original_name": "target_name"}]
  //...
}
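
The escaping can be checked with Python's `json` and `re` modules. In the sketch below, the JSON text `\\1` decodes to the two characters `\1`, which `re.sub` then treats as a reference to the first matched group. Whole-name matching and the helper `rename_column` are assumptions for illustration.

```python
import json
import re

# Raw string, so the JSON text contains the escaped backslash (\\1).
rename_json = r'[{"test([3-4])": "test\\1_regex_renamed"}]'
rules = json.loads(rename_json)


def rename_column(name, rules):
    """Apply the first rule whose pattern matches the whole name (assumed rule)."""
    for rule in rules:
        for pattern, target in rule.items():
            if re.fullmatch(pattern, name):
                return re.sub(pattern, target, name)
    return name


print(rename_column("test3", rules))  # test3_regex_renamed
print(rename_column("test5", rules))  # test5
```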

Example

Input CSV file:

test1,test2,test3,test4,test5,test6,test7,test8,test9,foo.bar.test,foo.bar.test2,const
1.1,1,one,2,0.234,0.3,-1,a,e,1,1,1
2.2,2,two,2,0.555,0.4,0,b,f,2,2,1
3.3,3,three,4,0.1,5,1,c,g,3,3,1
2.22,2,two,4,1,0,2.5,d,h,4,4,1

Recipe:

{
  "version": "1.0",
  "actions": {
    "drop": ["test2", "test[8-9]"],
    "drop-constant-columns": true,
    "obfuscate": ["test3"],
    "skip-scaling": ["test4"],
    "rename": [
      { "test1": "test1_renamed" },
      { "test([3-4])": "test\\1_regex_renamed" },
      { "foo[.]bar[.].*": "foo" }
    ]
  }
}

Generated CSV file:

test1_renamed,test3_regex_renamed,test4_regex_renamed,test5,test6,test7,foo_1,foo_2
0.0,0,2,0.1488888888888889,0.06,0.0,0.0,0.0
0.5000000000000001,2,2,0.5055555555555556,0.08,0.2857142857142857,0.3333333333333333,0.3333333333333333
1.0,1,4,0.0,1.0,0.5714285714285714,0.6666666666666666,0.6666666666666666
0.5090909090909091,2,4,1.0,0.0,1.0,1.0,1.0

Report:

,action,column,details
0,recipe_version,,1.0
1,drpt_version,,0.2.8
2,DROP,test2,
3,DROP,test8,
4,DROP,test9,
5,DROP_CONSTANT,const,
6,OBFUSCATE,test3,
7,SCALE_DEFAULT,test1,"[1.1,3.3]"
8,SCALE_DEFAULT,test5,"[0.1,1.0]"
9,SCALE_DEFAULT,test6,"[0.0,5.0]"
10,SCALE_DEFAULT,test7,"[-1.0,2.5]"
11,SCALE_DEFAULT,foo.bar.test,"[1,4]"
12,SCALE_DEFAULT,foo.bar.test2,"[1,4]"
13,RENAME,test1,test1_renamed
14,RENAME,test3,test3_regex_renamed
15,RENAME,test4,test4_regex_renamed
16,RENAME,foo.bar.test,foo_1
17,RENAME,foo.bar.test2,foo_2
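
Since the report is a plain CSV file, it can be inspected with the standard library. The sketch below parses a few rows copied from the report above and collects the renames and the recorded scaling ranges. The variable names are illustrative.

```python
import csv
import io
import json

# A few rows copied from the report above; the first header cell is empty.
report_text = """,action,column,details
0,recipe_version,,1.0
2,DROP,test2,
7,SCALE_DEFAULT,test1,"[1.1,3.3]"
13,RENAME,test1,test1_renamed
"""

rows = list(csv.DictReader(io.StringIO(report_text)))

# Map each original column name to its new name.
renames = {row["column"]: row["details"] for row in rows if row["action"] == "RENAME"}

# The details of a SCALE_DEFAULT row hold the [min, max] range as a JSON list.
ranges = {
    row["column"]: json.loads(row["details"])
    for row in rows
    if row["action"] == "SCALE_DEFAULT"
}

print(renames)  # {'test1': 'test1_renamed'}
print(ranges)   # {'test1': [1.1, 3.3]}
```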

Thanks

This tool was made possible with Pandas, PyArrow, jsonschema, and of course Python.
