Tool for preparing a dataset for publishing by dropping, renaming, scaling, and obfuscating columns defined in a recipe.
Data Release Preparation Tool
Warning: This tool is currently in beta and likely has bugs. Please use the issue tracker to report bugs or request features.
Description
Command-line tool for preparing a dataset for publishing by dropping, renaming, scaling, and obfuscating columns defined in a recipe.
After performing the operations defined in the recipe, the tool generates the transformed dataset and a CSV report listing the actions performed.
Installation
The tool can be installed using pip:
pip install drpt
Usage
CLI
Usage: drpt [OPTIONS] RECIPE_FILE INPUT_FILE

Options:
  -d, --dry-run           Generate only the report without the release dataset
  -v, --verbose           Verbose [Not implemented]
  -n, --nrows TEXT        Number of rows to read from a CSV file. Doesn't work
                          with parquet files.
  -l, --limits-file PATH  Limits file
  -o, --output-dir PATH   Output directory. The default output directory is
                          the same as the location of the recipe_file.
  --version               Show the version and exit.
  --help                  Show this message and exit.
Recipe Definition
Overview
The recipe is a JSON-formatted file that specifies which operations should be performed on the dataset. For versioning purposes, the recipe also contains a version key, which is appended to the generated filenames and recorded in the report.
Default recipe:
{
  "version": "",
  "actions": {
    "drop": [],
    "drop-constant-columns": false,
    "obfuscate": [],
    "disable-scaling": false,
    "skip-scaling": [],
    "rename": []
  }
}
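The example recipe later in this README omits some keys (such as disable-scaling), which suggests that missing keys fall back to the defaults above. As a minimal sketch of that idea, the snippet below fills in missing keys from the default recipe. The helper name `with_defaults` is hypothetical, and whether drpt merges recipes this way internally is an assumption.

```python
import json

# Default recipe, copied from the listing above.
DEFAULT_RECIPE = {
    "version": "",
    "actions": {
        "drop": [],
        "drop-constant-columns": False,
        "obfuscate": [],
        "disable-scaling": False,
        "skip-scaling": [],
        "rename": [],
    },
}


def with_defaults(recipe_text):
    """Parse a JSON recipe and fill in missing keys (hypothetical helper)."""
    user = json.loads(recipe_text)
    return {
        "version": user.get("version", DEFAULT_RECIPE["version"]),
        # Keys given by the user override the defaults; the rest are kept.
        "actions": {**DEFAULT_RECIPE["actions"], **user.get("actions", {})},
    }


recipe = with_defaults('{"version": "1.0", "actions": {"drop": ["test2"]}}')
print(recipe["actions"]["drop"])             # ['test2']
print(recipe["actions"]["disable-scaling"])  # False
```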
The currently supported actions, performed in this order, are as follows:
- drop: Column deletion
- drop-constant-columns: Drops all columns that contain only one unique value
- obfuscate: Column obfuscation, where the listed columns are treated as categorical variables and then integer coded
- Scaling: By default all columns are Min/Max scaled
  - disable-scaling: Can be used to disable scaling for all columns
  - skip-scaling: Excludes the listed columns from the default Min/Max scaling
- rename: Column renaming
All column definitions above support regular expressions.
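To illustrate how a regular expression in a column definition can select several columns, here is a minimal sketch using Python's `re` module. The function name `select_columns` is hypothetical, and the use of full-string matching is an assumption; it is consistent with the example in this README, where "test2" does not also drop "foo.bar.test2".

```python
import re


def select_columns(patterns, columns):
    """Return the columns fully matched by any pattern, in column order."""
    return [c for c in columns if any(re.fullmatch(p, c) for p in patterns)]


columns = ["test1", "test2", "test3", "test8", "test9", "foo.bar.test2"]
print(select_columns(["test2", "test[8-9]"], columns))
# ['test2', 'test8', 'test9']
```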
Actions
drop
The drop action is defined as a list of column names to be dropped.
drop-constant-columns
This is a boolean action which, when set to true, drops all columns that have only a single unique value.
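The check behind this action can be sketched in a few lines. The table is represented here as a plain dictionary of column lists, which is an illustrative choice and not how drpt stores data; `constant_columns` is a hypothetical name.

```python
def constant_columns(table):
    """Return the names of columns whose values are all identical."""
    return [name for name, values in table.items() if len(set(values)) == 1]


# Two columns from the example dataset in this README.
table = {"test4": [2, 2, 4, 4], "const": [1, 1, 1, 1]}
print(constant_columns(table))  # ['const']
```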
obfuscate
The obfuscate action is defined as a list of column names to be obfuscated.
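Integer coding of a categorical column can be sketched as follows. Assigning codes in sorted order of the distinct values is an assumption; it reproduces the test3 column of the example in this README (one, two, three, two becomes 0, 2, 1, 2), but the tool's actual ordering rule is not documented here. The name `integer_code` is hypothetical.

```python
def integer_code(values):
    """Replace each value with the index of its category in sorted order."""
    categories = sorted(set(values))
    codes = {category: i for i, category in enumerate(categories)}
    return [codes[v] for v in values]


print(integer_code(["one", "two", "three", "two"]))  # [0, 2, 1, 2]
```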
disable-scaling, skip-scaling
By default, the tool Min/Max scales all numerical columns. This behavior can be disabled for all columns by setting the disable-scaling action to true. If scaling must be disabled for only some columns, list those columns in the skip-scaling action.
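Min/Max scaling maps each value x to (x - min) / (max - min), so the column minimum becomes 0 and the maximum becomes 1. The sketch below applies this to the test1 column from the example in this README. The function name `min_max_scale` is hypothetical, and returning zeros for a constant column is an illustrative choice to avoid division by zero, not documented drpt behavior.

```python
def min_max_scale(values):
    """Scale values linearly so that min maps to 0.0 and max maps to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption for this sketch only: a constant column maps to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_scale([1.1, 2.2, 3.3, 2.22])
print([round(x, 4) for x in scaled])  # [0.0, 0.5, 1.0, 0.5091]
```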
rename
The rename action is defined as a list of objects whose key is the original name (or regular expression) and whose value is the target name. When the target uses matched groups from the regular expression, they can be referenced by group number preceded by an escaped backslash (\\1); see the example below.
{
  //...
  "rename": [{"original_name": "target_name"}]
  //...
}
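The following sketch shows one way regex-based renaming with group references could work, using the rules and column names from the example in this README. It is not drpt's implementation: the function name `rename_columns`, the use of full-string matching, and the rule that targets shared by several columns get numbered suffixes (foo_1, foo_2) are assumptions inferred from the example output. Note that the JSON string "\\1" decodes to `\1`, which is written here as a Python raw string.

```python
import re
from collections import Counter


def rename_columns(rules, columns):
    """Map original column names to new names using regex rename rules."""
    mapping = {}
    for rule in rules:
        for pattern, target in rule.items():
            for col in columns:
                match = re.fullmatch(pattern, col)
                if match and col not in mapping:
                    # expand() substitutes group references such as \1.
                    mapping[col] = match.expand(target)
    # Number any target name that more than one column maps to.
    totals = Counter(mapping.values())
    seen = Counter()
    for col, new_name in mapping.items():
        if totals[new_name] > 1:
            seen[new_name] += 1
            mapping[col] = f"{new_name}_{seen[new_name]}"
    return mapping


rules = [
    {"test1": "test1_renamed"},
    {"test([3-4])": r"test\1_regex_renamed"},
    {"foo[.]bar[.].*": "foo"},
]
columns = ["test1", "test3", "test4", "test5", "foo.bar.test", "foo.bar.test2"]
result = rename_columns(rules, columns)
for old, new in result.items():
    print(old, "->", new)
```

Columns that match no rule, such as test5, are left out of the mapping and keep their original names.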
Example
Input CSV file:
test1,test2,test3,test4,test5,test6,test7,test8,test9,foo.bar.test,foo.bar.test2,const
1.1,1,one,2,0.234,0.3,-1,a,e,1,1,1
2.2,2,two,2,0.555,0.4,0,b,f,2,2,1
3.3,3,three,4,0.1,5,1,c,g,3,3,1
2.22,2,two,4,1,0,2.5,d,h,4,4,1
Recipe:
{
  "version": "1.0",
  "actions": {
    "drop": ["test2", "test[8-9]"],
    "drop-constant-columns": true,
    "obfuscate": ["test3"],
    "skip-scaling": ["test4"],
    "rename": [
      { "test1": "test1_renamed" },
      { "test([3-4])": "test\\1_regex_renamed" },
      { "foo[.]bar[.].*": "foo" }
    ]
  }
}
Generated CSV file:
test1_renamed,test3_regex_renamed,test4_regex_renamed,test5,test6,test7,foo_1,foo_2
0.0,0,2,0.1488888888888889,0.06,0.0,0.0,0.0
0.5000000000000001,2,2,0.5055555555555556,0.08,0.2857142857142857,0.3333333333333333,0.3333333333333333
1.0,1,4,0.0,1.0,0.5714285714285714,0.6666666666666666,0.6666666666666666
0.5090909090909091,2,4,1.0,0.0,1.0,1.0,1.0
Report:
,action,column,details
0,recipe_version,,1.0
1,drpt_version,,0.2.8
2,DROP,test2,
3,DROP,test8,
4,DROP,test9,
5,DROP_CONSTANT,const,
6,OBFUSCATE,test3,
7,SCALE_DEFAULT,test1,"[1.1,3.3]"
8,SCALE_DEFAULT,test5,"[0.1,1.0]"
9,SCALE_DEFAULT,test6,"[0.0,5.0]"
10,SCALE_DEFAULT,test7,"[-1.0,2.5]"
11,SCALE_DEFAULT,foo.bar.test,"[1,4]"
12,SCALE_DEFAULT,foo.bar.test2,"[1,4]"
13,RENAME,test1,test1_renamed
14,RENAME,test3,test3_regex_renamed
15,RENAME,test4,test4_regex_renamed
16,RENAME,foo.bar.test,foo_1
17,RENAME,foo.bar.test2,foo_2
Thanks
This tool was made possible with Pandas, PyArrow, jsonschema, and of course Python.