Principled Data Processing CLI tool
pdp: Principled Data Processing
A command-line tool for reproducible data analysis workflows.
Principled Data Processing is a reproducible and readable way to organize data analyses.
pdp is a command-line tool that makes using Principled Data Processing very simple.
This tool is designed to be both opinionated (see Principles below) and easy to bolt onto existing PDP projects (see Usage).
Principles
- Projects are separated into tasks, which are folders in the filesystem. A task is either a collection of subtasks, which are themselves subdirectories in the task, or an atomic task, which contains no further subtasks.
- Atomic tasks contain folders for `input` (input data for the task), `src` (source code), and `output` (where the task writes its outputs). Importantly, tasks only write to their output folders, and never read from their own outputs.
- Tasks have an entrypoint that allows them to be run with a single command. Usually this is `make`.
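To make the task layout concrete, here is a minimal Python sketch that tells atomic tasks apart from tasks made of subtasks. It is an illustration of the principles only, not how pdp itself is implemented: the helper names (`is_atomic`, `subtasks`, `build_demo_project`) and the task names (`import`, `clean`, `dedupe`, `normalize`) are hypothetical, and treating "has `input`, `src`, and `output` folders" as the test for an atomic task is an assumption.

```python
import tempfile
from pathlib import Path

# Folder names taken from the principles above.
ATOMIC_FOLDERS = ("input", "src", "output")


def is_atomic(task_dir: Path) -> bool:
    """Return True if the task has all three atomic-task folders (assumed test)."""
    return all((task_dir / name).is_dir() for name in ATOMIC_FOLDERS)


def subtasks(task_dir: Path) -> list[Path]:
    """Return the subdirectories of a non-atomic task, sorted by name."""
    if is_atomic(task_dir):
        return []
    return sorted(p for p in task_dir.iterdir() if p.is_dir())


def build_demo_project(root: Path) -> None:
    """Create a hypothetical project: one atomic task and one task with two subtasks."""
    for task in ("import", "clean/dedupe", "clean/normalize"):
        for name in ATOMIC_FOLDERS:
            (root / task / name).mkdir(parents=True)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build_demo_project(root)
    atomic_import = is_atomic(root / "import")
    atomic_clean = is_atomic(root / "clean")
    clean_children = [p.name for p in subtasks(root / "clean")]
    print(atomic_import, atomic_clean, clean_children)
    # prints: True False ['dedupe', 'normalize']
```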
Install
Soon to be available on PyPI. Meanwhile, clone the repository and run `poetry install`.
Requirements:
- Python 3.11+
- poetry
Usage
Editing an existing PDP project
- Run `pdp init` from the root of a project. It creates a file called `pdp.yml`, which contains metadata about the project, and marks the project root.
- Edit the `tasks` key in the `pdp.yml` file to list all the tasks in order.
- Run `pdp init` again, to initialize the `task.yml` files within each of those tasks.
- Edit each `task.yml` to designate the entrypoint, which would be `make` for most PDP projects.
- Run `pdp run` from within a task to run that task. If in the project root, this runs all tasks.
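The idea of running every task in its listed order through a single entrypoint can be sketched in plain Python. This is a rough model of the behaviour described in the steps above, not pdp's own code: `run_task` and `run_all` are hypothetical names, the task list and entrypoints are passed in directly rather than read from `pdp.yml` or `task.yml`, and a small Python command stands in for `make` so the example runs anywhere.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_task(task_dir: Path, entrypoint: list[str]) -> None:
    """Run one task's entrypoint from inside its directory; raise if it fails."""
    subprocess.run(entrypoint, cwd=task_dir, check=True)


def run_all(root: Path, tasks: list[str], entrypoints: dict[str, list[str]]) -> None:
    """Run tasks in the listed order, stopping at the first failure."""
    for name in tasks:
        run_task(root / name, entrypoints[name])


# Stand-in entrypoint (instead of `make`): write a marker file into output/.
write_output = [
    sys.executable,
    "-c",
    "from pathlib import Path; Path('output/done.txt').write_text('ok')",
]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    tasks = ["import", "clean"]  # hypothetical task names, in run order
    for name in tasks:
        (root / name / "output").mkdir(parents=True)
    run_all(root, tasks, {name: write_output for name in tasks})
    results = [(root / name / "output" / "done.txt").read_text() for name in tasks]
    print(results)  # prints: ['ok', 'ok']
```

Because each entrypoint runs with the task directory as its working directory, the task writes only into its own `output` folder, in line with the principles.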
Starting a new project from scratch
- Run `pdp init` from the root of a project. It creates a file called `pdp.yml`, which contains metadata about the project, and marks the project root.
- Run `pdp create <name_of_task1> <name_of_task2> ... <name_of_taskN>` to create tasks. This creates directories for each task, a `task.yml` configuration file, as well as the `src`, `input`, and `output` folders within that task.
- Edit each `task.yml` to designate a specific command to run as an entrypoint, such as `make`.
- Run `pdp run` from within a task to run that task. If in the project root, this runs all tasks.
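For a sense of what the task scaffolding described above leaves on disk, the following Python sketch builds the same folder layout by hand. It is only an approximation under stated assumptions: `create_task` is a hypothetical helper, the task names are made up, and `task.yml` is created as an empty placeholder because the file's real contents are defined by pdp and are not shown here.

```python
import tempfile
from pathlib import Path


def create_task(root: Path, name: str) -> Path:
    """Create a task directory with src, input, and output folders."""
    task_dir = root / name
    for folder in ("src", "input", "output"):
        (task_dir / folder).mkdir(parents=True, exist_ok=True)
    # Empty placeholder only; the real task.yml schema belongs to pdp.
    (task_dir / "task.yml").touch()
    return task_dir


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    created = [create_task(root, name) for name in ("import", "clean")]
    layout = {task.name: sorted(p.name for p in task.iterdir()) for task in created}
    print(layout)
    # prints: {'import': ['input', 'output', 'src', 'task.yml'],
    #          'clean': ['input', 'output', 'src', 'task.yml']}
```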
Additional commands
- Run `pdp tree` to see the tree structure of all tasks.
- Run `pdp validate` to validate the project configuration.
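As a rough picture of what a task tree looks like, this Python sketch walks a project folder and prints nested tasks with indentation. The output format is invented for illustration and is not claimed to match what `pdp tree` prints; `render_tree` and `is_atomic` are hypothetical helpers, and an atomic task is assumed to be any folder containing `input`, `src`, and `output`.

```python
import tempfile
from pathlib import Path

ATOMIC_FOLDERS = ("input", "src", "output")


def is_atomic(task_dir: Path) -> bool:
    """Assumed test: an atomic task has input, src, and output folders."""
    return all((task_dir / name).is_dir() for name in ATOMIC_FOLDERS)


def render_tree(task_dir: Path, depth: int = 0) -> list[str]:
    """Return one indented line per task, descending only into non-atomic tasks."""
    lines = ["  " * depth + task_dir.name]
    if not is_atomic(task_dir):
        for child in sorted(p for p in task_dir.iterdir() if p.is_dir()):
            lines.extend(render_tree(child, depth + 1))
    return lines


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "project"
    # Hypothetical project: one atomic task and one task with two subtasks.
    for task in ("import", "clean/dedupe", "clean/normalize"):
        for name in ATOMIC_FOLDERS:
            (root / task / name).mkdir(parents=True)
    tree_lines = render_tree(root)
    print("\n".join(tree_lines))
    # prints:
    # project
    #   clean
    #     dedupe
    #     normalize
    #   import
```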
Contributing
All code should be tested, and formatted using `black`.
About
This work was developed with the support of a US Research Software Sustainability Institute early-career fellowship. Thanks to Patrick Ball, Bailey Passmore, and Tarak Shah for helpful discussions. Principled Data Processing is a framework developed by the Human Rights Data Analysis Group in the early 2000s to facilitate reproducibility in their forensic human rights work.
File details
Details for the file pdp_helper-0.1.0.tar.gz.
File metadata
- Download URL: pdp_helper-0.1.0.tar.gz
- Upload date:
- Size: 21.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 50c105b7b9597fdea3af2228c4c9e729936e3083c8ed94342367e21593504b62 |
| MD5 | 16995fcc02e646b2d2cf9964854cd0ff |
| BLAKE2b-256 | 6afea0c6fd240da6f78c0f314cb8a53ed1dfe3205209859bfbd1b8cb60a07483 |
File details
Details for the file pdp_helper-0.1.0-py3-none-any.whl.
File metadata
- Download URL: pdp_helper-0.1.0-py3-none-any.whl
- Upload date:
- Size: 22.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ed5494e0830e6798790262b506ea09b7c1ddb91be88dacd6cb2091417ea3fa72 |
| MD5 | 6f8e60fc698e6e495b3acca4563e8bd5 |
| BLAKE2b-256 | 9ff7e335173fdef4bea76e0568809b19ca8213cf9f444389dc28360084e19b50 |