Principled Data Processing CLI tool

pdp: Principled Data Processing

A command-line tool for reproducible data analysis workflows.

Principled Data Processing (PDP) is a reproducible and readable way to organize data analyses. pdp is a command-line tool that simplifies working with PDP projects. It is designed to be opinionated (see Principles below) yet easy to bolt onto existing PDP projects (see Usage).

Principles

  1. Projects are separated into tasks, which are folders in the filesystem. A task is either a collection of subtasks, which are themselves subdirectories of the task, or an atomic task, which contains no further subtasks.

  2. Atomic tasks contain folders for input (input data for the task), src (source code), and output (where the task writes its outputs). Importantly, tasks only write to their output folders, and never read from their own outputs.

  3. Tasks have an entrypoint that allows them to be run with a single command. Usually this is make.
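
The layout described by these principles can be checked mechanically. The following is a minimal sketch, not part of pdp itself: the function names is_atomic_task and find_atomic_tasks are hypothetical, and the sketch assumes that a directory counts as an atomic task exactly when it contains input, src, and output folders.

```python
from pathlib import Path

# Folder names taken from principle 2 above.
ATOMIC_FOLDERS = ("input", "src", "output")


def is_atomic_task(task_dir: Path) -> bool:
    """Return True if task_dir contains input/, src/, and output/ folders."""
    return all((task_dir / name).is_dir() for name in ATOMIC_FOLDERS)


def find_atomic_tasks(root: Path) -> list[Path]:
    """Collect the atomic tasks at or below root, in sorted path order.

    Assumption: a task that is not atomic is a collection of subtasks,
    so its subdirectories are searched recursively. Hidden directories
    (for example .git) are skipped.
    """
    if is_atomic_task(root):
        return [root]
    found: list[Path] = []
    children = sorted(
        p for p in root.iterdir() if p.is_dir() and not p.name.startswith(".")
    )
    for child in children:
        found.extend(find_atomic_tasks(child))
    return found
```

Calling find_atomic_tasks(Path(".")) from a project root would list every atomic task in the project, which can be a quick way to spot a task that is missing one of its three folders.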

Install

Soon to be available on PyPI. In the meantime, clone the repository and run poetry install.

Requirements:

Usage

Editing an existing PDP project

  • Run pdp init from the root of a project. It creates a file called pdp.yml, which contains metadata about the project, and marks the project root.
  • Edit the tasks key in the pdp.yml file to list all the tasks in order.
  • Run pdp init again, to initialize the task.yml files within each of those tasks.
  • Edit each task.yml to designate the entrypoint, which would be make for most PDP projects.
  • Run pdp run from within a task to run that task. When run from the project root, it runs all tasks.
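
To illustrate what running all tasks in order amounts to, here is a minimal sketch. It is not pdp's implementation: the function run_tasks and its parameters are hypothetical, the ordered task list and the per-task entrypoints are passed in directly rather than read from pdp.yml and task.yml, and make is assumed as the default entrypoint.

```python
import shlex
import subprocess
from pathlib import Path


def run_tasks(project_root: Path, tasks: list[str], entrypoints: dict[str, str]) -> None:
    """Run each task's entrypoint command from inside that task's folder.

    tasks is the ordered list of task names (relative to project_root).
    entrypoints maps a task name to its command line; tasks without an
    entry fall back to "make" (an assumption for this sketch).
    """
    for name in tasks:
        command = shlex.split(entrypoints.get(name, "make"))
        # check=True stops the run at the first failing task, so later
        # tasks never consume incomplete outputs.
        subprocess.run(command, cwd=project_root / name, check=True)
```

Stopping at the first failure is a design choice of this sketch: because later tasks typically read the outputs of earlier ones, continuing after an error would risk building on stale or partial data.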

Starting a new project from scratch

  • Run pdp init from the root of a project. It creates a file called pdp.yml, which contains metadata about the project, and marks the project root.
  • Run pdp create <name_of_task1> <name_of_task2> ... <name_of_taskN> to create tasks. This creates a directory for each task, containing a task.yml configuration file and the src, input, and output folders.
  • Edit each task.yml to designate a specific command to run as an entrypoint, such as make.
  • Run pdp run from within a task to run that task. When run from the project root, it runs all tasks.
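
For readers who want to see what a freshly created task looks like on disk, the scaffolding step above can be approximated as follows. This is a sketch under stated assumptions, not pdp's own code: the function create_task is hypothetical, and the entrypoint key written to task.yml is an assumed name, since the actual task.yml schema is not documented here.

```python
from pathlib import Path


def create_task(project_root: Path, name: str, entrypoint: str = "make") -> Path:
    """Create a task folder with src/, input/, output/, and a task.yml file."""
    task_dir = project_root / name
    for folder in ("src", "input", "output"):
        (task_dir / folder).mkdir(parents=True, exist_ok=True)
    config = task_dir / "task.yml"
    if not config.exists():
        # The "entrypoint" key name is an assumption; pdp's real schema may differ.
        config.write_text(f"entrypoint: {entrypoint}\n")
    return task_dir
```

The function leaves an existing task.yml untouched, so calling it again on a task that has already been configured does not discard a hand-edited entrypoint.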

Additional commands

  • Run pdp tree to see the tree structure of all tasks.
  • Run pdp validate to validate the project configuration.

Contributing

All code should be tested, and formatted with black.

About

This work was developed with the support of a US Research Software Sustainability Institute early-career fellowship. Thanks to Patrick Ball, Bailey Passmore, and Tarak Shah for helpful discussions. Principled Data Processing is a framework developed by the Human Rights Data Analysis Group in the early 2000s to facilitate reproducibility in their forensic human rights work.
