
Utilities for preparing datasets for publication


dataset-prep

This Python package provides utilities for preparing datasets for publication, building on the Frictionless Data framework and its Python package.

This package is currently in alpha status. It provides a script that generates field-level information from a Frictionless datapackage file, for inclusion in a dataset readme (plain text) or an accompanying data dictionary (CSV). The script assumes you have already created a datapackage to describe your dataset.


Basic Usage

Install the package from PyPI using your preferred tool (pip or uv):

pip install dataset-prep

Run the dataset-readme-info script with the path to your datapackage file. The data files referenced in the datapackage must be present at the paths it specifies.

Note: We highly recommend running frictionless validate on your datapackage first, to confirm that your dataset and your datapackage agree on the structure of your data.
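To illustrate what "agreeing on structure" means, here is a minimal sketch that compares the header row of each CSV resource with the field names declared in its schema. It is not part of dataset-prep and is no substitute for frictionless validate, which checks far more (types, constraints, and so on). The function name header_mismatches is hypothetical, and the sketch assumes each resource has an inline schema and a single CSV path relative to the datapackage file.

```python
import csv
import json
from pathlib import Path


def header_mismatches(datapackage_path):
    """Compare each CSV resource's header row with its schema field names.

    Returns {resource name: (expected field names, header found)} for every
    resource whose header differs; an empty dict means all headers matched.
    """
    datapackage_path = Path(datapackage_path)
    package = json.loads(datapackage_path.read_text(encoding="utf-8"))
    mismatches = {}
    for resource in package.get("resources", []):
        schema = resource.get("schema")
        path = resource.get("path")
        # Assumption: inline schema and one relative path per resource;
        # anything else is skipped rather than guessed at.
        if not isinstance(schema, dict) or not isinstance(path, str):
            continue
        expected = [field["name"] for field in schema.get("fields", [])]
        # Resource paths are resolved relative to the datapackage file.
        data_file = datapackage_path.parent / path
        with open(data_file, newline="", encoding="utf-8") as handle:
            header = next(csv.reader(handle), [])
        if header != expected:
            mismatches[resource.get("name", path)] = (expected, header)
    return mismatches
```

Calling header_mismatches("my-dataset/datapackage.json") would return an empty dict when every header row matches its schema.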

To generate a plain-text list of fields with the descriptions in the datapackage file:

dataset-readme-info my-dataset/datapackage.json

The script will output text content to the console, which can be copied and pasted into the readme for your dataset.
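For readers curious where this information comes from, the sketch below walks a datapackage and builds a plain-text field list using only the standard library. It is an illustration, not the script's actual implementation: the output layout, the function names load_package and field_listing, and the sample data are all assumptions. It relies only on the Data Package layout of resources, each with a schema containing a list of fields.

```python
import json
from pathlib import Path


def load_package(path):
    """Read a datapackage JSON file into a plain dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def field_listing(package):
    """Return a plain-text list of fields, grouped by resource."""
    lines = []
    for resource in package.get("resources", []):
        lines.append(resource.get("name", "(unnamed resource)"))
        schema = resource.get("schema")
        # Assumption: the schema is inline; a schema given by reference is skipped.
        fields = schema.get("fields", []) if isinstance(schema, dict) else []
        for field in fields:
            entry = f"- {field['name']}"
            description = field.get("description")
            lines.append(f"{entry}: {description}" if description else entry)
        lines.append("")  # blank line between resources
    return "\n".join(lines).rstrip()


# Hypothetical datapackage content, for illustration only.
sample = {
    "resources": [
        {
            "name": "members",
            "path": "members.csv",
            "schema": {
                "fields": [
                    {"name": "id", "type": "string", "description": "Unique identifier"},
                    {"name": "birth_year", "type": "year"},
                ]
            },
        }
    ]
}

print(field_listing(sample))
```

To try it on a real file, pass the result of load_package("my-dataset/datapackage.json") to field_listing instead of the sample dict.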

To generate a CSV data dictionary with field information (description, type, name) for each resource described in the datapackage file, specify the path where the file should be generated:

dataset-readme-info my-dataset/datapackage.json --data-dictionary my-dataset/datadictionary.csv
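As a rough picture of what such a data dictionary contains, the sketch below flattens the fields of every resource into CSV rows with the standard library. The column names and their order are assumptions made for illustration and may differ from what the script produces; data_dictionary_rows and write_data_dictionary are hypothetical names, not part of dataset-prep.

```python
import csv
import io

# Assumed column layout; the real script's columns may differ.
COLUMNS = ["resource", "name", "type", "description"]


def data_dictionary_rows(package):
    """Return one row per field, across all resources in a datapackage dict."""
    rows = []
    for resource in package.get("resources", []):
        schema = resource.get("schema")
        if not isinstance(schema, dict):
            continue  # assumption: only inline schemas are handled
        for field in schema.get("fields", []):
            rows.append({
                "resource": resource.get("name", ""),
                "name": field.get("name", ""),
                "type": field.get("type", ""),
                "description": field.get("description", ""),
            })
    return rows


def write_data_dictionary(package, out):
    """Write the data dictionary as CSV to an open text file or buffer."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(data_dictionary_rows(package))


# Hypothetical datapackage content, for illustration only.
sample = {
    "resources": [
        {
            "name": "members",
            "schema": {
                "fields": [
                    {"name": "id", "type": "string", "description": "Unique identifier"},
                    {"name": "birth_year", "type": "year"},
                ]
            },
        }
    ]
}

buffer = io.StringIO()
write_data_dictionary(sample, buffer)
print(buffer.getvalue())
```

When writing to a file on disk rather than a buffer, open it with newline="" so the csv module controls line endings.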

Use the -h or --help option for script usage.

Examples

The dataset-readme-info script is generalized from one that was used to help prepare datasets from the Shakespeare and Company Project for publication.

The 2.0 version of the data published in 2025 includes a CSV data dictionary:

Koeser, Rebecca Sutton & Kotin, Joshua. (2025). Shakespeare and Company Project Datasets [Data set]. Version 2. Princeton University. https://doi.org/10.34770/kf6c-b079

The 1.2 version of the data published in 2022 includes field details in the README:

Kotin, Joshua, Koeser, Rebecca Sutton, et al. (2022). Shakespeare and Company Project Dataset: Lending Library Members, Books, Events [Data set]. Version 1.2. Princeton University. https://doi.org/10.34770/dtqa-2981

License

This project is licensed under the Apache 2.0 License.

(c) 2025 Trustees of Princeton University. Permission granted for non-commercial distribution online under a standard Open Source license.
