Utilities for preparing datasets for publication
dataset-prep
This Python package provides utilities for preparing datasets for publication, building on the Frictionless data framework and its corresponding Python package.
This package is currently in alpha status. It provides a script that generates field-level information from a Frictionless datapackage file for inclusion in a dataset readme (plain text) or an accompanying data dictionary (CSV). The script assumes you have already created a datapackage to describe your dataset.
Basic Usage
Install the package from PyPI using your preferred installer (pip or uv):
pip install dataset-prep
Run the dataset-readme-info script with a path to your datapackage file. The data files
referenced in the datapackage must be present at the path specified.
[!NOTE] We highly recommend running frictionless validate on your datapackage to ensure your dataset and your datapackage agree on the structure of your data!
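Because the script expects the data files referenced in the datapackage to be present, a quick pre-flight check can be useful. The following is a minimal sketch using only the Python standard library; it is not part of dataset-prep, and the function name missing_resource_files is hypothetical. It assumes each resource's path is given relative to the directory containing the datapackage file, and it does not replace frictionless validate, which checks the data itself.

```python
import json
from pathlib import Path


def missing_resource_files(datapackage_path):
    """Return resource paths listed in the descriptor that are not on disk.

    Hypothetical helper for illustration; not part of dataset-prep.
    """
    descriptor_file = Path(datapackage_path)
    descriptor = json.loads(descriptor_file.read_text(encoding="utf-8"))
    # Assumption: local paths are relative to the descriptor's directory
    base_dir = descriptor_file.parent

    missing = []
    for resource in descriptor.get("resources", []):
        # A resource "path" may be a single string or a list of strings
        paths = resource.get("path", [])
        if isinstance(paths, str):
            paths = [paths]
        for rel_path in paths:
            # Remote resources are skipped; only local files are checked
            if rel_path.startswith(("http://", "https://")):
                continue
            if not (base_dir / rel_path).is_file():
                missing.append(rel_path)
    return missing
```

Calling missing_resource_files("my-dataset/datapackage.json") returns an empty list when every referenced local file is in place.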
To generate a plain-text list of fields with the descriptions in the datapackage file:
dataset-readme-info my-dataset/datapackage.json
The script will output text content to the console, which can be copied and pasted into the readme for your dataset.
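To illustrate the kind of information involved, here is a rough sketch of how field names and descriptions could be pulled from a datapackage descriptor and rendered as plain text. This is not the script's actual implementation or output format: the function name field_lines, the indentation, and the example descriptor are all assumptions made for illustration.

```python
def field_lines(descriptor):
    """Render resource and field descriptions as plain text.

    Illustrative only; the layout is an assumption and may differ
    from what dataset-readme-info produces.
    """
    lines = []
    for resource in descriptor.get("resources", []):
        lines.append(resource.get("name", "(unnamed resource)"))
        schema = resource.get("schema", {})
        # A schema may also be given as a path to a separate file;
        # this sketch only handles inline schema objects.
        if not isinstance(schema, dict):
            continue
        for field in schema.get("fields", []):
            name = field.get("name", "")
            description = field.get("description", "")
            if description:
                lines.append(f"  - {name}: {description}")
            else:
                lines.append(f"  - {name}")
    return "\n".join(lines)


# Hypothetical descriptor, shaped like the contents of a datapackage.json
example = {
    "resources": [
        {
            "name": "members",
            "schema": {
                "fields": [
                    {
                        "name": "id",
                        "type": "string",
                        "description": "Unique member identifier",
                    },
                    {"name": "name", "type": "string"},
                ]
            },
        }
    ]
}

print(field_lines(example))
```

Fields without a description are listed by name only, which makes gaps in the datapackage documentation easy to spot before publishing.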
To generate a CSV data dictionary with field information (description, type, name) for each resource described in the datapackage file, specify the path where the file should be generated:
dataset-readme-info my-dataset/datapackage.json --data-dictionary my-dataset/datadictionary.csv
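As a rough picture of what a data dictionary of this kind contains, the sketch below writes one CSV row per field using the Python standard library. It is not the script's implementation: the function name write_data_dictionary and the column names and order (resource, name, type, description) are assumptions, and the generated file may be laid out differently.

```python
import csv
import io


def write_data_dictionary(descriptor, out_file):
    """Write one CSV row per field for every resource in the descriptor.

    Illustrative only; column names and order are assumptions.
    """
    writer = csv.DictWriter(
        out_file, fieldnames=["resource", "name", "type", "description"]
    )
    writer.writeheader()
    for resource in descriptor.get("resources", []):
        schema = resource.get("schema", {})
        # Only inline schema objects are handled in this sketch
        if not isinstance(schema, dict):
            continue
        for field in schema.get("fields", []):
            writer.writerow(
                {
                    "resource": resource.get("name", ""),
                    "name": field.get("name", ""),
                    "type": field.get("type", ""),
                    "description": field.get("description", ""),
                }
            )


# Hypothetical descriptor used to demonstrate the function
sample = {
    "resources": [
        {
            "name": "books",
            "schema": {
                "fields": [
                    {"name": "title", "type": "string", "description": "Book title"},
                    {"name": "year", "type": "integer"},
                ]
            },
        }
    ]
}

buffer = io.StringIO()
write_data_dictionary(sample, buffer)
print(buffer.getvalue())
```

When writing to a real file instead of an in-memory buffer, open it with newline="" as the csv module documentation recommends, for example open("datadictionary.csv", "w", newline="", encoding="utf-8").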
Use the -h or --help option for script usage.
Examples
The dataset-readme-info script is generalized from one that was used to help prepare datasets from the Shakespeare and Company Project for publication.
The 2.0 version of the data published in 2025 includes a CSV data dictionary:
Koeser, Rebecca Sutton & Kotin, Joshua. (2025). Shakespeare and Company Project Datasets [Data set]. Version 2. Princeton University. https://doi.org/10.34770/kf6c-b079
The 1.2 version of the data published in 2022 includes field details in the README:
Kotin, Joshua, Koeser, Rebecca Sutton, et al. (2022). Shakespeare and Company Project Dataset: Lending Library Members, Books, Events [Data set]. Version 1.2. Princeton University. https://doi.org/10.34770/dtqa-2981
License
This project is licensed under the Apache 2.0 License.
(c)2025 Trustees of Princeton University. Permission granted for non-commercial distribution online under a standard Open Source license.