# provolone
A Python library to create and maintain provenance-related metadata while processing raw files into analysis datasets.
This package provides a complete pipeline for provenance-related metadata. You can create initial metadata for existing files, or attach it to files as you download them. Once created, the provenance metadata is propagated through to the final analysis dataset. The package can freeze "snapshots" of data that cannot be overwritten, making preservation easy. Finally, the package automatically tracks the steps taken in processing the raw files and stores them as additional metadata, so you can see and re-execute the steps that produced a given final dataset, even if the code used to create it has since changed.
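As a rough illustration of the sidecar-metadata idea, here is a minimal sketch of writing a `.meta.json` file next to a data file. The field names mirror the CLI's `tag` options shown later (`raw_file_url`, `raw_file_source`); the exact sidecar schema and this helper are assumptions, not the library's actual implementation.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def write_sidecar(data_path: Path, **fields) -> Path:
    """Write a .meta.json sidecar next to a data file (illustrative only)."""
    sidecar = data_path.parent / (data_path.name + ".meta.json")
    sidecar.write_text(json.dumps(fields, indent=2))
    return sidecar

with TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "data.csv"
    csv_path.write_text("date,value\n2024-01-01,1.0\n")
    sidecar = write_sidecar(
        csv_path,
        raw_file_url="https://example.com/data.csv",
        raw_file_source="Bureau of Labor Statistics",
    )
    # Read the sidecar back to confirm the metadata round-trips
    record = json.loads(sidecar.read_text())
```

Because the sidecar travels with the file, downstream steps can carry these fields forward into the final dataset's metadata.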
## Installation

Install the package from source:

```shell
git clone https://github.com/dgreenwald/provolone.git
cd provolone
pip install -e ".[dev]"
```
## Quick Start

### Loading Datasets

Load a dataset using the main API:

```python
import provolone

# Load a dataset
df = provolone.load("example")

# Load with metadata
df, metadata = provolone.load_with_metadata("example")

# List available datasets
datasets = provolone.list_datasets()
print(datasets)  # ['example']
```
### Using Snapshots

Snapshots allow you to freeze datasets at specific points in time:

```python
# Create a snapshot
provolone.freeze("example", snapshot="2024-01-15")

# Load from a snapshot
df = provolone.load("example", snapshot="2024-01-15")
```
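The immutability guarantee (and the `--force` escape hatch shown in the CLI section) can be sketched with a toy `freeze` function. This is a hypothetical stand-in, not the library's actual code; the real snapshots store full datasets and metadata.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def freeze(rows, name: str, label: str, root: Path, force: bool = False) -> Path:
    """Write a snapshot file; refuse to overwrite an existing one unless force=True."""
    path = root / name / f"{label}.json"
    if path.exists() and not force:
        raise FileExistsError(f"snapshot {label!r} already exists; pass force=True")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(rows))
    return path

with TemporaryDirectory() as tmp:
    root = Path(tmp)
    freeze([{"value": 1.0}], "example", "2024-01-15", root)
    try:
        # A second freeze with the same label must not silently clobber the first
        freeze([{"value": 2.0}], "example", "2024-01-15", root)
        overwritten = True
    except FileExistsError:
        overwritten = False
```

Refusing overwrites by default is what makes a frozen snapshot safe to cite in an analysis: the label always refers to the same bytes.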
## Command Line Interface

provolone provides a CLI for common operations:

```shell
# Build a dataset and display info
provolone build example

# Build with parameters
provolone build example --params vintage=2024

# Build and display the first 10 rows
provolone build example --head 10

# Create a snapshot (freeze a dataset at a point in time)
provolone freeze example --label 2024-01-15

# Create a snapshot with parameters
provolone freeze example --label prod-2024 --params vintage=2024

# Force overwrite of an existing snapshot
provolone freeze example --label 2024-01-15 --force

# List available snapshots for a dataset
provolone list example

# List snapshots from a custom directory
provolone list example --snapshot-dir /custom/path

# Display metadata for a cached dataset
provolone info example

# Display metadata for a specific snapshot
provolone info example --snapshot 2024-01-15

# Display metadata from a custom directory
provolone info example --snapshot-dir /custom/path

# Tag a file with metadata (creates a sidecar .meta.json file)
provolone tag data.csv --raw_file_url "https://example.com/data.csv"

# Tag with source and notes
provolone tag data.csv \
    --raw_file_source "Bureau of Labor Statistics" \
    --raw_file_notes "Downloaded on 2024-12-27"

# Download a file from a URL and automatically tag it
provolone download https://example.com/data.csv

# Download to a specific destination
provolone download https://example.com/data.csv --destination /path/to/file.csv

# Download with metadata
provolone download https://example.com/data.csv \
    --source "Bureau of Labor Statistics" \
    --notes "Production data 2024"
```
## Configuration

provolone uses environment variables for configuration:

```shell
export PROVOLONE_DATA_ROOT="~/data"                                  # Where raw data files are stored
export PROVOLONE_CACHE_DIR="~/.cache/provolone"                      # Cache directory
export PROVOLONE_SNAPSHOTS_DIR="~/.local/share/provolone/snapshots"  # Snapshots directory
export PROVOLONE_IO_FORMAT="parquet"                                 # File format: "parquet" or "feather"
export PROVOLONE_IO_COMPRESSION="zstd"                               # Compression: "zstd", "lz4", or None
```

You can also create a `.env` file in your project directory.
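The library manages these settings through Pydantic (see the Architecture section); as a rough, stdlib-only sketch of how environment variables with defaults behave, assuming the variable names and defaults shown above:

```python
import os
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class Settings:
    """Simplified stand-in for a Pydantic-style settings object."""
    data_root: Path = field(default_factory=lambda: Path(
        os.environ.get("PROVOLONE_DATA_ROOT", "~/data")).expanduser())
    io_format: str = field(default_factory=lambda:
        os.environ.get("PROVOLONE_IO_FORMAT", "parquet"))

# An exported variable overrides the default at construction time
os.environ["PROVOLONE_IO_FORMAT"] = "feather"
settings = Settings()
```

A real Pydantic `BaseSettings` class would additionally validate types and read the `.env` file mentioned above.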
## Creating Custom Datasets

To create a new dataset, inherit from `BaseDataset`:

```python
import pandas as pd

from provolone.datasets.base import BaseDataset
from provolone.datasets import register


@register("my_dataset")
class MyDataset(BaseDataset):
    name = "my_dataset"
    frequency = "m"  # monthly

    def fetch(self):
        """Download or locate raw data files."""
        # Return the path to the raw data, or None if the data is in-memory
        pass

    def parse(self, raw) -> pd.DataFrame:
        """Convert raw data to a DataFrame."""
        # Parse the raw data into a pandas DataFrame
        pass

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        """Apply dataset-specific transformations."""
        # Optional: apply custom transformations
        return df
```
Register your dataset in `pyproject.toml`:

```toml
[project.entry-points."provolone.datasets"]
my_dataset = "my_package.my_dataset.loader"
```
## Architecture

### Package Structure

- **Source Layout**: Uses a `src/provolone/` layout with an installable package
- **Tests**: Located in `tests/` with the pytest framework
- **Configuration**: Managed via `src/provolone/config.py` with Pydantic settings
- **CLI**: Available via `src/provolone/cli.py` using Typer
- **Datasets**: Plugin-based system in `src/provolone/datasets/`
- **Caching**: Data caching and snapshots via `src/provolone/cache.py` and `src/provolone/snapshots.py`
### Key Features

- **Intelligent Caching**: Datasets are automatically cached to avoid recomputation
- **Snapshot System**: Create immutable dataset versions with metadata
- **Plugin Architecture**: Easy to add new datasets via entry points
- **Format Support**: Supports Parquet and Feather with compression
- **Metadata Tracking**: Comprehensive metadata for data lineage and verification
- **CLI Interface**: Command-line tools for data operations
### Data Processing Pipeline

1. **Fetch**: Get raw data (files, APIs, etc.)
2. **Parse**: Convert to a pandas DataFrame
3. **Transform**: Apply dataset-specific processing
4. **Standardize**: Normalize columns and handle indexes
5. **Cache**: Store processed data for reuse
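The five stages above can be sketched as a template-method class. This is a hypothetical, stdlib-only illustration (the real `BaseDataset` operates on pandas DataFrames and persists its cache to disk):

```python
class PipelineDataset:
    """Template-method sketch of the fetch/parse/transform/standardize/cache stages."""

    _cache: dict = {}  # in-memory stand-in for the on-disk cache
    name = "demo"

    def fetch(self):
        # Raw data; here just an inline CSV string instead of a file or API call
        return "2024-02-01,2.0\n2024-01-01,1.0"

    def parse(self, raw):
        rows = [line.split(",") for line in raw.splitlines()]
        return [{"date": d, "value": float(v)} for d, v in rows]

    def transform(self, records):
        return records  # dataset-specific processing would go here

    def standardize(self, records):
        return sorted(records, key=lambda r: r["date"])  # normalize ordering

    def build(self):
        if self.name in self._cache:  # reuse cached result on repeat builds
            return self._cache[self.name]
        data = self.standardize(self.transform(self.parse(self.fetch())))
        self._cache[self.name] = data
        return data

result = PipelineDataset().build()
```

Running the stages in a fixed order is what lets the package record each step as provenance metadata: the recipe for a dataset is just the sequence of these calls plus their inputs.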
## Development

### Running Tests

```shell
pytest
```

### Code Quality

```shell
# Format code
black src/ tests/

# Lint code
ruff check src/ tests/

# Type checking
mypy src/
```
## License

MIT License