
Package to create and track provenance metadata for datasets


provolone

A Python library to create and maintain provenance-related metadata while processing raw files into analysis datasets.

This package provides a complete pipeline for provenance-related metadata. You can create initial metadata for existing files, or attach it to files as you download them. Once created, the provenance metadata is propagated through to the final analysis dataset. The package can also freeze "snapshots" of data that cannot be overwritten, making preservation easy. Finally, it automatically records the steps taken while processing the raw files and stores them as additional metadata, so you can inspect and re-execute the steps that produced a given final dataset from raw files, even if the code used to create it has since changed.

Installation

Install the package from source:

git clone https://github.com/dgreenwald/provolone.git
cd provolone
pip install -e ".[dev]"

Quick Start

Loading Datasets

Load a dataset using the main API:

import provolone

# Load a dataset
df = provolone.load("example")

# Load with metadata
df, metadata = provolone.load_with_metadata("example")

# List available datasets
datasets = provolone.list_datasets()
print(datasets)  # ['example']

Using Snapshots

Snapshots allow you to freeze datasets at specific points in time:

# Create a snapshot
provolone.freeze("example", snapshot="2024-01-15")

# Load from a snapshot
df = provolone.load("example", snapshot="2024-01-15")

Command Line Interface

provolone provides a CLI for common operations:

# Build a dataset and display info
provolone build example

# Build with parameters
provolone build example --params vintage=2024

# Build and display the first 10 rows
provolone build example --head 10

# Create a snapshot (freeze dataset at a point in time)
provolone freeze example --label 2024-01-15

# Create a snapshot with parameters
provolone freeze example --label prod-2024 --params vintage=2024

# Force overwrite an existing snapshot
provolone freeze example --label 2024-01-15 --force

# List available snapshots for a dataset
provolone list example

# List snapshots from a custom directory
provolone list example --snapshot-dir /custom/path

# Display metadata information for a cached dataset
provolone info example

# Display metadata for a specific snapshot
provolone info example --snapshot 2024-01-15

# Display metadata from a custom directory
provolone info example --snapshot-dir /custom/path

# Tag a file with metadata (creates sidecar .meta.json file)
provolone tag data.csv --raw_file_url "https://example.com/data.csv"

# Tag with source and notes
provolone tag data.csv \
  --raw_file_source "Bureau of Labor Statistics" \
  --raw_file_notes "Downloaded on 2024-12-27"

# Download a file from URL and automatically tag it
provolone download https://example.com/data.csv

# Download to a specific destination
provolone download https://example.com/data.csv --destination /path/to/file.csv

# Download with metadata
provolone download https://example.com/data.csv \
  --source "Bureau of Labor Statistics" \
  --notes "Production data 2024"
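The `tag` command above writes metadata to a sidecar `.meta.json` file next to the data file. Assuming the sidecar is named by appending `.meta.json` to the data file's name (an illustrative convention; check the file the command actually generates), reading one back generically might look like:

```python
import json
import tempfile
from pathlib import Path

def read_sidecar(data_path):
    """Read sidecar metadata for a data file (hypothetical <file>.meta.json naming)."""
    meta_path = Path(str(data_path) + ".meta.json")
    if not meta_path.exists():
        return None
    return json.loads(meta_path.read_text())

# Demonstrate with a throwaway directory and a hand-written sidecar
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data.csv"
    data.write_text("a,b\n1,2\n")
    (Path(tmp) / "data.csv.meta.json").write_text(
        json.dumps({"raw_file_url": "https://example.com/data.csv"})
    )
    meta = read_sidecar(data)
```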

Configuration

provolone uses environment variables for configuration:

export PROVOLONE_DATA_ROOT="~/data"              # Where raw data files are stored
export PROVOLONE_CACHE_DIR="~/.cache/provolone"    # Cache directory
export PROVOLONE_SNAPSHOTS_DIR="~/.local/share/provolone/snapshots"  # Snapshots directory
export PROVOLONE_IO_FORMAT="parquet"             # File format: "parquet" or "feather"  
export PROVOLONE_IO_COMPRESSION="zstd"           # Compression: "zstd", "lz4", or None

You can also create a .env file in your project directory.
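A minimal `.env` might mirror the environment variables above (the values are illustrative, not defaults):

```shell
# .env -- picked up from the project directory
PROVOLONE_DATA_ROOT=~/data
PROVOLONE_CACHE_DIR=~/.cache/provolone
PROVOLONE_IO_FORMAT=parquet
PROVOLONE_IO_COMPRESSION=zstd
```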

Creating Custom Datasets

To create a new dataset, inherit from BaseDataset:

from provolone.datasets.base import BaseDataset
from provolone.datasets import register
import pandas as pd

@register("my_dataset")
class MyDataset(BaseDataset):
    name = "my_dataset"
    frequency = "m"  # monthly

    def fetch(self):
        """Download or locate raw data files."""
        # Return a path to the raw data, or None if the data is built in-memory
        return None

    def parse(self, raw) -> pd.DataFrame:
        """Convert raw data to a DataFrame."""
        # Parse the raw data into a pandas DataFrame
        return pd.DataFrame({"value": [1.0, 2.0, 3.0]})

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        """Apply dataset-specific transformations."""
        # Optional: apply custom transformations
        return df

Register your dataset in pyproject.toml:

[project.entry-points."provolone.datasets"]
my_dataset = "my_package.my_dataset.loader"

Architecture

Package Structure

  • Source Layout: Uses src/provolone/ layout with installable package
  • Tests: Located in tests/ with pytest framework
  • Configuration: Managed via src/provolone/config.py with Pydantic settings
  • CLI: Available via src/provolone/cli.py using Typer
  • Datasets: Plugin-based system in src/provolone/datasets/
  • Caching: Data caching and snapshots via src/provolone/cache.py and src/provolone/snapshots.py

Key Features

  1. Intelligent Caching: Datasets are automatically cached to avoid recomputation
  2. Snapshot System: Create immutable dataset versions with metadata
  3. Plugin Architecture: Easy to add new datasets via entry points
  4. Format Support: Supports Parquet and Feather with compression
  5. Metadata Tracking: Comprehensive metadata for data lineage and verification
  6. CLI Interface: Command-line tools for data operations

Data Processing Pipeline

  1. Fetch: Get raw data (files, APIs, etc.)
  2. Parse: Convert to pandas DataFrame
  3. Transform: Apply dataset-specific processing
  4. Standardize: Normalize columns, handle indexes
  5. Cache: Store processed data for reuse
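The five stages above can be sketched as a simple driver loop. The method names mirror the custom-dataset example earlier; the `standardize` step and the dict-based cache are illustrative stand-ins, not provolone's actual implementation:

```python
import pandas as pd

_CACHE = {}  # illustrative in-memory cache keyed by dataset name

def build(dataset):
    """Run the fetch -> parse -> transform -> standardize -> cache pipeline."""
    if dataset.name in _CACHE:
        return _CACHE[dataset.name]                      # 5. reuse cached result
    raw = dataset.fetch()                                # 1. get raw data
    df = dataset.parse(raw)                              # 2. convert to DataFrame
    df = dataset.transform(df)                           # 3. dataset-specific processing
    df = df.rename(columns=str.lower).sort_index()       # 4. standardize (illustrative)
    _CACHE[dataset.name] = df                            # 5. cache for reuse
    return df

class ToyDataset:
    name = "toy"
    def fetch(self):
        return None
    def parse(self, raw):
        return pd.DataFrame({"Value": [3.0, 1.0, 2.0]})
    def transform(self, df):
        return df

result = build(ToyDataset())
```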

Development

Running Tests

pytest

Code Quality

# Format code
black src/ tests/

# Lint code  
ruff check src/ tests/

# Type checking
mypy src/

License

MIT License
