sunstone-py
A Python library for managing datasets with lineage tracking in data science projects.
Features
- Automatic Lineage Tracking: Track data provenance through all operations automatically
- Dataset Management: Integration with `datasets.yaml` for organized dataset registration
- Pandas-Compatible API: Familiar pandas-like interface via `from sunstone import pandas as pd` (CSV, Excel, JSON)
- Plugin System: Extensible architecture for custom auth providers, URL handlers, and format handlers via entry points
- Strict/Relaxed Modes: Control whether operations can modify `datasets.yaml`
- Validation Tools: Check notebooks and scripts for correct import usage
- Full Type Hints: Complete type hint support for better IDE integration
Installation
```bash
# Using uv (recommended)
uv add sunstone-py

# Using pip
pip install sunstone-py
```
To use the latest commit from GitHub:

```toml
dependencies = [
    "sunstone-py @ git+https://github.com/sunstoneinstitute/sunstone-py.git",
]
```
If you are making changes to a local checkout of sunstone-py and want to test them
from your project, add a `[tool.uv.sources]` override to your project's `pyproject.toml`:
```toml
[tool.uv.sources]
sunstone-py = { path = "../path/to/sunstone-py", editable = true }
```
The path is relative to your project's `pyproject.toml`. Leave the regular PyPI dependency
in `[project.dependencies]` unchanged; the sources override takes precedence locally.
Remember to remove the `[tool.uv.sources]` block before committing.
For Development
```bash
git clone https://github.com/sunstoneinstitute/sunstone-py.git
cd sunstone-py
uv venv
uv sync
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```
Quick Start
1. Set Up Your Project with datasets.yaml
Create a `datasets.yaml` file in your project directory:

```yaml
inputs:
  - name: School Data
    slug: school-data
    location: data/schools.csv
    source:
      name: Ministry of Education
      location:
        data: https://example.com/schools.csv
      attributedTo: Ministry of Education
      acquiredAt: 2025-01-15
      acquisitionMethod: manual-download
    license: CC-BY-4.0
    fields:
      - name: school_id
        type: string
      - name: enrollment
        type: integer
outputs: []
```
2. Use Pandas-Like API with Lineage Tracking
```python
from sunstone import pandas as pd
from pathlib import Path

# Set project path (where datasets.yaml lives)
PROJECT_PATH = Path.cwd()

# Read data - lineage automatically tracked
df = pd.read_csv('data/schools.csv', project_path=PROJECT_PATH)

# Transform using familiar pandas operations
result = df[df['enrollment'] > 100].groupby('district').sum()

# Save with automatic lineage tracking and dataset registration
result.to_csv(
    'outputs/summary.csv',
    slug='school-summary',
    name='School Enrollment Summary',
    index=False,
)
```
3. Check Lineage Metadata
```python
# View lineage information
print(result.lineage.sources)         # Source datasets
print(result.lineage.operations)      # Operations performed
print(result.lineage.get_licenses())  # All source licenses
```
Core Concepts
Pandas-Like API
sunstone-py provides a drop-in replacement for pandas that adds lineage tracking:
```python
from sunstone import pandas as pd

# Works like pandas, but tracks lineage
df = pd.read_csv('input.csv', project_path='/path/to/project')
df2 = pd.read_csv('input2.csv', project_path='/path/to/project')

# All pandas operations work
filtered = df[df['value'] > 100]
grouped = df.groupby('category').sum()

# Merge/join operations combine lineage from both sources
merged = pd.merge(df, df2, on='key')
concatenated = pd.concat([df, df2])
```
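Conceptually, the combined lineage of a merge or concat is the union of the inputs' sources plus the new operation. The following is a minimal, self-contained sketch of that idea; the `Lineage` class here is a hypothetical stand-in, not sunstone's actual internals:

```python
from dataclasses import dataclass, field


@dataclass
class Lineage:
    """Hypothetical stand-in for a lineage record."""
    sources: set[str] = field(default_factory=set)
    operations: list[str] = field(default_factory=list)

    def combine(self, other: "Lineage", op: str) -> "Lineage":
        # A merge/concat result descends from every source of both inputs.
        return Lineage(
            sources=self.sources | other.sources,
            operations=self.operations + other.operations + [op],
        )


a = Lineage({"input.csv"}, ["read_csv"])
b = Lineage({"input2.csv"}, ["read_csv"])
merged_lineage = a.combine(b, "merge")
print(sorted(merged_lineage.sources))  # ['input.csv', 'input2.csv']
```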
Strict vs Relaxed Mode
Relaxed Mode (default):
- Writing to new outputs auto-registers them in `datasets.yaml`
- More flexible for exploratory work

Strict Mode:
- All reads and writes must be pre-registered in `datasets.yaml`
- Ensures complete documentation of data operations
- Enable via the `strict=True` parameter or the `SUNSTONE_DATAFRAME_STRICT=1` environment variable
```python
# Enable strict mode
df = pd.read_csv('data.csv', project_path=PROJECT_PATH, strict=True)

# Or globally
import os
os.environ['SUNSTONE_DATAFRAME_STRICT'] = '1'
```
Validation Tools
Check notebooks for correct import usage:
```python
import sunstone

# Check a single notebook
result = sunstone.check_notebook_imports('analysis.ipynb')
print(result.summary())

# Check all notebooks in project
results = sunstone.validate_project_notebooks('/path/to/project')
for path, result in results.items():
    if not result.is_valid:
        print(f"\n{path}:")
        print(result.summary())
```
Plugin System
sunstone-py uses a plugin architecture for reading, writing, and fetching data. Built-in handlers cover common formats (CSV, JSON, Excel, Parquet, TSV) and HTTP/HTTPS, local file, GCS, and S3/R2 URLs.
Plugin Protocols
Plugins implement one or more of these protocols:
- `AuthProvider`: Injects authentication headers into HTTP requests
- `URLHandler`: Opens URLs for reading/writing, returning file-like streams (`BinaryIO`/`TextIO`)
- `FormatHandler`: Reads and writes data formats not built into sunstone
Installation Extras
```bash
pip install sunstone-py          # Core + HTTP + local file handling
pip install sunstone-py[gcs]     # Adds GCS (gs://) support
pip install sunstone-py[s3]      # Adds S3 (s3://) and R2 (r2://) support
pip install sunstone-py[gcs,s3]  # Both
```
Registering Custom Plugins
Plugins are discovered via Python entry points:
```toml
[project.entry-points."sunstone.plugins"]
my-plugin = "my_package:MyPlugin"
```
Plugin Configuration
Plugin config uses cascading precedence (later sources override earlier):
1. `datasets.yaml`: the `plugins.<name>` section
2. `pyproject.toml`: the `[tool.sunstone.plugins.<name>]` table
3. Environment variables: `SUNSTONE_PLUGIN_<NAME>_<KEY>`
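The cascade amounts to a layered dict merge where later layers win. A conceptual sketch of that behavior (not the library's actual resolver; the keys and values are made up):

```python
def resolve_plugin_config(*layers: dict) -> dict:
    """Merge config layers; later layers override earlier ones."""
    merged: dict = {}
    for layer in layers:
        merged.update(layer)
    return merged


yaml_cfg = {"bucket": "raw-data", "timeout": 30}  # from datasets.yaml
pyproject_cfg = {"timeout": 60}                   # from pyproject.toml
env_cfg = {"bucket": "override-bucket"}           # from SUNSTONE_PLUGIN_* vars

print(resolve_plugin_config(yaml_cfg, pyproject_cfg, env_cfg))
# {'bucket': 'override-bucket', 'timeout': 60}
```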
Advanced Usage
Direct DataFrame API
For more control, use the DataFrame class directly:
```python
from sunstone import DataFrame

# Read with explicit parameters
df = DataFrame.read_csv(
    'data.csv',
    project_path='/path/to/project',
    strict=True,
)

# Access underlying pandas DataFrame
pandas_df = df.data
```
DataFrame Metadata
Set metadata on DataFrames that flows through to datasets.yaml on write:
```python
from sunstone import pandas as pd

df = pd.read_csv('input.csv', project_path=PROJECT_PATH)
result = df[df['value'] > 100]

# Set output identity and description
result.metadata.slug = "filtered-data"
result.metadata.name = "Filtered Data"
result.metadata.description = "Values above threshold"

# Set RDF metadata
result.metadata.rdf_prefixes = {"schema": "https://schema.org/"}
result.metadata.custom_properties = {"schema:about": "Analysis"}

# Annotate columns
result.set_field_metadata("value", description="Measured value", unit="kg")

# Write - slug/name come from metadata
result.to_csv('outputs/filtered.csv', index=False)
```
Available metadata:
- `df.metadata.slug`: Dataset slug (used at write time)
- `df.metadata.name`: Dataset name (used at write time)
- `df.metadata.description`: Dataset description
- `df.metadata.rdf_prefixes`: RDF namespace prefixes
- `df.metadata.custom_properties`: Custom properties (RDF-style)
- `df.set_field_metadata(column, *, description, unit, source, type, constraints)`: Annotate a column
Managing datasets.yaml Programmatically
```python
from sunstone import DatasetsManager, FieldSchema

manager = DatasetsManager('/path/to/project')

# Find datasets
dataset = manager.find_dataset_by_slug('school-data')
dataset = manager.find_dataset_by_location('data/schools.csv')

# Add new output dataset
manager.add_output_dataset(
    name='Analysis Results',
    slug='analysis-results',
    location='outputs/results.csv',
    fields=[
        FieldSchema(name='category', type='string'),
        FieldSchema(name='count', type='integer'),
        FieldSchema(name='avg_value', type='number'),
    ],
    publish=True,
)
```
Documentation
- Contributing Guide
- Changelog
- API Reference (below)
API Reference
pandas Module
Drop-in replacement for pandas with lineage tracking:
- `read_csv(filepath, project_path, strict=False, **kwargs)`: Read CSV with lineage
- `read_excel(filepath, project_path, strict=False, **kwargs)`: Read Excel (.xlsx/.xls) with lineage
- `read_json(filepath, project_path, strict=False, **kwargs)`: Read JSON with lineage
- `merge(left, right, **kwargs)`: Merge DataFrames with combined lineage
- `concat(dfs, **kwargs)`: Concatenate DataFrames with combined lineage
DataFrame Class
Main class for working with data:
- `read_csv(filepath, project_path, strict=False, **kwargs)`: Read CSV with lineage tracking
- `read_excel(filepath, project_path, strict=False, **kwargs)`: Read Excel with lineage tracking
- `to_csv(path, slug, name, publish=False, **kwargs)`: Write CSV and register
- `merge(right, **kwargs)`: Merge with another DataFrame
- `join(other, **kwargs)`: Join with another DataFrame
- `concat(others, **kwargs)`: Concatenate DataFrames
- `set_field_metadata(column, **kwargs)`: Annotate column metadata
- `.data`: Access underlying pandas DataFrame
- `.metadata`: Access unified metadata container
- `.lineage`: Access lineage metadata (deprecated; use `.metadata.lineage`)
DatasetsManager Class
Manage datasets.yaml files:
- `find_dataset_by_location(location, dataset_type='input')`: Find by file path
- `find_dataset_by_slug(slug, dataset_type='input')`: Find by slug
- `get_all_inputs()`: Get all input datasets
- `get_all_outputs()`: Get all output datasets
- `add_output_dataset(...)`: Register new output
- `update_output_dataset(...)`: Update existing output
Validation Functions
- `check_notebook_imports(notebook_path)`: Validate a single notebook
- `validate_project_notebooks(project_path)`: Validate all notebooks in a project
Plugin Protocols
- `AuthProvider`: Implement `authenticate(url, headers, dataset) -> headers` to inject auth
- `URLHandler`: Implement `can_handle(url) -> bool` and `open(url, mode) -> BinaryIO | TextIO`
- `FormatHandler`: Implement `can_read(path, format)`, `read(stream, **kwargs)`, `can_write(path, format)`, `write(df, stream, **kwargs)`
PluginRegistry Class
Singleton that discovers and manages plugins:
- `PluginRegistry.get()`: Get the singleton registry instance
- `get_auth_providers()`: Return all registered auth providers
- `get_url_handlers()`: Return all registered URL handlers
- `get_format_handlers()`: Return all registered format handlers
- `find_url_handler(url)`: Find the first handler that can handle a URL
- `find_format_reader(path, format)`: Find the first handler that can read a file
- `find_format_writer(path, format)`: Find the first handler that can write a file
- `fetch(url, dest)`: Convenience helper that downloads a URL to a local file via `open()`
Exceptions
- `SunstoneError`: Base exception
- `DatasetNotFoundError`: Dataset not found in `datasets.yaml`
- `StrictModeError`: Operation blocked in strict mode
- `DatasetValidationError`: Validation failed
- `LineageError`: Lineage tracking error
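Since `SunstoneError` is the base exception, a single `except` clause can cover every library error. A self-contained sketch of that pattern; the class definitions and `write_output` helper here are stand-ins that only assume the hierarchy listed above, the real classes come from the `sunstone` package:

```python
class SunstoneError(Exception):
    """Stand-in for sunstone's base exception."""


class StrictModeError(SunstoneError):
    """Stand-in: operation blocked in strict mode."""


def write_output(strict: bool) -> str:
    # Hypothetical helper: simulates a write that strict mode would block.
    if strict:
        raise StrictModeError("output not registered in datasets.yaml")
    return "written"


try:
    write_output(strict=True)
except SunstoneError as exc:
    # Catching the base class covers every sunstone error type.
    print(f"sunstone error: {exc}")
```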
Environment Variables
- `SUNSTONE_DATAFRAME_STRICT`: Set to `"1"` or `"true"` to enable strict mode globally
- `SUNSTONE_PLUGIN_<NAME>_<KEY>`: Override plugin configuration (highest precedence)
Development
See CONTRIBUTING.md for development setup and guidelines.
Running Tests
```bash
uv run pytest
```
Type Checking
```bash
uv run mypy
```
Linting and Formatting
```bash
uv run ruff check
uv run ruff format
```
About Sunstone Institute
Sunstone Institute is a philanthropy-funded organization using data and AI to show the world as it really is, and inspire action everywhere.
License
MIT License - see LICENSE file for details.
Support
- Issues: GitHub Issues
Made with ❤️ by Sunstone Institute