sunstone-py
A Python library for managing datasets with lineage tracking in data science projects.
Features
- Automatic Lineage Tracking: Track data provenance through all operations automatically
- Dataset Management: Integration with `datasets.yaml` for organized dataset registration
- Pandas-Compatible API: Familiar pandas-like interface via `from sunstone import pandas as pd` (CSV, Excel, JSON)
- Plugin System: Extensible architecture for custom auth providers, URL handlers, and format handlers via entry points
- Strict/Relaxed Modes: Control whether operations can modify `datasets.yaml`
- Validation Tools: Check notebooks and scripts for correct import usage
- Full Type Hints: Complete type hint support for better IDE integration
Installation
```bash
# Using uv (recommended)
uv add sunstone-py

# Using pip
pip install sunstone-py
```
To use the latest commit from GitHub:

```toml
dependencies = [
    "sunstone-py @ git+https://github.com/sunstoneinstitute/sunstone-py.git",
]
```
If you are making changes to a local checkout of sunstone-py and want to test them
from your project, add a `[tool.uv.sources]` override to your project's `pyproject.toml`:
```toml
[tool.uv.sources]
sunstone-py = { path = "../path/to/sunstone-py", editable = true }
```
The path is relative to your project's `pyproject.toml`. Leave the regular PyPI dependency
in `[project.dependencies]` unchanged; the sources override takes precedence locally.
Remember to remove the `[tool.uv.sources]` block before committing.
For Development
```bash
git clone https://github.com/sunstoneinstitute/sunstone-py.git
cd sunstone-py
uv venv
uv sync
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```
Quick Start
1. Set Up Your Project with datasets.yaml
Create a `datasets.yaml` file in your project directory:

```yaml
inputs:
  - name: School Data
    slug: school-data
    location: data/schools.csv
    source:
      name: Ministry of Education
      location:
        data: https://example.com/schools.csv
      attributedTo: Ministry of Education
      acquiredAt: 2025-01-15
      acquisitionMethod: manual-download
    license: CC-BY-4.0
    fields:
      - name: school_id
        type: string
      - name: enrollment
        type: integer
outputs: []
```
2. Use Pandas-Like API with Lineage Tracking
```python
from sunstone import pandas as pd
from pathlib import Path

# Set project path (where datasets.yaml lives)
PROJECT_PATH = Path.cwd()

# Read data - lineage automatically tracked
df = pd.read_csv('data/schools.csv', project_path=PROJECT_PATH)

# Transform using familiar pandas operations
result = df[df['enrollment'] > 100].groupby('district').sum()

# Save with automatic lineage tracking and dataset registration
result.to_csv(
    'outputs/summary.csv',
    slug='school-summary',
    name='School Enrollment Summary',
    index=False,
)
```
3. Check Lineage Metadata
```python
# View lineage information
print(result.lineage.sources)         # Source datasets
print(result.lineage.operations)      # Operations performed
print(result.lineage.get_licenses())  # All source licenses
```
Core Concepts
Pandas-Like API
sunstone-py provides a drop-in replacement for pandas that adds lineage tracking:
```python
from sunstone import pandas as pd

# Works like pandas, but tracks lineage
df = pd.read_csv('input.csv', project_path='/path/to/project')
df2 = pd.read_csv('input2.csv', project_path='/path/to/project')

# All pandas operations work
filtered = df[df['value'] > 100]
grouped = df.groupby('category').sum()

# Merge/join operations combine lineage from both sources
merged = pd.merge(df, df2, on='key')
concatenated = pd.concat([df, df2])
```
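Conceptually, the combined lineage of a merge or concat is the union of the inputs' sources plus the new operation. The following is a minimal, self-contained sketch of that idea; the `Lineage` class here is a hypothetical stand-in, not sunstone's actual internals:

```python
from dataclasses import dataclass, field


@dataclass
class Lineage:
    """Hypothetical stand-in for a lineage record."""
    sources: set[str] = field(default_factory=set)
    operations: list[str] = field(default_factory=list)

    def combine(self, other: "Lineage", op: str) -> "Lineage":
        # A merge/concat result descends from every source of both inputs.
        return Lineage(
            sources=self.sources | other.sources,
            operations=self.operations + other.operations + [op],
        )


a = Lineage({"input.csv"}, ["read_csv"])
b = Lineage({"input2.csv"}, ["read_csv"])
merged_lineage = a.combine(b, "merge")
print(sorted(merged_lineage.sources))  # ['input.csv', 'input2.csv']
```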
Strict vs Relaxed Mode
Relaxed Mode (default):
- Writing to new outputs auto-registers them in `datasets.yaml`
- More flexible for exploratory work

Strict Mode:
- All reads and writes must be pre-registered in `datasets.yaml`
- Ensures complete documentation of data operations
- Enable via the `strict=True` parameter or the `SUNSTONE_DATAFRAME_STRICT=1` environment variable
```python
# Enable strict mode
df = pd.read_csv('data.csv', project_path=PROJECT_PATH, strict=True)

# Or globally
import os
os.environ['SUNSTONE_DATAFRAME_STRICT'] = '1'
```
Validation Tools
Check notebooks for correct import usage:
```python
import sunstone

# Check a single notebook
result = sunstone.check_notebook_imports('analysis.ipynb')
print(result.summary())

# Check all notebooks in project
results = sunstone.validate_project_notebooks('/path/to/project')
for path, result in results.items():
    if not result.is_valid:
        print(f"\n{path}:")
        print(result.summary())
```
Plugin System
sunstone-py uses a plugin architecture for reading, writing, and fetching data. Built-in handlers cover common formats (CSV, JSON, Excel, Parquet, TSV) and HTTP/HTTPS, local file, GCS, and S3/R2 URLs.
Plugin Protocols
Plugins implement one or more of these protocols:
- `AuthProvider`: Injects authentication headers into HTTP requests
- `URLHandler`: Opens URLs for reading/writing, returning file-like streams (`BinaryIO`/`TextIO`)
- `FormatHandler`: Reads and writes data formats not built into sunstone
Installation Extras
```bash
pip install sunstone-py          # Core + HTTP + local file handling
pip install sunstone-py[gcs]     # Adds GCS (gs://) support
pip install sunstone-py[s3]      # Adds S3 (s3://) and R2 (r2://) support
pip install sunstone-py[gcs,s3]  # Both
```
Registering Custom Plugins
Plugins are discovered via Python entry points:
```toml
[project.entry-points."sunstone.plugins"]
my-plugin = "my_package:MyPlugin"
```
Plugin Configuration
Plugin config uses cascading precedence (later sources override earlier):
1. `datasets.yaml`: the `plugins.<name>` section
2. `pyproject.toml`: the `[tool.sunstone.plugins.<name>]` table
3. Environment variables: `SUNSTONE_PLUGIN_<NAME>_<KEY>`
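The cascade amounts to a layered dict merge where later layers win. A conceptual sketch of that behavior (not the library's actual resolver; the keys and values are made up):

```python
def resolve_plugin_config(*layers: dict) -> dict:
    """Merge config layers; later layers override earlier ones."""
    merged: dict = {}
    for layer in layers:
        merged.update(layer)
    return merged


yaml_cfg = {"bucket": "raw-data", "timeout": 30}  # from datasets.yaml
pyproject_cfg = {"timeout": 60}                   # from pyproject.toml
env_cfg = {"bucket": "override-bucket"}           # from SUNSTONE_PLUGIN_* vars

print(resolve_plugin_config(yaml_cfg, pyproject_cfg, env_cfg))
# {'bucket': 'override-bucket', 'timeout': 60}
```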
Advanced Usage
Direct DataFrame API
For more control, use the DataFrame class directly:
```python
from sunstone import DataFrame

# Read with explicit parameters
df = DataFrame.read_csv(
    'data.csv',
    project_path='/path/to/project',
    strict=True,
)

# Access underlying pandas DataFrame
pandas_df = df.data
```
DataFrame Metadata
Set metadata on DataFrames that flows through to datasets.yaml on write:
```python
from sunstone import pandas as pd

df = pd.read_csv('input.csv', project_path=PROJECT_PATH)
result = df[df['value'] > 100]

# Set output identity and description
result.metadata.slug = "filtered-data"
result.metadata.name = "Filtered Data"
result.metadata.description = "Values above threshold"

# Set RDF metadata
result.metadata.rdf_prefixes = {"schema": "https://schema.org/"}
result.metadata.custom_properties = {"schema:about": "Analysis"}

# Annotate columns
result.set_field_metadata("value", description="Measured value", unit="kg")

# Write - slug/name come from metadata
result.to_csv('outputs/filtered.csv', index=False)
```
Available metadata:
- `df.metadata.slug`: Dataset slug (used at write time)
- `df.metadata.name`: Dataset name (used at write time)
- `df.metadata.description`: Dataset description
- `df.metadata.rdf_prefixes`: RDF namespace prefixes
- `df.metadata.custom_properties`: Custom properties (RDF-style)
- `df.set_field_metadata(column, *, description, unit, source, type, constraints)`: Annotate a column
Managing datasets.yaml Programmatically
```python
from sunstone import DatasetsManager, FieldSchema

manager = DatasetsManager('/path/to/project')

# Find datasets
dataset = manager.find_dataset_by_slug('school-data')
dataset = manager.find_dataset_by_location('data/schools.csv')

# Add new output dataset
manager.add_output_dataset(
    name='Analysis Results',
    slug='analysis-results',
    location='outputs/results.csv',
    fields=[
        FieldSchema(name='category', type='string'),
        FieldSchema(name='count', type='integer'),
        FieldSchema(name='avg_value', type='number'),
    ],
    publish=True,
)
```
Documentation
- Contributing Guide
- Changelog
- API Reference (below)
API Reference
pandas Module
Drop-in replacement for pandas with lineage tracking:
- `read_csv(filepath, project_path, strict=False, **kwargs)`: Read CSV with lineage
- `read_excel(filepath, project_path, strict=False, **kwargs)`: Read Excel (.xlsx/.xls) with lineage
- `read_json(filepath, project_path, strict=False, **kwargs)`: Read JSON with lineage
- `merge(left, right, **kwargs)`: Merge DataFrames with combined lineage
- `concat(dfs, **kwargs)`: Concatenate DataFrames with combined lineage
DataFrame Class
Main class for working with data:
- `read_csv(filepath, project_path, strict=False, **kwargs)`: Read CSV with lineage tracking
- `read_excel(filepath, project_path, strict=False, **kwargs)`: Read Excel with lineage tracking
- `to_csv(path, slug, name, publish=False, **kwargs)`: Write CSV and register
- `merge(right, **kwargs)`: Merge with another DataFrame
- `join(other, **kwargs)`: Join with another DataFrame
- `concat(others, **kwargs)`: Concatenate DataFrames
- `set_field_metadata(column, **kwargs)`: Annotate column metadata
- `.data`: Access underlying pandas DataFrame
- `.metadata`: Access unified metadata container
- `.lineage`: Access lineage metadata (deprecated; use `.metadata.lineage`)
DatasetsManager Class
Manage datasets.yaml files:
- `find_dataset_by_location(location, dataset_type='input')`: Find by file path
- `find_dataset_by_slug(slug, dataset_type='input')`: Find by slug
- `get_all_inputs()`: Get all input datasets
- `get_all_outputs()`: Get all output datasets
- `add_output_dataset(...)`: Register new output
- `update_output_dataset(...)`: Update existing output
Validation Functions
- `check_notebook_imports(notebook_path)`: Validate a single notebook
- `validate_project_notebooks(project_path)`: Validate all notebooks in a project
Plugin Protocols
- `AuthProvider`: Implement `authenticate(url, headers, dataset) -> headers` to inject auth
- `URLHandler`: Implement `can_handle(url) -> bool` and `open(url, mode) -> BinaryIO | TextIO`
- `FormatHandler`: Implement `can_read(path, format)`, `read(stream, **kwargs)`, `can_write(path, format)`, `write(df, stream, **kwargs)`
PluginRegistry Class
Singleton that discovers and manages plugins:
- `PluginRegistry.get()`: Get the singleton registry instance
- `get_auth_providers()`: Return all registered auth providers
- `get_url_handlers()`: Return all registered URL handlers
- `get_format_handlers()`: Return all registered format handlers
- `find_url_handler(url)`: Find the first handler that can handle a URL
- `find_format_reader(path, format)`: Find the first handler that can read a file
- `find_format_writer(path, format)`: Find the first handler that can write a file
- `fetch(url, dest)`: Convenience helper that downloads a URL to a local file via `open()`
Exceptions
- `SunstoneError`: Base exception
- `DatasetNotFoundError`: Dataset not found in `datasets.yaml`
- `StrictModeError`: Operation blocked in strict mode
- `DatasetValidationError`: Validation failed
- `LineageError`: Lineage tracking error
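Since `SunstoneError` is the base exception, a single `except` clause can cover every library error. A self-contained sketch of that pattern; the class definitions and `write_output` helper here are stand-ins that only assume the hierarchy listed above, the real classes come from the `sunstone` package:

```python
class SunstoneError(Exception):
    """Stand-in for sunstone's base exception."""


class StrictModeError(SunstoneError):
    """Stand-in: operation blocked in strict mode."""


def write_output(strict: bool) -> str:
    # Hypothetical helper: simulates a write that strict mode would block.
    if strict:
        raise StrictModeError("output not registered in datasets.yaml")
    return "written"


try:
    write_output(strict=True)
except SunstoneError as exc:
    # Catching the base class covers every sunstone error type.
    print(f"sunstone error: {exc}")
```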
Environment Variables
- `SUNSTONE_DATAFRAME_STRICT`: Set to `"1"` or `"true"` to enable strict mode globally
- `SUNSTONE_PLUGIN_<NAME>_<KEY>`: Override plugin configuration (highest precedence)
Development
See CONTRIBUTING.md for development setup and guidelines.
Running Tests
```bash
uv run pytest
```
Type Checking
```bash
uv run mypy
```
Linting and Formatting
```bash
uv run ruff check
uv run ruff format
```
About Sunstone Institute
Sunstone Institute is a philanthropy-funded organization using data and AI to show the world as it really is, and inspire action everywhere.
License
MIT License - see LICENSE file for details.
Support
- Issues: GitHub Issues
Made with ❤️ by Sunstone Institute