Python library for managing datasets with lineage tracking in Sunstone projects
sunstone-py
A Python library for managing datasets with lineage tracking in data science projects.
Features
- Automatic Lineage Tracking: Track data provenance through all operations automatically
- Dataset Management: Integration with datasets.yaml for organized dataset registration
- Pandas-Compatible API: Familiar pandas-like interface via from sunstone import pandas as pd
- Strict/Relaxed Modes: Control whether operations can modify datasets.yaml
- Validation Tools: Check notebooks and scripts for correct import usage
- Full Type Hints: Complete type hint support for better IDE integration
Installation
# Using uv (recommended)
uv add sunstone-py
# Using pip
pip install sunstone-py
To use the latest commit from GitHub, add it to the dependencies in your pyproject.toml:
dependencies = [
    "sunstone-py @ git+https://github.com/sunstoneinstitute/sunstone-py.git",
]
If you are making changes to sunstone-py checked out at ~/git/sunstone-py and testing them
directly from your project:
dependencies = [
    "sunstone-py @ file://${HOME}/git/sunstone-py"
]
For Development
git clone https://github.com/sunstoneinstitute/sunstone-py.git
cd sunstone-py
uv venv
uv sync
source .venv/bin/activate # On Windows: .venv\Scripts\activate
Quick Start
1. Set Up Your Project with datasets.yaml
Create a datasets.yaml file in your project directory:
inputs:
  - name: School Data
    slug: school-data
    location: data/schools.csv
    source:
      name: Ministry of Education
      location:
        data: https://example.com/schools.csv
      attributedTo: Ministry of Education
      acquiredAt: 2025-01-15
      acquisitionMethod: manual-download
      license: CC-BY-4.0
    fields:
      - name: school_id
        type: string
      - name: enrollment
        type: integer
outputs: []
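Before running notebooks, it can help to confirm that each input entry carries the keys used in the example above. The following is a minimal sketch that works on an already-parsed mapping (what a YAML parser would return for datasets.yaml). The names REQUIRED_INPUT_KEYS and missing_keys are hypothetical, and the set of required keys is an assumption drawn from the example, not a documented sunstone-py schema.

```python
# Illustrative only: not part of sunstone-py. The required keys are an
# assumption drawn from the example above, not a documented schema.
REQUIRED_INPUT_KEYS = {"name", "slug", "location", "fields"}


def missing_keys(entry: dict) -> set:
    """Return the required keys that an input entry lacks."""
    return REQUIRED_INPUT_KEYS - set(entry)


# Parsed form of the example datasets.yaml (source block omitted for brevity).
datasets = {
    "inputs": [
        {
            "name": "School Data",
            "slug": "school-data",
            "location": "data/schools.csv",
            "fields": [
                {"name": "school_id", "type": "string"},
                {"name": "enrollment", "type": "integer"},
            ],
        }
    ],
    "outputs": [],
}

# Map each slug to its missing keys, then keep only the incomplete entries.
problems = {e.get("slug", "<no slug>"): missing_keys(e) for e in datasets["inputs"]}
incomplete = {slug: keys for slug, keys in problems.items() if keys}
print(incomplete)  # an empty dict means every input has the expected keys
```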
2. Use Pandas-Like API with Lineage Tracking
from sunstone import pandas as pd
from pathlib import Path
# Set project path (where datasets.yaml lives)
PROJECT_PATH = Path.cwd()
# Read data - lineage automatically tracked
df = pd.read_csv('data/schools.csv', project_path=PROJECT_PATH)
# Transform using familiar pandas operations
result = df[df['enrollment'] > 100].groupby('district').sum()
# Save with automatic lineage tracking and dataset registration
result.to_csv(
    'outputs/summary.csv',
    slug='school-summary',
    name='School Enrollment Summary',
    index=False
)
3. Check Lineage Metadata
# View lineage information
print(result.lineage.sources) # Source datasets
print(result.lineage.operations) # Operations performed
print(result.lineage.get_licenses()) # All source licenses
Core Concepts
Pandas-Like API
sunstone-py provides a drop-in replacement for pandas that adds lineage tracking:
from sunstone import pandas as pd
# Works like pandas, but tracks lineage
df = pd.read_csv('input.csv', project_path='/path/to/project')
df2 = pd.read_csv('input2.csv', project_path='/path/to/project')
# All pandas operations work
filtered = df[df['value'] > 100]
grouped = df.groupby('category').sum()
# Merge/join operations combine lineage from both sources
merged = pd.merge(df, df2, on='key')
concatenated = pd.concat([df, df2])
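To make "combine lineage from both sources" concrete, here is a toy model in plain Python. It is a sketch of the idea only: ToyLineage, its combine method, and the district-data slug are hypothetical, and sunstone-py's real lineage objects may be structured differently.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyLineage:
    """Hypothetical stand-in for lineage metadata; not sunstone-py's class."""

    sources: List[str] = field(default_factory=list)     # dataset slugs
    operations: List[str] = field(default_factory=list)  # human-readable steps

    def combine(self, other: "ToyLineage", operation: str) -> "ToyLineage":
        # Keep each source once, preserving first-seen order.
        merged_sources = list(dict.fromkeys(self.sources + other.sources))
        return ToyLineage(
            sources=merged_sources,
            operations=self.operations + other.operations + [operation],
        )


left = ToyLineage(sources=["school-data"], operations=["read_csv"])
right = ToyLineage(sources=["district-data"], operations=["read_csv"])
merged = left.combine(right, "merge on key")

print(merged.sources)     # both inputs are retained
print(merged.operations)  # earlier steps plus the merge itself
```

The point of the sketch is that a merged result can still answer "which datasets did this come from?" because neither side's history is discarded.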
Strict vs Relaxed Mode
Relaxed Mode (default):
- Writing to new outputs auto-registers them in datasets.yaml
- More flexible for exploratory work
Strict Mode:
- All reads and writes must be pre-registered in datasets.yaml
- Ensures complete documentation of data operations
- Enable via the strict=True parameter or the SUNSTONE_DATAFRAME_STRICT=1 environment variable
# Enable strict mode
df = pd.read_csv('data.csv', project_path=PROJECT_PATH, strict=True)
# Or globally
import os
os.environ['SUNSTONE_DATAFRAME_STRICT'] = '1'
Validation Tools
Check notebooks for correct import usage:
import sunstone
# Check a single notebook
result = sunstone.check_notebook_imports('analysis.ipynb')
print(result.summary())
# Check all notebooks in project
results = sunstone.validate_project_notebooks('/path/to/project')
for path, result in results.items():
    if not result.is_valid:
        print(f"\n{path}:")
        print(result.summary())
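For intuition about what an import check involves, the sketch below scans a notebook's JSON for code cells that import pandas directly rather than through sunstone. It uses only the standard library; find_plain_pandas_imports is a hypothetical helper written for illustration and is not how sunstone-py implements its validation.

```python
import ast
import json


def find_plain_pandas_imports(notebook_json: str) -> list:
    """Return (cell_index, line_number) pairs where pandas is imported directly.

    Hypothetical helper for illustration; not sunstone-py's implementation.
    """
    notebook = json.loads(notebook_json)
    hits = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # notebooks may store source as a list of lines
            source = "".join(source)
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue  # e.g. cells containing IPython magics
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                # Matches "import pandas" and "import pandas as pd".
                if any(alias.name.split(".")[0] == "pandas" for alias in node.names):
                    hits.append((index, node.lineno))
            elif isinstance(node, ast.ImportFrom):
                # Matches "from pandas import ..." but not "from sunstone import pandas".
                if node.level == 0 and (node.module or "").split(".")[0] == "pandas":
                    hits.append((index, node.lineno))
    return hits


example = json.dumps({
    "cells": [
        {"cell_type": "code", "source": ["from sunstone import pandas as pd\n"]},
        {"cell_type": "code", "source": ["import os\n", "import pandas as pd\n"]},
        {"cell_type": "markdown", "source": ["import pandas"]},
    ]
})
print(find_plain_pandas_imports(example))  # [(1, 2)]
```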
Advanced Usage
Direct DataFrame API
For more control, use the DataFrame class directly:
from sunstone import DataFrame
# Read with explicit parameters
df = DataFrame.read_csv(
    'data.csv',
    project_path='/path/to/project',
    strict=True
)
# Apply custom operations with lineage tracking
result = df.apply_operation(
    lambda d: d[d['value'] > 100],
    description="Filter high-value rows"
)
# Access underlying pandas DataFrame
pandas_df = result.data
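The pattern behind apply_operation (run a function on the wrapped data and record a description of what it did) can be sketched without pandas. ToyFrame below is a hypothetical stand-in that wraps a list of row dicts; it only illustrates the idea and makes no claim about how sunstone's DataFrame is implemented.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToyFrame:
    """Hypothetical wrapper: rows plus a log of described operations."""

    data: List[dict]
    operations: List[str] = field(default_factory=list)

    def apply_operation(
        self,
        operation: Callable[[List[dict]], List[dict]],
        description: str,
    ) -> "ToyFrame":
        # Run the transformation on the wrapped data and return a new wrapper
        # whose log has the description appended; the original is untouched.
        return ToyFrame(operation(self.data), self.operations + [description])


frame = ToyFrame([{"value": 50}, {"value": 150}])
high = frame.apply_operation(
    lambda rows: [r for r in rows if r["value"] > 100],
    description="Filter high-value rows",
)

print(high.data)         # only the row with value 150 remains
print(high.operations)   # the description was recorded
print(frame.operations)  # the original wrapper's log is unchanged
```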
Managing datasets.yaml Programmatically
from sunstone import DatasetsManager, FieldSchema
manager = DatasetsManager('/path/to/project')
# Find datasets
dataset = manager.find_dataset_by_slug('school-data')
dataset = manager.find_dataset_by_location('data/schools.csv')
# Add new output dataset
manager.add_output_dataset(
    name='Analysis Results',
    slug='analysis-results',
    location='outputs/results.csv',
    fields=[
        FieldSchema(name='category', type='string'),
        FieldSchema(name='count', type='integer'),
        FieldSchema(name='avg_value', type='number')
    ],
    publish=True
)
Documentation
- Contributing Guide
- Changelog
- API Reference (below)
API Reference
pandas Module
Drop-in replacement for pandas with lineage tracking:
- read_csv(filepath, project_path, strict=False, **kwargs): Read CSV with lineage
- read_json(filepath, project_path, strict=False, **kwargs): Read JSON with lineage
- merge(left, right, **kwargs): Merge DataFrames with combined lineage
- concat(dfs, **kwargs): Concatenate DataFrames with combined lineage
DataFrame Class
Main class for working with data:
- read_csv(filepath, project_path, strict=False, **kwargs): Read CSV with lineage tracking
- to_csv(path, slug, name, publish=False, **kwargs): Write CSV and register
- merge(right, **kwargs): Merge with another DataFrame
- join(other, **kwargs): Join with another DataFrame
- concat(others, **kwargs): Concatenate DataFrames
- apply_operation(operation, description): Apply transformation with lineage
- .data: Access underlying pandas DataFrame
- .lineage: Access lineage metadata
DatasetsManager Class
Manage datasets.yaml files:
- find_dataset_by_location(location, dataset_type='input'): Find by file path
- find_dataset_by_slug(slug, dataset_type='input'): Find by slug
- get_all_inputs(): Get all input datasets
- get_all_outputs(): Get all output datasets
- add_output_dataset(...): Register new output
- update_output_dataset(...): Update existing output
Validation Functions
- check_notebook_imports(notebook_path): Validate a single notebook
- validate_project_notebooks(project_path): Validate all notebooks in project
Exceptions
- SunstoneError: Base exception
- DatasetNotFoundError: Dataset not found in datasets.yaml
- StrictModeError: Operation blocked in strict mode
- DatasetValidationError: Validation failed
- LineageError: Lineage tracking error
Environment Variables
- SUNSTONE_DATAFRAME_STRICT: Set to "1" or "true" to enable strict mode globally
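If your own scripts need to know whether strict mode has been requested, the flag can be read as shown in this sketch. strict_from_env is a hypothetical helper, not a sunstone-py function, and the case-insensitive, whitespace-tolerant matching is an assumption; the library's own parsing may be stricter.

```python
import os
from typing import Mapping


def strict_from_env(environ: Mapping[str, str] = os.environ) -> bool:
    """Hypothetical helper: True when the variable is "1" or "true".

    Case-insensitive matching is an assumption made for this sketch.
    """
    value = environ.get("SUNSTONE_DATAFRAME_STRICT", "")
    return value.strip().lower() in {"1", "true"}


# Passing a plain dict keeps the example independent of the real environment.
print(strict_from_env({"SUNSTONE_DATAFRAME_STRICT": "1"}))  # True
print(strict_from_env({}))                                  # False
```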
Development
See CONTRIBUTING.md for development setup and guidelines.
Running Tests
uv run pytest
Type Checking
uv run mypy src/sunstone
Linting and Formatting
uv run ruff check src/sunstone
uv run ruff format src/sunstone
About Sunstone Institute
Sunstone Institute is a philanthropy-funded organization using data and AI to show the world as it really is, and inspire action everywhere.
License
MIT License - see LICENSE file for details.
Support
- Issues: GitHub Issues
Made with ❤️ by Sunstone Institute