Python library for managing datasets with lineage tracking in Sunstone projects

sunstone-py

A Python library for managing datasets with lineage tracking in data science projects.

Requires Python 3.12+. License: MIT.

Features

  • Automatic Lineage Tracking: Track data provenance through all operations automatically
  • Dataset Management: Integration with datasets.yaml for organized dataset registration
  • Pandas-Compatible API: Familiar pandas-like interface via from sunstone import pandas as pd (CSV, Excel, JSON)
  • Strict/Relaxed Modes: Control whether operations can modify datasets.yaml
  • Validation Tools: Check notebooks and scripts for correct import usage
  • Full Type Hints: Complete type hint support for better IDE integration

Installation

# Using uv (recommended)
uv add sunstone-py

# Using pip
pip install sunstone-py

To use the latest commit from GitHub, declare the dependency in your project's pyproject.toml:

dependencies = [
    "sunstone-py @ git+https://github.com/sunstoneinstitute/sunstone-py.git",
]

If you are making changes to a local checkout of sunstone-py and want to test them from your project, add a [tool.uv.sources] override to your project's pyproject.toml:

[tool.uv.sources]
sunstone-py = { path = "../path/to/sunstone-py", editable = true }

The path is relative to your project's pyproject.toml. Leave the regular PyPI dependency in [project.dependencies] unchanged — the sources override takes precedence locally. Remember to remove the [tool.uv.sources] block before committing.

For Development

git clone https://github.com/sunstoneinstitute/sunstone-py.git
cd sunstone-py
uv venv
uv sync
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Quick Start

1. Set Up Your Project with datasets.yaml

Create a datasets.yaml file in your project directory:

inputs:
  - name: School Data
    slug: school-data
    location: data/schools.csv
    source:
      name: Ministry of Education
      location:
        data: https://example.com/schools.csv
      attributedTo: Ministry of Education
      acquiredAt: 2025-01-15
      acquisitionMethod: manual-download
      license: CC-BY-4.0
    fields:
      - name: school_id
        type: string
      - name: enrollment
        type: integer

outputs: []
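
Before reading data, it can help to sanity-check the shape of an entry. The sketch below works on an already-parsed mapping (for example, the result of a YAML loader) and uses only the standard library. The helper name `missing_keys` and the choice of required keys are assumptions made for illustration, based on the example entry above; they are not sunstone-py's own validation rules.

```python
# Hypothetical helper: not part of sunstone-py.
# Assumed minimum keys, taken from the example entry above.
REQUIRED_INPUT_KEYS = ("name", "slug", "location")
REQUIRED_FIELD_KEYS = ("name", "type")


def missing_keys(entry: dict) -> list[str]:
    """Return dotted key paths that are missing from one input entry."""
    missing = [key for key in REQUIRED_INPUT_KEYS if key not in entry]
    for index, field in enumerate(entry.get("fields", [])):
        for key in REQUIRED_FIELD_KEYS:
            if key not in field:
                missing.append(f"fields[{index}].{key}")
    return missing


# The same entry as in datasets.yaml above, written as a parsed mapping.
school_data = {
    "name": "School Data",
    "slug": "school-data",
    "location": "data/schools.csv",
    "fields": [
        {"name": "school_id", "type": "string"},
        {"name": "enrollment", "type": "integer"},
    ],
}

print(missing_keys(school_data))  # []
```

An empty list means the entry has every key this sketch looks for; it says nothing about whether the values themselves are valid.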

2. Use Pandas-Like API with Lineage Tracking

from sunstone import pandas as pd
from pathlib import Path

# Set project path (where datasets.yaml lives)
PROJECT_PATH = Path.cwd()

# Read data - lineage automatically tracked
df = pd.read_csv('data/schools.csv', project_path=PROJECT_PATH)

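# Transform using familiar pandas operations
# (assumes schools.csv has a 'district' column; reset_index keeps it as a
# column, because index=False below would otherwise drop it)
result = df[df['enrollment'] > 100].groupby('district').sum().reset_index()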

# Save with automatic lineage tracking and dataset registration
result.to_csv(
    'outputs/summary.csv',
    slug='school-summary',
    name='School Enrollment Summary',
    index=False
)

3. Check Lineage Metadata

# View lineage information
print(result.lineage.sources)      # Source datasets
print(result.lineage.operations)   # Operations performed
print(result.lineage.get_licenses())  # All source licenses
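
To make the idea of lineage metadata concrete, here is a minimal stand-in written with dataclasses. The classes `SourceRecord` and `LineageSketch` and their fields are hypothetical; only the names `sources`, `operations`, and `get_licenses()` come from the snippet above, and sunstone-py's real objects may be structured differently.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SourceRecord:
    """Hypothetical record of one input dataset."""
    slug: str
    license: str


@dataclass
class LineageSketch:
    """Hypothetical stand-in for the object behind `result.lineage`."""
    sources: list[SourceRecord] = field(default_factory=list)
    operations: list[str] = field(default_factory=list)

    def get_licenses(self) -> list[str]:
        # Unique licenses across all sources, in first-seen order.
        return list(dict.fromkeys(s.license for s in self.sources))


lineage = LineageSketch(
    sources=[SourceRecord("school-data", "CC-BY-4.0")],
    operations=["read_csv data/schools.csv", "filter enrollment > 100"],
)
print(lineage.get_licenses())  # ['CC-BY-4.0']
```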

Core Concepts

Pandas-Like API

sunstone-py provides a drop-in replacement for pandas that adds lineage tracking:

from sunstone import pandas as pd

# Works like pandas, but tracks lineage
df = pd.read_csv('input.csv', project_path='/path/to/project')
df2 = pd.read_csv('input2.csv', project_path='/path/to/project')

# All pandas operations work
filtered = df[df['value'] > 100]
grouped = df.groupby('category').sum()

# Merge/join operations combine lineage from both sources
merged = pd.merge(df, df2, on='key')
concatenated = pd.concat([df, df2])
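
One way to picture "combined lineage" is as a union of the two inputs' sources plus a new entry in the operation log. The function below is a sketch under that assumption, using plain dictionaries; `combine_lineage`, the dictionary layout, and the `district-data` slug are hypothetical and do not describe how sunstone-py stores lineage internally.

```python
def combine_lineage(left: dict, right: dict, operation: str) -> dict:
    """Combine two lineage-like dicts and record the combining operation."""
    return {
        # Union of sources, first-seen order, without repeats.
        "sources": list(dict.fromkeys([*left["sources"], *right["sources"]])),
        # Keep both histories, then append the new step.
        "operations": [*left["operations"], *right["operations"], operation],
    }


left = {"sources": ["school-data"], "operations": ["read_csv input.csv"]}
right = {
    "sources": ["district-data", "school-data"],
    "operations": ["read_csv input2.csv"],
}

merged = combine_lineage(left, right, "merge on 'key'")
print(merged["sources"])  # ['school-data', 'district-data']
```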

Strict vs Relaxed Mode

Relaxed Mode (default):

  • Writing to new outputs auto-registers them in datasets.yaml
  • More flexible for exploratory work

Strict Mode:

  • All reads and writes must be pre-registered in datasets.yaml
  • Ensures complete documentation of data operations
  • Enable via the strict=True parameter or the SUNSTONE_DATAFRAME_STRICT=1 environment variable

# Enable strict mode
df = pd.read_csv('data.csv', project_path=PROJECT_PATH, strict=True)

# Or globally
import os
os.environ['SUNSTONE_DATAFRAME_STRICT'] = '1'
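
The difference between the two modes on a write can be sketched as a small decision function. Everything here is illustrative: `register_output` and `StrictModeViolation` are hypothetical names, and the real library may check more than the slug.

```python
class StrictModeViolation(Exception):
    """Hypothetical stand-in for an error raised in strict mode."""


def register_output(registered: set[str], slug: str, strict: bool) -> set[str]:
    """Return the set of registered slugs after a write to `slug`."""
    if slug in registered:
        return registered  # already documented, nothing to do
    if strict:
        # Strict mode: unregistered outputs are rejected.
        raise StrictModeViolation(f"output '{slug}' is not pre-registered")
    # Relaxed mode: the new output is auto-registered.
    return registered | {slug}


relaxed = register_output(set(), "school-summary", strict=False)
print(relaxed)  # {'school-summary'}

try:
    register_output(set(), "school-summary", strict=True)
except StrictModeViolation as exc:
    print(f"blocked: {exc}")
```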

Validation Tools

Check notebooks for correct import usage:

import sunstone

# Check a single notebook
result = sunstone.check_notebook_imports('analysis.ipynb')
print(result.summary())

# Check all notebooks in project
results = sunstone.validate_project_notebooks('/path/to/project')
for path, result in results.items():
    if not result.is_valid:
        print(f"\n{path}:")
        print(result.summary())
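
For intuition, a check like this can be approximated with the standard library alone. The sketch below assumes that "correct import usage" means importing pandas through sunstone rather than directly; that reading, and the function name `plain_pandas_imports`, are assumptions, and the checks sunstone-py actually performs may differ.

```python
import ast
import json


def plain_pandas_imports(notebook_json: str) -> list[int]:
    """Return indexes of code cells that import pandas directly."""
    notebook = json.loads(notebook_json)
    flagged = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # notebooks may store source as a list of lines
            source = "".join(source)
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue  # e.g. cells with IPython magics; skipped in this sketch
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.level == 0:
                names = [node.module or ""]
            else:
                continue
            if any(name.split(".")[0] == "pandas" for name in names):
                flagged.append(index)
                break
    return flagged


example = json.dumps({
    "cells": [
        {"cell_type": "code", "source": ["from sunstone import pandas as pd\n"]},
        {"cell_type": "markdown", "source": ["# Notes"]},
        {"cell_type": "code", "source": "import pandas as pd"},
    ]
})
print(plain_pandas_imports(example))  # [2]
```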

Advanced Usage

Direct DataFrame API

For more control, use the DataFrame class directly:

from sunstone import DataFrame

# Read with explicit parameters
df = DataFrame.read_csv(
    'data.csv',
    project_path='/path/to/project',
    strict=True
)

# Apply custom operations with lineage tracking
result = df.apply_operation(
    lambda d: d[d['value'] > 100],
    description="Filter high-value rows"
)

# Access underlying pandas DataFrame
pandas_df = result.data
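
The pattern behind apply_operation (run a callable, remember a description of it) can be shown with a toy wrapper over a list of rows. `TrackedRows` is a hypothetical class written for this illustration; it borrows only the method name and its two arguments from the example above and is not sunstone-py's DataFrame.

```python
from dataclasses import dataclass, field
from typing import Callable

Rows = list[dict]


@dataclass
class TrackedRows:
    """Hypothetical wrapper: rows plus a log of operation descriptions."""
    data: Rows
    operations: list[str] = field(default_factory=list)

    def apply_operation(
        self, operation: Callable[[Rows], Rows], description: str
    ) -> "TrackedRows":
        # Run the callable on the wrapped data and append the description;
        # the original object is left unchanged.
        return TrackedRows(operation(self.data), [*self.operations, description])


rows = TrackedRows([{"value": 50}, {"value": 150}])
high = rows.apply_operation(
    lambda d: [r for r in d if r["value"] > 100],
    description="Filter high-value rows",
)
print(high.data)        # [{'value': 150}]
print(high.operations)  # ['Filter high-value rows']
```

Returning a new object rather than mutating in place keeps each intermediate result paired with the exact history that produced it.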

Managing datasets.yaml Programmatically

from sunstone import DatasetsManager, FieldSchema

manager = DatasetsManager('/path/to/project')

# Find datasets
dataset = manager.find_dataset_by_slug('school-data')
dataset = manager.find_dataset_by_location('data/schools.csv')

# Add new output dataset
manager.add_output_dataset(
    name='Analysis Results',
    slug='analysis-results',
    location='outputs/results.csv',
    fields=[
        FieldSchema(name='category', type='string'),
        FieldSchema(name='count', type='integer'),
        FieldSchema(name='avg_value', type='number')
    ],
    publish=True
)

Documentation

API Reference

pandas Module

Drop-in replacement for pandas with lineage tracking:

  • read_csv(filepath, project_path, strict=False, **kwargs): Read CSV with lineage
  • read_excel(filepath, project_path, strict=False, **kwargs): Read Excel (.xlsx/.xls) with lineage
  • read_json(filepath, project_path, strict=False, **kwargs): Read JSON with lineage
  • merge(left, right, **kwargs): Merge DataFrames with combined lineage
  • concat(dfs, **kwargs): Concatenate DataFrames with combined lineage

DataFrame Class

Main class for working with data:

  • read_csv(filepath, project_path, strict=False, **kwargs): Read CSV with lineage tracking
  • read_excel(filepath, project_path, strict=False, **kwargs): Read Excel with lineage tracking
  • to_csv(path, slug, name, publish=False, **kwargs): Write CSV and register
  • merge(right, **kwargs): Merge with another DataFrame
  • join(other, **kwargs): Join with another DataFrame
  • concat(others, **kwargs): Concatenate DataFrames
  • apply_operation(operation, description): Apply transformation with lineage
  • .data: Access underlying pandas DataFrame
  • .lineage: Access lineage metadata

DatasetsManager Class

Manage datasets.yaml files:

  • find_dataset_by_location(location, dataset_type='input'): Find by file path
  • find_dataset_by_slug(slug, dataset_type='input'): Find by slug
  • get_all_inputs(): Get all input datasets
  • get_all_outputs(): Get all output datasets
  • add_output_dataset(...): Register new output
  • update_output_dataset(...): Update existing output

Validation Functions

  • check_notebook_imports(notebook_path): Validate a single notebook
  • validate_project_notebooks(project_path): Validate all notebooks in project

Exceptions

  • SunstoneError: Base exception
  • DatasetNotFoundError: Dataset not found in datasets.yaml
  • StrictModeError: Operation blocked in strict mode
  • DatasetValidationError: Validation failed
  • LineageError: Lineage tracking error
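
A shared base exception lets callers handle a specific failure first and fall back to a catch-all. The snippet below defines local stand-ins with the same names so that it runs without sunstone-py installed; that the specific exceptions subclass SunstoneError is inferred from the "Base exception" description above, and the `load` and `describe` helpers are hypothetical.

```python
class SunstoneError(Exception):
    """Local stand-in for the base exception."""


class DatasetNotFoundError(SunstoneError):
    """Local stand-in: dataset not found in datasets.yaml."""


def load(slug: str, registered: set[str]) -> str:
    """Hypothetical loader that only checks registration."""
    if slug not in registered:
        raise DatasetNotFoundError(f"no dataset with slug '{slug}'")
    return slug


def describe(slug: str) -> str:
    try:
        load(slug, {"school-data"})
    except DatasetNotFoundError as exc:  # most specific handler first
        return f"missing: {exc}"
    except SunstoneError as exc:  # any other library error
        return f"other sunstone error: {exc}"
    return "ok"


print(describe("school-data"))  # ok
print(describe("unknown"))      # missing: no dataset with slug 'unknown'
```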

Environment Variables

  • SUNSTONE_DATAFRAME_STRICT: Set to "1" or "true" to enable strict mode globally
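
A flag like this is typically read as follows. The sketch accepts "1" or "true" as listed above; ignoring case and surrounding whitespace is an extra assumption, and `strict_from_env` is a hypothetical helper rather than a function exported by sunstone-py.

```python
import os
from collections.abc import Mapping


def strict_from_env(environ: Mapping[str, str] = os.environ) -> bool:
    """Return True when SUNSTONE_DATAFRAME_STRICT is set to "1" or "true"."""
    value = environ.get("SUNSTONE_DATAFRAME_STRICT", "")
    # Assumption: case and surrounding whitespace are ignored.
    return value.strip().lower() in {"1", "true"}


print(strict_from_env({"SUNSTONE_DATAFRAME_STRICT": "1"}))  # True
print(strict_from_env({}))                                  # False
```

Passing the mapping as a parameter keeps the function easy to test without touching the real process environment.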

Development

See CONTRIBUTING.md for development setup and guidelines.

Running Tests

uv run pytest

Type Checking

uv run mypy

Linting and Formatting

uv run ruff check
uv run ruff format

About Sunstone Institute

Sunstone Institute is a philanthropy-funded organization using data and AI to show the world as it really is, and inspire action everywhere.

License

MIT License - see LICENSE file for details.

Made with ❤️ by Sunstone Institute
