
Schema Lineage Composite Evaluation - A Python package for evaluating schema lineage extraction accuracy


SLiCE: Schema Lineage Composite Evaluation

PyPI version · Python 3.9+ · License: MIT · Paper: arXiv

SLiCE is a Python package for evaluating schema lineage extraction accuracy by comparing model predictions with gold standards. It scores each lineage component (source schema, source tables, transformations, and aggregations) and combines the component scores into a single weighted overall score.

Features

  • Component-wise Evaluation: Separate scoring for source schema, source tables, transformations, and aggregations
  • Multiple Similarity Metrics: BLEU scores, fuzzy matching, F1 scores, and AST-based similarity
  • Flexible Weighting: Customizable weights for different components and metrics
  • Multi-language Support: Handles Python, SQL, and C# code in transformations
  • Sample Data Module: Built-in access to curated datasets for testing and demonstration
  • Batch Processing: Parallel evaluation of multiple lineage pairs
  • Command Line Interface: Easy-to-use CLI for quick evaluations

Installation

From PyPI (recommended)

pip install slice-score

From Source

git clone https://github.com/microsoft/SLiCE.git
cd SLiCE
pip install -e .

Development Installation

For development with all testing and linting tools:

git clone https://github.com/microsoft/SLiCE.git
cd SLiCE

# Using pip
pip install -e ".[dev]"

# Using uv (recommended - faster)
uv sync --extra dev

Quick Start

Python API

from slice import SchemaLineageEvaluator

# Initialize evaluator
evaluator = SchemaLineageEvaluator()

# Example lineage data: the model's prediction
predicted = {
    "source_schema": "cuisine_type",
    "source_table": "restaurants.ss",
    "transformation": "R.cuisine_type AS CuisineType",
    "aggregation": "COUNT() GROUP BY restaurant_id"
}

# Gold-standard lineage (no aggregation expected)
ground_truth = {
    "source_schema": "cuisine_type",
    "source_table": "restaurants.ss",
    "transformation": "R.cuisine_type AS CuisineType",
    "aggregation": ""
}

# Evaluate
results = evaluator.evaluate(predicted, ground_truth)
print(f"Overall Score: {results['overall']:.4f}")

Command Line Interface

# Basic evaluation
slice-eval predicted.json ground_truth.json

# With custom weights
slice-eval --weights source_table=0.5,transformation=0.3,aggregation=0.2 predicted.json ground_truth.json

# Include metadata evaluation
slice-eval --metadata predicted.json ground_truth.json

# Save results to file
slice-eval predicted.json ground_truth.json --output results.txt
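
The `--weights` value above is a comma-separated list of `component=value` pairs. As a rough illustration of how that string lines up with the `weights` dictionary used in the Python API, the sketch below parses it by hand. `parse_weights` is a hypothetical helper written for this example; it is not part of SLiCE, and the CLI may parse or validate the option differently.

```python
def parse_weights(spec: str) -> dict[str, float]:
    """Parse 'name=value,name=value' into a {name: float} dict.

    Hypothetical helper for illustration only; not SLiCE's own parser.
    """
    weights: dict[str, float] = {}
    for item in spec.split(","):
        name, sep, value = item.partition("=")
        if not sep or not name.strip():
            raise ValueError(f"expected 'component=value', got {item!r}")
        weights[name.strip()] = float(value)
    return weights


weights = parse_weights("source_table=0.5,transformation=0.3,aggregation=0.2")
print(weights)
# → {'source_table': 0.5, 'transformation': 0.3, 'aggregation': 0.2}
```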

Data Format

SLiCE expects lineage data as dictionaries with the following structure:

{
    "source_schema": "column_name",
    "source_table": "table_references",
    "transformation": "transformation_logic",
    "aggregation": "aggregation_operations",
    "metadata": "additional_metadata (optional)"
}
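
Before evaluating, it can help to check that each record has the keys shown above. The sketch below is a minimal check, assuming that the four non-metadata keys are all required and that every value is a string (an empty string standing for "none", as in the Quick Start example). `check_lineage` is a hypothetical helper, not a SLiCE function, and SLiCE's own format check may be stricter or more lenient.

```python
REQUIRED_KEYS = ("source_schema", "source_table", "transformation", "aggregation")
OPTIONAL_KEYS = ("metadata",)


def check_lineage(record: dict) -> list[str]:
    """Return a list of problems found in a lineage record (empty if none)."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in record:
            problems.append(f"missing key: {key}")
        elif not isinstance(record[key], str):
            problems.append(f"{key} should be a string")
    for key in record:
        if key not in REQUIRED_KEYS + OPTIONAL_KEYS:
            problems.append(f"unexpected key: {key}")
    return problems


record = {
    "source_schema": "cuisine_type",
    "source_table": "restaurants.ss",
    "transformation": "R.cuisine_type AS CuisineType",
    "aggregation": "",
}
print(check_lineage(record))                            # → []
print(check_lineage({"source_schema": "cuisine_type"}))
# → ['missing key: source_table', 'missing key: transformation', 'missing key: aggregation']
```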

Evaluation Metrics

Component Scores

  • Source Schema: Exact match of schema/column names
  • Source Table: F1 score + fuzzy matching of table references
  • Transformation: BLEU + weighted BLEU + AST similarity
  • Aggregation: BLEU + weighted BLEU + AST similarity
  • Metadata: BLEU + weighted BLEU + AST similarity (optional)
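
To make the Source Table entry more concrete, here is a minimal sketch of a set-based F1 score and a character-level fuzzy ratio over table references, using only the standard library. It illustrates the two ideas in general; it is not SLiCE's implementation, and how the package tokenizes table references or combines the two numbers is not shown here. The table name `reviews.ss` is made up for the example.

```python
from difflib import SequenceMatcher


def table_f1(predicted: set[str], gold: set[str]) -> float:
    """Set-based F1 between predicted and gold table names."""
    if not predicted and not gold:
        return 1.0  # nothing expected, nothing predicted
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def fuzzy_ratio(a: str, b: str) -> float:
    """Case-insensitive character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


pred_tables = {"restaurants.ss", "reviews.ss"}  # one extra (hypothetical) table
gold_tables = {"restaurants.ss"}

# precision = 1/2, recall = 1/1
print(round(table_f1(pred_tables, gold_tables), 4))                # → 0.6667
print(round(fuzzy_ratio("restaurants.ss", "Restaurants.ss"), 4))   # → 1.0
```

F1 rewards exact set overlap, while a fuzzy ratio gives partial credit for near-miss names such as casing or small spelling differences.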

Overall Score

The final score combines component scores using configurable weights:

Overall = format_correctness × source_schema × (
    w₁ × source_table_score + 
    w₂ × transformation_score + 
    w₃ × aggregation_score +
    w₄ × metadata_score  # if applicable
)

Default weights: source_table=0.4, transformation=0.4, aggregation=0.2
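
As a worked example of the formula, the sketch below plugs in hypothetical component scores (chosen by hand, not produced by SLiCE) together with the default weights. `overall_score` is an illustrative function that mirrors the formula as written above; it is not the package's internal code, and it leaves out the optional metadata term.

```python
from typing import Optional

DEFAULT_WEIGHTS = {"source_table": 0.4, "transformation": 0.4, "aggregation": 0.2}


def overall_score(
    format_correctness: float,
    source_schema: float,
    components: dict[str, float],
    weights: Optional[dict[str, float]] = None,
) -> float:
    """Gate the weighted component sum by format correctness and schema match."""
    weights = DEFAULT_WEIGHTS if weights is None else weights
    weighted_sum = sum(w * components[name] for name, w in weights.items())
    return format_correctness * source_schema * weighted_sum


# Hypothetical scores: tables and transformation match, aggregation does not.
scores = {"source_table": 1.0, "transformation": 1.0, "aggregation": 0.0}

print(f"{overall_score(1.0, 1.0, scores):.4f}")  # → 0.8000  (0.4 + 0.4 + 0.0)
print(f"{overall_score(1.0, 0.0, scores):.4f}")  # → 0.0000  (schema mismatch)
```

Because `format_correctness` and `source_schema` multiply the weighted sum, a value of 0 for either one gives an overall score of 0 regardless of the other components.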

Configuration

Custom Weights

from slice import SchemaLineageEvaluator

# Component weights
weights = {
    'source_table': 0.5,
    'transformation': 0.3,
    'aggregation': 0.2
}

# Metric weights for transformations
transformation_weights = {
    'bleu': 0.6,
    'weighted_bleu': 0.3,
    'ast': 0.1
}

evaluator = SchemaLineageEvaluator(
    weights=weights,
    transformation_weights=transformation_weights
)
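
Both the default weights and the custom weights above sum to 1.0. This README does not say whether that is required, so the following is only a convenience sketch, assuming you want to keep weights on that scale while experimenting. `normalize_weights` is a hypothetical helper, not part of SLiCE.

```python
import math


def normalize_weights(weights: dict[str, float]) -> dict[str, float]:
    """Rescale non-negative weights so that they sum to 1.0."""
    total = sum(weights.values())
    if total <= 0 or any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative with a positive sum")
    return {name: w / total for name, w in weights.items()}


raw = {"source_table": 5, "transformation": 3, "aggregation": 2}
weights = normalize_weights(raw)
print(weights)
# → {'source_table': 0.5, 'transformation': 0.3, 'aggregation': 0.2}
print(math.isclose(sum(weights.values()), 1.0))  # → True
```

The resulting dictionary has the same shape as the `weights` argument shown above.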

Language Support

from slice import SchemaLineageEvaluator

# Custom syntax keyword sets for each supported language
evaluator = SchemaLineageEvaluator(
    sql_syntax={'SELECT', 'FROM', 'WHERE'},
    python_syntax={'def', 'class', 'import'},
    csharp_syntax={'using', 'namespace', 'class'}
)

Examples

See the examples/ directory for complete usage examples:

  • basic_usage.py: Basic evaluation with default settings
  • custom_weights.py: Using custom weights and configurations
  • batch_evaluation.py: Processing multiple lineage pairs
  • sample_data_usage.py: Using package sample data for evaluation

Testing

Run the test suite:

# Using pip
pip install -e ".[dev]"
pytest

# Using uv (recommended)
uv sync --extra dev
uv run pytest

# Run with coverage
uv run pytest --cov=slice

# Run specific test file
uv run pytest tests/test_schema_lineage_evaluator.py -v

# Code quality checks
uv run black slice/ tests/     # Format code
uv run flake8 slice/           # Lint code
uv run mypy slice/             # Type checking

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Add tests for your changes
  5. Run the test suite (pytest)
  6. Commit your changes (git commit -m 'Add amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use SLiCE in your research, please cite:

@software{slice2025,
  title={SLiCE: Schema Lineage Composite Evaluation},
  author={Jiaqi Yin and Yi-Wei Chen and Meng-Lung Lee and Xiya Liu},
  year={2025},
  url={https://github.com/microsoft/SLiCE}
}

@misc{yin2025schemalineageextractionscale,
      title={Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks}, 
      author={Jiaqi Yin and Yi-Wei Chen and Meng-Lung Lee and Xiya Liu},
      year={2025},
      eprint={2508.07179},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.07179}, 
}

Support
