Schema Lineage Composite Evaluation - A Python package for evaluating schema lineage extraction accuracy
SLiCE: Schema Lineage Composite Evaluation
SLiCE is a Python package for evaluating schema lineage extraction accuracy by comparing model predictions with gold standards. It provides comprehensive metrics for assessing the quality of schema lineage extraction in data pipeline analysis.
Features
- Component-wise Evaluation: Separate scoring for source schema, source tables, transformations, and aggregations
- Multiple Similarity Metrics: BLEU scores, fuzzy matching, F1 scores, and AST-based similarity
- Flexible Weighting: Customizable weights for different components and metrics
- Multi-language Support: Handles Python, SQL, and C# code in transformations
- Sample Data Module: Built-in access to curated datasets for testing and demonstration
- Batch Processing: Parallel evaluation of multiple lineage pairs
- Command Line Interface: Easy-to-use CLI for quick evaluations
Installation
From PyPI (recommended)
pip install slice-score
From Source
git clone https://github.com/microsoft/SLiCE.git
cd SLiCE
pip install -e .
Development Installation
For development with all testing and linting tools:
git clone https://github.com/microsoft/SLiCE.git
cd SLiCE
# Using pip
pip install -e ".[dev]"
# Using uv (recommended - faster)
uv sync --extra dev
Quick Start
Python API
from slice import SchemaLineageEvaluator
# Initialize evaluator
evaluator = SchemaLineageEvaluator()
# Example lineage data
predicted = {
"source_schema": "cuisine_type",
"source_table": "restaurants.ss",
"transformation": "R.cuisine_type AS CuisineType",
"aggregation": "COUNT() GROUP BY restaurant_id"
}
ground_truth = {
"source_schema": "cuisine_type",
"source_table": "restaurants.ss",
"transformation": "R.cuisine_type AS CuisineType",
"aggregation": ""
}
# Evaluate
results = evaluator.evaluate(predicted, ground_truth)
print(f"Overall Score: {results['overall']:.4f}")
Command Line Interface
# Basic evaluation
slice-eval predicted.json ground_truth.json
# With custom weights
slice-eval --weights source_table=0.5,transformation=0.3,aggregation=0.2 predicted.json ground_truth.json
# Include metadata evaluation
slice-eval --metadata predicted.json ground_truth.json
# Save results to file
slice-eval predicted.json ground_truth.json --output results.txt
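The commands above take two JSON files as input. A minimal sketch of producing them from Python follows. It assumes each file holds a single lineage dictionary in the package's lineage format, which this README does not state explicitly. `write_lineage_json` is a hypothetical helper and not part of SLiCE.

```python
import json
import tempfile
from pathlib import Path


def write_lineage_json(path, lineage):
    """Write one lineage dictionary to `path` as UTF-8 JSON (hypothetical helper)."""
    Path(path).write_text(json.dumps(lineage, indent=2), encoding="utf-8")


predicted = {
    "source_schema": "cuisine_type",
    "source_table": "restaurants.ss",
    "transformation": "R.cuisine_type AS CuisineType",
    "aggregation": "COUNT() GROUP BY restaurant_id",
}
ground_truth = {
    "source_schema": "cuisine_type",
    "source_table": "restaurants.ss",
    "transformation": "R.cuisine_type AS CuisineType",
    "aggregation": "",
}

# A temporary directory keeps the sketch free of side effects;
# in practice, write next to wherever slice-eval is run.
out_dir = Path(tempfile.mkdtemp())
write_lineage_json(out_dir / "predicted.json", predicted)
write_lineage_json(out_dir / "ground_truth.json", ground_truth)
```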
Data Format
SLiCE expects lineage data as dictionaries with the following structure:
{
"source_schema": "column_name",
"source_table": "table_references",
"transformation": "transformation_logic",
"aggregation": "aggregation_operations",
"metadata": "additional_metadata (optional)"
}
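Checking a record against this structure before evaluation can catch malformed model output early. The sketch below assumes the four non-metadata keys are all required and that every value is a string, which matches the examples in this README but is not confirmed as the package's own validation. `check_lineage` is a hypothetical helper.

```python
# Keys taken from the structure shown above; treating the first four as
# required is an assumption for illustration.
REQUIRED_KEYS = ("source_schema", "source_table", "transformation", "aggregation")
OPTIONAL_KEYS = ("metadata",)


def check_lineage(record):
    """Return a list of problems found in a lineage dictionary (empty if none)."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in record:
            problems.append(f"missing key: {key}")
        elif not isinstance(record[key], str):
            problems.append(f"{key} should be a string")
    for key in record:
        if key not in REQUIRED_KEYS + OPTIONAL_KEYS:
            problems.append(f"unexpected key: {key}")
    return problems


valid = {
    "source_schema": "cuisine_type",
    "source_table": "restaurants.ss",
    "transformation": "R.cuisine_type AS CuisineType",
    "aggregation": "",
}
print(check_lineage(valid))
print(check_lineage({"source_schema": "cuisine_type", "extra": "x"}))
```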
Evaluation Metrics
Component Scores
- Source Schema: Exact match of schema/column names
- Source Table: F1 score + fuzzy matching of table references
- Transformation: BLEU + weighted BLEU + AST similarity
- Aggregation: BLEU + weighted BLEU + AST similarity
- Metadata: BLEU + weighted BLEU + AST similarity (optional)
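To make the source table metrics concrete, here is a rough sketch of a set-based F1 score and a fuzzy string ratio. It is not the package's implementation: the use of `difflib` for fuzzy matching, the treatment of table references as a set of names, and the function names `table_f1` and `fuzzy_ratio` are all assumptions, and how SLiCE combines the two values is not specified here.

```python
from difflib import SequenceMatcher


def table_f1(predicted_tables, gold_tables):
    """F1 between two collections of table names, compared as sets."""
    pred, gold = set(predicted_tables), set(gold_tables)
    if not pred and not gold:
        return 1.0  # nothing expected, nothing predicted
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def fuzzy_ratio(predicted, gold):
    """Character-level similarity in [0, 1] using difflib's SequenceMatcher."""
    return SequenceMatcher(None, predicted, gold).ratio()


# One correct table plus one spurious table: precision 0.5, recall 1.0.
print(table_f1(["restaurants.ss", "orders.ss"], ["restaurants.ss"]))
print(fuzzy_ratio("restaurants.ss", "restaurant.ss"))
```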
Overall Score
The final score combines component scores using configurable weights:
Overall = format_correctness × source_schema × (
w₁ × source_table_score +
w₂ × transformation_score +
w₃ × aggregation_score +
w₄ × metadata_score # if applicable
)
Default weights: source_table=0.4, transformation=0.4, aggregation=0.2
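The formula can be worked through by hand with a few lines of Python. This is a sketch of the arithmetic only, using the default weights listed above. `overall_score` is a hypothetical function, and the component scores passed in are made-up inputs rather than output from the package.

```python
DEFAULT_WEIGHTS = {"source_table": 0.4, "transformation": 0.4, "aggregation": 0.2}


def overall_score(format_correctness, source_schema, component_scores,
                  weights=DEFAULT_WEIGHTS):
    """Combine component scores as in the formula above.

    Only the components named in `weights` contribute to the weighted sum;
    add a "metadata" entry to `weights` to include metadata.
    """
    weighted_sum = sum(weights[name] * component_scores[name] for name in weights)
    return format_correctness * source_schema * weighted_sum


# Illustrative inputs: 0.4 * 0.5 + 0.4 * 0.75 + 0.2 * 1.0
scores = {"source_table": 0.5, "transformation": 0.75, "aggregation": 1.0}
print(round(overall_score(1.0, 1.0, scores), 4))

# The source schema score multiplies the whole sum, so a score of 0 there
# zeroes the overall result regardless of the other components.
print(overall_score(1.0, 0.0, scores))
```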
Configuration
Custom Weights
# Component weights
weights = {
'source_table': 0.5,
'transformation': 0.3,
'aggregation': 0.2
}
# Metric weights for transformations
transformation_weights = {
'bleu': 0.6,
'weighted_bleu': 0.3,
'ast': 0.1
}
evaluator = SchemaLineageEvaluator(
weights=weights,
transformation_weights=transformation_weights
)
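Both weight dictionaries in this example sum to 1. Whether the evaluator requires that is not stated here, so if weights come from user input it may be worth rescaling them first. `normalize_weights` below is a hypothetical helper, not part of the SLiCE API.

```python
def normalize_weights(weights):
    """Rescale non-negative weights so they sum to 1, keeping their ratios."""
    if any(value < 0 for value in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {name: value / total for name, value in weights.items()}


# Relative importance given as 5 : 3 : 2.
raw = {"source_table": 5, "transformation": 3, "aggregation": 2}
weights = normalize_weights(raw)
print(weights)
```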
Language Support
# Custom syntax and operators
evaluator = SchemaLineageEvaluator(
sql_syntax={'SELECT', 'FROM', 'WHERE'},
python_syntax={'def', 'class', 'import'},
csharp_syntax={'using', 'namespace', 'class'}
)
Examples
See the examples/ directory for complete usage examples:
- basic_usage.py: Basic evaluation with default settings
- custom_weights.py: Using custom weights and configurations
- batch_evaluation.py: Processing multiple lineage pairs
- sample_data_usage.py: Using package sample data for evaluation
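For a rough idea of what evaluating many lineage pairs can look like without the package's own batch interface (which is not documented in this README), the sketch below maps any `evaluate(predicted, ground_truth)` callable over a list of pairs with a thread pool. `evaluate_pairs` and the stand-in `exact_match` scorer are hypothetical; with SLiCE installed, `evaluator.evaluate` from the Quick Start could be passed instead.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_pairs(evaluate, pairs, max_workers=4):
    """Apply evaluate(predicted, ground_truth) to each pair, preserving order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda pair: evaluate(*pair), pairs))


def exact_match(predicted, ground_truth):
    """Stand-in scorer so the sketch runs without SLiCE installed."""
    return {"overall": 1.0 if predicted == ground_truth else 0.0}


pairs = [
    ({"source_schema": "a"}, {"source_schema": "a"}),
    ({"source_schema": "a"}, {"source_schema": "b"}),
]
results = evaluate_pairs(exact_match, pairs)
print([r["overall"] for r in results])
```

Threads suit this sketch because it needs no pickling of the evaluator. For CPU-bound scoring, a process pool may be the better fit.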
Testing
Run the test suite:
# Using pip
pip install -e ".[dev]"
pytest
# Using uv (recommended)
uv sync --extra dev
uv run pytest
# Run with coverage
uv run pytest --cov=slice
# Run specific test file
uv run pytest tests/test_schema_lineage_evaluator.py -v
# Code quality checks
uv run black slice/ tests/ # Format code
uv run flake8 slice/ # Lint code
uv run mypy slice/ # Type checking
Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes
- Add tests for your changes
- Run the test suite (pytest)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Citation
If you use SLiCE in your research, please cite:
@software{slice2025,
title={SLiCE: Schema Lineage Composite Evaluation},
author={Jiaqi Yin and Yi-Wei Chen and Meng-Lung Lee and Xiya Liu},
year={2025},
url={https://github.com/microsoft/SLiCE}
}
@misc{yin2025schemalineageextractionscale,
title={Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks},
author={Jiaqi Yin and Yi-Wei Chen and Meng-Lung Lee and Xiya Liu},
year={2025},
eprint={2508.07179},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.07179},
}
Support
- Documentation: [Link to documentation]
- Issues: GitHub Issues
- Discussions: GitHub Discussions