Skip to main content

SABER: A SQL-Compatible Semantic Document Processing System Based on Extended Relational Algebra

Project description

SABER: A SQL-Compatible Semantic Document Processing System Based on Extended Relational Algebra

SABER is a research system that integrates multiple semantic document processing frameworks (LOTUS, DocETL, Palimpzest) with a unified SQL-compatible interface.

Installation

Development Installation

git clone https://github.com/xlab-ub/saber.git
cd saber
pip install -e .[all]

Handling Dependency Conflicts

If you encounter dependency conflicts:

Use conda environments and force installation scripts:

conda create -n saber python=3.12 -y
conda activate saber
git clone https://github.com/xlab-ub/saber.git
cd saber
./scripts/install_all_force.sh

Running Examples

Semantic Operations Examples

Semantic WHERE

Filter rows based on semantic conditions rather than exact matches.

python examples/semantic_ops_examples/semantic_where.py

Semantic SELECT

Extract and transform columns using semantic understanding and natural language instructions.

python examples/semantic_ops_examples/semantic_select.py

Semantic JOIN

Join tables based on semantic relationships rather than exact key matches.

python examples/semantic_ops_examples/semantic_join.py

Semantic GROUP BY

Group records by semantic similarity or conceptual categories.

python examples/semantic_ops_examples/semantic_group_by.py

Semantic AGGREGATION

Perform aggregations with semantic understanding of the data.

python examples/semantic_ops_examples/semantic_aggregation.py

Semantic ORDER BY

Sort results based on semantic criteria like relevance, similarity, or conceptual ordering.

python examples/semantic_ops_examples/semantic_order_by.py

Semantic DISTINCT

Remove duplicates based on semantic similarity rather than exact matches.

python examples/semantic_ops_examples/semantic_distinct.py

Semantic INTERSECT (ALL) and EXCEPT (ALL)

Perform semantic (INTERSECT, EXCEPT) operations based on semantic relationships.

# Semantic INTERSECT - Find semantically overlapping records
python examples/semantic_ops_examples/semantic_intersect.py
python examples/semantic_ops_examples/semantic_intersect_all.py

# Semantic EXCEPT - Find semantically different records  
python examples/semantic_ops_examples/semantic_except.py
python examples/semantic_ops_examples/semantic_except_all.py

Unified Query Examples

Backend-Agnostic Semantic Query Rewriting

Demonstrates how SABER automatically rewrites backend-free semantic queries to work with different Semantic Data Processing Systems (LOTUS, DocETL, Palimpzest) without requiring users to modify their code.

python examples/unified_query_examples/unified_query.py

Citation

If you find this code useful, please consider citing our paper:

@misc{lee2025sabersqlcompatiblesemanticdocument,
      title={SABER: A SQL-Compatible Semantic Document Processing System Based on Extended Relational Algebra}, 
      author={Changjae Lee and Zhuoyue Zhao and Jinjun Xiong},
      year={2025},
      eprint={2509.00277},
      archivePrefix={arXiv},
      primaryClass={cs.DB},
      url={https://arxiv.org/abs/2509.00277}, 
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

saber_query-0.2.0.tar.gz (36.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

saber_query-0.2.0-py3-none-any.whl (35.6 kB view details)

Uploaded Python 3

File details

Details for the file saber_query-0.2.0.tar.gz.

File metadata

  • Download URL: saber_query-0.2.0.tar.gz
  • Upload date:
  • Size: 36.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.2

File hashes

Hashes for saber_query-0.2.0.tar.gz
Algorithm Hash digest
SHA256 713baeb5d6e1d363ae136c9abdcaaee1818ec53822e12d0aa22434a7c9bb11ce
MD5 ef43eaa23fe0b120bf97567be4da43c8
BLAKE2b-256 a2db1680ec6689985461c7e8fa594ac7a9ada32b16ff121d6f22f910853a39cd

See more details on using hashes here.

File details

Details for the file saber_query-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: saber_query-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 35.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.2

File hashes

Hashes for saber_query-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 4d2cba5ac38df1f26ce8a12f22d8191bc695c8efbecdae32806b23b5823eda94
MD5 25344e56e20f72248e6fa8846d36de37
BLAKE2b-256 ef72ac6ba3c1afddf296670598a5d559354f70aa9adad5f444865877c50b5a14

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page