SABER: A SQL-Compatible Semantic Document Processing System Based on Extended Relational Algebra

SABER is a research system that integrates multiple semantic document processing frameworks (LOTUS, DocETL, Palimpzest) with a unified SQL-compatible interface.

Installation

Development Installation

git clone https://github.com/xlab-ub/saber.git
cd saber
pip install -e ".[all]"

Handling Dependency Conflicts

If you encounter dependency conflicts, create a fresh conda environment and run the force installation script:

conda create -n saber python=3.12 -y
conda activate saber
git clone https://github.com/xlab-ub/saber.git
cd saber
./scripts/install_all_force.sh

Running Examples

Semantic Operations Examples

Semantic WHERE

Filter rows based on semantic conditions rather than exact matches.

python examples/semantic_ops_examples/semantic_where.py
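
To make the idea concrete, here is a minimal sketch of a semantic filter in plain Python. It does not use SABER's API: `semantic_where` and `toy_judge` are hypothetical names, and the keyword check only stands in for the language-model call a real backend would make.

```python
from typing import Callable, Iterable

Row = dict[str, str]


def semantic_where(rows: Iterable[Row], column: str,
                   judge: Callable[[str], bool]) -> list[Row]:
    """Keep the rows whose `column` text satisfies the judge."""
    return [row for row in rows if judge(row[column])]


def toy_judge(text: str) -> bool:
    # Assumption: stand-in for a model answering "is this review positive?".
    positive_cues = ("love", "great", "excellent")
    return any(cue in text.lower() for cue in positive_cues)


reviews = [
    {"id": "1", "review": "I love this keyboard"},
    {"id": "2", "review": "Stopped working after a week"},
    {"id": "3", "review": "Great battery life"},
]

kept = semantic_where(reviews, "review", toy_judge)
print([row["id"] for row in kept])  # ['1', '3']
```

The filter takes the judge as a parameter, so the keyword stub can be swapped for a real model call without changing the filtering logic.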

Semantic SELECT

Extract and transform columns using semantic understanding and natural language instructions.

python examples/semantic_ops_examples/semantic_select.py

Semantic JOIN

Join tables based on semantic relationships rather than exact key matches.

python examples/semantic_ops_examples/semantic_join.py
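
As an illustration only, a semantic join can be pictured as a nested-loop join whose match condition is a judgment call instead of key equality. The sketch below is not SABER's implementation; `semantic_join`, `toy_matcher`, and the `cue` column are hypothetical, and the substring test stands in for a model deciding whether a paper belongs to a topic.

```python
from typing import Any, Callable

Row = dict[str, Any]


def semantic_join(left: list[Row], right: list[Row],
                  matcher: Callable[[Row, Row], bool]) -> list[Row]:
    """Nested-loop join that pairs rows whenever the matcher accepts them."""
    return [{**lrow, **rrow} for lrow in left for rrow in right
            if matcher(lrow, rrow)]


def toy_matcher(paper: Row, topic: Row) -> bool:
    # Assumption: a substring check replaces the model's relatedness judgment.
    return topic["cue"] in paper["abstract"].lower()


papers = [
    {"paper_id": 1, "abstract": "We index vectors for fast nearest-neighbor search."},
    {"paper_id": 2, "abstract": "A study of transaction isolation levels."},
]
topics = [
    {"topic": "vector search", "cue": "nearest-neighbor"},
    {"topic": "concurrency control", "cue": "isolation"},
]

joined = semantic_join(papers, topics, toy_matcher)
print([(row["paper_id"], row["topic"]) for row in joined])
# [(1, 'vector search'), (2, 'concurrency control')]
```

A nested loop evaluates the matcher on every pair of rows, so the number of judgments grows with the product of the two table sizes.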

Semantic GROUP BY

Group records by semantic similarity or conceptual categories.

python examples/semantic_ops_examples/semantic_group_by.py
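
One way to picture semantic grouping is to assign each record a conceptual category and then bucket by that label. This is a sketch under assumptions, not SABER's API: `semantic_group_by` and `toy_categorize` are hypothetical, and the keyword rules stand in for a model-based classifier.

```python
from collections import defaultdict
from typing import Any, Callable

Row = dict[str, Any]


def semantic_group_by(rows: list[Row], column: str,
                      categorize: Callable[[str], str]) -> dict[str, list[Row]]:
    """Bucket rows by the category assigned to their `column` text."""
    groups: dict[str, list[Row]] = defaultdict(list)
    for row in rows:
        groups[categorize(row[column])].append(row)
    return dict(groups)


def toy_categorize(text: str) -> str:
    # Assumption: keyword rules replace a model choosing the category.
    text = text.lower()
    if "refund" in text or "charge" in text:
        return "billing"
    if "crash" in text or "error" in text:
        return "bug"
    return "other"


tickets = [
    {"id": 1, "text": "App crashes on login"},
    {"id": 2, "text": "I was charged twice"},
    {"id": 3, "text": "Please issue a refund"},
    {"id": 4, "text": "How do I export data?"},
]

groups = semantic_group_by(tickets, "text", toy_categorize)
counts = {category: len(members) for category, members in groups.items()}
print(counts)  # {'bug': 1, 'billing': 2, 'other': 1}
```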

Semantic AGGREGATION

Perform aggregations with semantic understanding of the data.

python examples/semantic_ops_examples/semantic_aggregation.py

Semantic ORDER BY

Sort results based on semantic criteria like relevance, similarity, or conceptual ordering.

python examples/semantic_ops_examples/semantic_order_by.py

Semantic DISTINCT

Remove duplicates based on semantic similarity rather than exact matches.

python examples/semantic_ops_examples/semantic_distinct.py
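
A rough sketch of semantic deduplication: keep the first value seen for each group of values judged to mean the same thing. The names `semantic_distinct` and `toy_same` are hypothetical, and the string normalization below is only a stand-in for an embedding-based or model-based similarity check.

```python
from typing import Callable


def semantic_distinct(values: list[str],
                      same: Callable[[str, str], bool]) -> list[str]:
    """Keep the first representative of each group of equivalent values."""
    kept: list[str] = []
    for value in values:
        if not any(same(value, seen) for seen in kept):
            kept.append(value)
    return kept


def toy_same(a: str, b: str) -> bool:
    # Assumption: normalizing case, punctuation, and company suffixes
    # replaces a real semantic similarity judgment.
    def norm(s: str) -> str:
        s = s.lower().replace(".", "").replace(",", "")
        suffixes = {"inc", "corp", "corporation"}
        return " ".join(word for word in s.split() if word not in suffixes)

    return norm(a) == norm(b)


companies = ["Acme Inc.", "ACME", "Globex Corporation", "globex corp", "Initech"]
unique = semantic_distinct(companies, toy_same)
print(unique)  # ['Acme Inc.', 'Globex Corporation', 'Initech']
```

The result depends on input order, because the first value seen becomes the representative of its group.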

Semantic INTERSECT (ALL) and EXCEPT (ALL)

Perform set operations (INTERSECT, EXCEPT, and their ALL variants) based on semantic relationships rather than exact row equality.

# Semantic INTERSECT - Find semantically overlapping records
python examples/semantic_ops_examples/semantic_intersect.py
python examples/semantic_ops_examples/semantic_intersect_all.py

# Semantic EXCEPT - Find records in the first input with no semantic match in the second
python examples/semantic_ops_examples/semantic_except.py
python examples/semantic_ops_examples/semantic_except_all.py
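
The difference between the plain and ALL variants can be sketched by carrying over standard SQL multiset semantics and replacing row equality with a semantic match. This is an assumption about the intended behavior, not a description of SABER's implementation. All function names are hypothetical, and the synonym table stands in for a model deciding whether two values refer to the same thing.

```python
from typing import Callable

Same = Callable[[str, str], bool]


def _distinct(values: list[str], same: Same) -> list[str]:
    kept: list[str] = []
    for value in values:
        if not any(same(value, seen) for seen in kept):
            kept.append(value)
    return kept


def semantic_intersect(left: list[str], right: list[str], same: Same) -> list[str]:
    """Distinct left values that match at least one right value."""
    return _distinct([v for v in left if any(same(v, r) for r in right)], same)


def semantic_except(left: list[str], right: list[str], same: Same) -> list[str]:
    """Distinct left values that match no right value."""
    return _distinct([v for v in left if not any(same(v, r) for r in right)], same)


def semantic_intersect_all(left: list[str], right: list[str], same: Same) -> list[str]:
    """Multiset version: each right value can pair with at most one left value."""
    remaining, out = list(right), []
    for value in left:
        for i, other in enumerate(remaining):
            if same(value, other):
                out.append(value)
                del remaining[i]
                break
    return out


def semantic_except_all(left: list[str], right: list[str], same: Same) -> list[str]:
    """Multiset version: each right value cancels at most one left value."""
    remaining, out = list(right), []
    for value in left:
        for i, other in enumerate(remaining):
            if same(value, other):
                del remaining[i]
                break
        else:
            out.append(value)  # no right value left to cancel this one
    return out


# Assumption: a synonym table replaces a real semantic equivalence check.
_CANON = {"nyc": "new york", "new york city": "new york"}


def toy_same(a: str, b: str) -> bool:
    return _CANON.get(a.lower(), a.lower()) == _CANON.get(b.lower(), b.lower())


left = ["NYC", "New York", "Boston", "Chicago"]
right = ["new york city", "chicago"]

print(semantic_intersect(left, right, toy_same))      # ['NYC', 'Chicago']
print(semantic_intersect_all(left, right, toy_same))  # ['NYC', 'Chicago']
print(semantic_except(left, right, toy_same))         # ['Boston']
print(semantic_except_all(left, right, toy_same))     # ['New York', 'Boston']
```

In this data the two EXCEPT variants differ: "NYC" and "New York" both match the single "new york city" on the right, so the ALL variant cancels only one of them, while the plain variant removes both.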

Unified Query Examples

Backend-Agnostic Semantic Query Rewriting

Demonstrates how SABER automatically rewrites backend-free semantic queries for different semantic data processing systems (LOTUS, DocETL, Palimpzest), so users do not need to modify their code when switching backends.

python examples/unified_query_examples/unified_query.py
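
The general shape of such a rewrite can be sketched as a text transformation from a backend-neutral marker to a backend-specific one. Everything in this sketch is hypothetical: `SEM_WHERE` and the `LOTUS_SEM_WHERE`-style output names are placeholders invented for illustration, not SABER's actual query syntax or any backend's real API. See `unified_query.py` for the real interface.

```python
import re

BACKENDS = ("lotus", "docetl", "palimpzest")

# Hypothetical marker for a backend-free semantic predicate (not SABER syntax).
_MARKER = re.compile(r"\bSEM_WHERE\(")


def rewrite_for_backend(query: str, backend: str) -> str:
    """Replace the neutral marker with a placeholder backend-specific one."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend: {backend}")
    return _MARKER.sub(f"{backend.upper()}_SEM_WHERE(", query)


query = "SELECT title FROM papers WHERE SEM_WHERE('is about databases')"
for backend in BACKENDS:
    print(rewrite_for_backend(query, backend))
```

The user-facing query string stays the same for every backend, and only the rewriting step knows about the target.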

Citation

If you find this code useful, please consider citing our paper:

@misc{lee2025sabersqlcompatiblesemanticdocument,
      title={SABER: A SQL-Compatible Semantic Document Processing System Based on Extended Relational Algebra}, 
      author={Changjae Lee and Zhuoyue Zhao and Jinjun Xiong},
      year={2025},
      eprint={2509.00277},
      archivePrefix={arXiv},
      primaryClass={cs.DB},
      url={https://arxiv.org/abs/2509.00277}, 
}