SABER: A SQL-Compatible Semantic Document Processing System Based on Extended Relational Algebra

SABER is a research system that integrates multiple semantic document processing frameworks (LOTUS, DocETL, Palimpzest) with a unified SQL-compatible interface.

Installation

Development Installation

git clone https://github.com/xlab-ub/saber.git
cd saber
pip install -e ".[all]"

Handling Dependency Conflicts

If you encounter dependency conflicts, create a fresh conda environment and run the force-installation script:

conda create -n saber python=3.12 -y
conda activate saber
git clone https://github.com/xlab-ub/saber.git
cd saber
./scripts/install_all_force.sh

Running Examples

Semantic Operations Examples

Semantic WHERE

Filter rows based on semantic conditions rather than exact matches.

python examples/semantic_ops_examples/semantic_where.py
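To make the idea concrete, here is a minimal, self-contained sketch of a semantic filter. It does not use SABER's API: `semantic_where` and `keyword_judge` are hypothetical names, and the keyword check is only a stand-in for the model call that a real semantic backend would make.

```python
from typing import Callable, Iterable

Row = dict[str, str]


def semantic_where(
    rows: Iterable[Row],
    column: str,
    condition: str,
    judge: Callable[[str, str], bool],
) -> list[Row]:
    """Keep rows whose `column` text satisfies the natural-language `condition`."""
    return [row for row in rows if judge(row[column], condition)]


def keyword_judge(text: str, condition: str) -> bool:
    # Hypothetical stand-in for a model call. It ignores `condition` and
    # approximates "the review is positive" with a fixed keyword list.
    positive_words = {"great", "excellent", "love"}
    return any(word in text.lower() for word in positive_words)


reviews = [
    {"id": "1", "review": "Great battery life, love it"},
    {"id": "2", "review": "Stopped working after a week"},
]
kept = semantic_where(reviews, "review", "the review is positive", keyword_judge)
print([row["id"] for row in kept])
```

The filter logic stays the same whichever judge is plugged in; only the judge decides what "satisfies the condition" means.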

Semantic SELECT

Extract and transform columns using semantic understanding and natural language instructions.

python examples/semantic_ops_examples/semantic_select.py

Semantic JOIN

Join tables based on semantic relationships rather than exact key matches.

python examples/semantic_ops_examples/semantic_join.py
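As a rough illustration of joining on meaning rather than on keys, the sketch below runs a nested-loop join with a pluggable match function. It is not SABER's implementation: `semantic_join` and `shares_topic` are hypothetical, and the substring test stands in for a model-based relatedness check.

```python
from typing import Callable

Row = dict[str, str]


def semantic_join(
    left: list[Row],
    right: list[Row],
    matches: Callable[[Row, Row], bool],
) -> list[Row]:
    """Emit a merged row for every (left, right) pair the judge accepts."""
    return [{**l, **r} for l in left for r in right if matches(l, r)]


def shares_topic(paper: Row, venue: Row) -> bool:
    # Hypothetical stand-in for a model call that decides whether
    # the paper belongs to the venue's topic area.
    return venue["topic"].lower() in paper["abstract"].lower()


papers = [
    {"title": "Query rewriting for LLM operators",
     "abstract": "We study databases and query optimization."},
    {"title": "Protein folding at scale",
     "abstract": "A biology pipeline for structure prediction."},
]
venues = [
    {"venue": "VLDB", "topic": "databases"},
    {"venue": "ISMB", "topic": "biology"},
]
joined = semantic_join(papers, venues, shares_topic)
print([(row["title"], row["venue"]) for row in joined])
```

A nested loop evaluates the judge once per pair, so the number of judge calls grows with the product of the two table sizes.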

Semantic GROUP BY

Group records by semantic similarity or conceptual categories.

python examples/semantic_ops_examples/semantic_group_by.py

Semantic AGGREGATION

Perform aggregations with semantic understanding of the data.

python examples/semantic_ops_examples/semantic_aggregation.py

Semantic ORDER BY

Sort results based on semantic criteria like relevance, similarity, or conceptual ordering.

python examples/semantic_ops_examples/semantic_order_by.py

Semantic DISTINCT

Remove duplicates based on semantic similarity rather than exact matches.

python examples/semantic_ops_examples/semantic_distinct.py
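The following sketch shows one simple way to deduplicate by similarity: keep a value only if it is not too close to anything already kept. It is an illustration under stated assumptions, not SABER's method. Token-set Jaccard overlap stands in for semantic similarity, and `semantic_distinct` and the `0.6` threshold are hypothetical choices.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric word tokens of `text`."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap in [0, 1]; a crude stand-in for semantic similarity."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def semantic_distinct(values: list[str], threshold: float = 0.6) -> list[str]:
    """Greedy dedup: keep a value only if it is below `threshold`
    similarity to every value kept so far."""
    kept: list[str] = []
    for value in values:
        if all(jaccard(value, seen) < threshold for seen in kept):
            kept.append(value)
    return kept


names = ["New York City", "new york city, NY", "Los Angeles", "NEW YORK CITY"]
print(semantic_distinct(names))
```

Because the pass is greedy, the first occurrence of each near-duplicate cluster is the one that survives, and the result depends on input order.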

Semantic INTERSECT (ALL) and EXCEPT (ALL)

Perform the set operations INTERSECT and EXCEPT (and their ALL variants) based on semantic relationships rather than exact row equality.

# Semantic INTERSECT - Find semantically overlapping records
python examples/semantic_ops_examples/semantic_intersect.py
python examples/semantic_ops_examples/semantic_intersect_all.py

# Semantic EXCEPT - Find semantically different records
python examples/semantic_ops_examples/semantic_except.py
python examples/semantic_ops_examples/semantic_except_all.py

Unified Query Examples

Backend-Agnostic Semantic Query Rewriting

Demonstrates how SABER automatically rewrites backend-agnostic semantic queries to run on different semantic data processing systems (LOTUS, DocETL, Palimpzest) without requiring users to modify their code.

python examples/unified_query_examples/unified_query.py
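The general pattern behind a backend-agnostic interface can be sketched as one logical operation that is translated into each backend's calling convention. Everything below is hypothetical and unrelated to the real LOTUS, DocETL, or Palimpzest APIs: `toy_a_filter` and `toy_b_mask` are two made-up backends with different shapes, and `run_where` is an assumed dispatcher.

```python
from typing import Callable

Row = dict[str, str]


def toy_a_filter(rows: list[Row], predicate: Callable[[Row], bool]) -> list[Row]:
    """Made-up backend A: row-at-a-time filtering with a predicate."""
    return [row for row in rows if predicate(row)]


def toy_b_mask(texts: list[str], keyword: str) -> list[bool]:
    """Made-up backend B: takes a column of strings and returns a boolean mask."""
    return [keyword in text.lower() for text in texts]


def run_where(rows: list[Row], column: str, keyword: str, backend: str) -> list[Row]:
    """Translate one logical filter into the chosen backend's call shape."""
    if backend == "toy_a":
        return toy_a_filter(rows, lambda row: keyword in row[column].lower())
    if backend == "toy_b":
        mask = toy_b_mask([row[column] for row in rows], keyword)
        return [row for row, keep in zip(rows, mask) if keep]
    raise ValueError(f"unknown backend: {backend}")


docs = [
    {"id": "1", "text": "Customer asks for a refund"},
    {"id": "2", "text": "Shipping address update"},
]
for name in ("toy_a", "toy_b"):
    print(name, [row["id"] for row in run_where(docs, "text", "refund", name)])
```

The caller's code is identical for both backends apart from the backend name, which is the property the unified query example is meant to demonstrate.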

Citation

If you find this code useful, please consider citing our paper:

@misc{lee2025sabersqlcompatiblesemanticdocument,
      title={SABER: A SQL-Compatible Semantic Document Processing System Based on Extended Relational Algebra}, 
      author={Changjae Lee and Zhuoyue Zhao and Jinjun Xiong},
      year={2025},
      eprint={2509.00277},
      archivePrefix={arXiv},
      primaryClass={cs.DB},
      url={https://arxiv.org/abs/2509.00277}, 
}
