SABER: A SQL-Compatible Semantic Document Processing System Based on Extended Relational Algebra

SABER is a research system that integrates multiple semantic document processing frameworks (LOTUS, DocETL, Palimpzest) with a unified SQL-compatible interface.

Installation

Development Installation

git clone https://github.com/xlab-ub/saber.git
cd saber
pip install -e ".[all]"

Handling Dependency Conflicts

If you encounter dependency conflicts, create a fresh conda environment and run the force-installation script:

conda create -n saber python=3.12 -y
conda activate saber
git clone https://github.com/xlab-ub/saber.git
cd saber
./scripts/install_all_force.sh
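After either installation route, a quick check that the package is visible to the active interpreter can save some debugging. The sketch below uses only the standard library; the distribution name `saber_query` is an assumption based on the published file names and may differ for a development install.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "saber_query" is an assumed distribution name, not confirmed by the README text.
version = installed_version("saber_query")
print(version if version else "saber_query is not installed in this environment")
```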

Running Examples

Semantic Operations Examples

Semantic WHERE

Filter rows based on semantic conditions rather than exact matches.

python examples/semantic_ops_examples/semantic_where.py
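To make the idea concrete, here is a minimal conceptual sketch of a semantic filter in plain Python. It is not SABER's API: `semantic_where` and `toy_judge` are hypothetical names, and the keyword lookup merely stands in for the language-model judgment a real backend would make.

```python
from typing import Callable

Row = dict[str, str]


def toy_judge(text: str, condition: str) -> bool:
    """Hypothetical stand-in for a model call: match phrases tied to the condition."""
    concept_terms = {"mentions a refund": ("refund", "money back", "reimburse")}
    lowered = text.lower()
    return any(term in lowered for term in concept_terms.get(condition, ()))


def semantic_where(
    rows: list[Row], column: str, condition: str, judge: Callable[[str, str], bool]
) -> list[Row]:
    """Keep the rows whose column value satisfies the natural-language condition."""
    return [row for row in rows if judge(row[column], condition)]


reviews = [
    {"id": "1", "text": "I want my money back, the charger broke."},
    {"id": "2", "text": "Fast shipping and great battery life."},
    {"id": "3", "text": "Please reimburse me for the damaged item."},
]
matches = semantic_where(reviews, "text", "mentions a refund", toy_judge)
print([row["id"] for row in matches])  # ['1', '3']
```

Neither matching review contains the word "refund", so an exact-match predicate on that word would return nothing here.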

Semantic SELECT

Extract and transform columns using semantic understanding and natural language instructions.

python examples/semantic_ops_examples/semantic_select.py

Semantic JOIN

Join tables based on semantic relationships rather than exact key matches.

python examples/semantic_ops_examples/semantic_join.py
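Conceptually, a semantic join replaces the key-equality test of an ordinary join with a pairwise "are these related?" judgment. The sketch below is a plain-Python illustration under that assumption, not SABER's implementation: `semantic_join` and `acronym_match` are hypothetical, and the acronym check stands in for a model-based comparison.

```python
from typing import Callable

Row = dict[str, str]


def acronym_match(short: str, long: str) -> bool:
    """Toy relatedness test: `short` equals `long` or is its acronym."""
    initials = "".join(word[0] for word in long.split()).upper()
    return short.upper() == initials or short.lower() == long.lower()


def semantic_join(
    left: list[Row],
    right: list[Row],
    left_col: str,
    right_col: str,
    related: Callable[[str, str], bool],
) -> list[Row]:
    """Nested-loop join that pairs rows whenever the predicate says they relate."""
    return [
        {**l_row, **r_row}
        for l_row in left
        for r_row in right
        if related(l_row[left_col], r_row[right_col])
    ]


abbreviations = [{"abbr": "IBM"}, {"abbr": "AMD"}, {"abbr": "XYZ"}]
companies = [
    {"company": "International Business Machines"},
    {"company": "Advanced Micro Devices"},
]
joined = semantic_join(abbreviations, companies, "abbr", "company", acronym_match)
print(joined)
```

The nested loop evaluates the predicate on every pair, which is why systems in this space typically care about reducing the number of model calls.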

Semantic GROUP BY

Group records by semantic similarity or conceptual categories.

python examples/semantic_ops_examples/semantic_group_by.py

Semantic AGGREGATION

Perform aggregations with semantic understanding of the data.

python examples/semantic_ops_examples/semantic_aggregation.py

Semantic ORDER BY

Sort results based on semantic criteria like relevance, similarity, or conceptual ordering.

python examples/semantic_ops_examples/semantic_order_by.py
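One way to picture semantic ordering is as a sort whose key is a relevance score computed from a natural-language criterion. The following sketch assumes that reading; `semantic_order_by` and `toy_relevance` are hypothetical names, and the word-overlap score stands in for a model-derived score.

```python
import re
from typing import Callable

Row = dict[str, str]


def toy_relevance(text: str, criterion: str) -> int:
    """Hypothetical scorer: count distinct criterion words that occur in the text."""
    text_words = set(re.findall(r"[a-z0-9]+", text.lower()))
    criterion_words = set(re.findall(r"[a-z0-9]+", criterion.lower()))
    return len(text_words & criterion_words)


def semantic_order_by(
    rows: list[Row],
    column: str,
    criterion: str,
    score: Callable[[str, str], int],
    descending: bool = True,
) -> list[Row]:
    """Sort rows by how well the column value matches the criterion."""
    return sorted(rows, key=lambda row: score(row[column], criterion), reverse=descending)


docs = [
    {"id": "a", "title": "Cooking pasta at home"},
    {"id": "b", "title": "Query optimization in relational databases"},
    {"id": "c", "title": "An introduction to relational algebra"},
]
ranked = semantic_order_by(docs, "title", "relational query optimization", toy_relevance)
print([row["id"] for row in ranked])  # ['b', 'c', 'a']
```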

Semantic DISTINCT

Remove duplicates based on semantic similarity rather than exact matches.

python examples/semantic_ops_examples/semantic_distinct.py
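As a rough illustration of the difference from exact deduplication, the sketch below keeps the first representative of each group of equivalent values. It is a conceptual example only: `semantic_distinct` and `same_meaning` are hypothetical, and comparing word sets is a crude stand-in for a semantic similarity judgment.

```python
import re
from typing import Callable


def same_meaning(a: str, b: str) -> bool:
    """Toy equivalence: same words, ignoring case, order, and punctuation."""
    def words(text: str) -> frozenset:
        return frozenset(re.findall(r"[a-z0-9]+", text.lower()))

    return words(a) == words(b)


def semantic_distinct(values: list[str], equivalent: Callable[[str, str], bool]) -> list[str]:
    """Keep a value only if it is not equivalent to one that was already kept."""
    kept: list[str] = []
    for value in values:
        if not any(equivalent(value, seen) for seen in kept):
            kept.append(value)
    return kept


topics = ["Machine learning", "machine learning!", "Learning, machine", "Deep learning"]
unique_topics = semantic_distinct(topics, same_meaning)
print(unique_topics)  # ['Machine learning', 'Deep learning']
```

An exact-match DISTINCT would keep all four strings, since no two of them are byte-for-byte identical.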

Semantic INTERSECT (ALL) and EXCEPT (ALL)

Perform set operations (INTERSECT, EXCEPT, and their ALL variants) based on semantic relationships rather than exact row equality.

# Semantic INTERSECT - Find semantically overlapping records
python examples/semantic_ops_examples/semantic_intersect.py
python examples/semantic_ops_examples/semantic_intersect_all.py

# Semantic EXCEPT - Find semantically different records
python examples/semantic_ops_examples/semantic_except.py
python examples/semantic_ops_examples/semantic_except_all.py

Unified Query Examples

Backend-Agnostic Semantic Query Rewriting

Demonstrates how SABER automatically rewrites backend-agnostic semantic queries so that they run on different semantic data processing systems (LOTUS, DocETL, Palimpzest) without requiring users to modify their code.

python examples/unified_query_examples/unified_query.py

Citation

If you find this code useful, please consider citing our paper:

@misc{lee2025sabersqlcompatiblesemanticdocument,
      title={SABER: A SQL-Compatible Semantic Document Processing System Based on Extended Relational Algebra}, 
      author={Changjae Lee and Zhuoyue Zhao and Jinjun Xiong},
      year={2025},
      eprint={2509.00277},
      archivePrefix={arXiv},
      primaryClass={cs.DB},
      url={https://arxiv.org/abs/2509.00277}, 
}
