SABER: A SQL-Compatible Semantic Document Processing System Based on Extended Relational Algebra
SABER is a research system that exposes multiple semantic document processing frameworks (LOTUS, DocETL, Palimpzest) through a unified SQL-compatible interface.
Installation
Development Installation
git clone https://github.com/xlab-ub/saber.git
cd saber
pip install -e ".[all]"
Handling Dependency Conflicts
If you encounter dependency conflicts, create a fresh conda environment and run the force-installation script:
conda create -n saber python=3.12 -y
conda activate saber
git clone https://github.com/xlab-ub/saber.git
cd saber
./scripts/install_all_force.sh
Running Examples
Semantic Operations Examples
Semantic WHERE
Filter rows based on semantic conditions rather than exact matches.
python examples/semantic_ops_examples/semantic_where.py
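To make the idea concrete, here is a minimal sketch in plain Python. It does not use SABER's API: `semantic_where` and `toy_judge` are hypothetical names, and `toy_judge` is a keyword lookup standing in for the language-model judgment a real backend would make, so the example runs offline.

```python
from typing import Callable

Row = dict[str, str]

def toy_judge(text: str, condition: str) -> bool:
    """Hypothetical stand-in for an LLM yes/no judgment (keyword lookup only)."""
    cues = {"is about a pet": ("dog", "cat", "puppy", "kitten")}
    lowered = text.lower()
    return any(word in lowered for word in cues[condition])

def semantic_where(rows: list[Row], column: str, condition: str,
                   judge: Callable[[str, str], bool] = toy_judge) -> list[Row]:
    """Keep the rows whose `column` satisfies a natural-language condition."""
    return [row for row in rows if judge(row[column], condition)]

reviews = [
    {"id": "1", "text": "My puppy loves this chew toy."},
    {"id": "2", "text": "The laptop battery died in a week."},
    {"id": "3", "text": "Our cat ignores the scratching post."},
]
pet_reviews = semantic_where(reviews, "text", "is about a pet")
print([row["id"] for row in pet_reviews])  # ['1', '3']
```

An exact-match filter such as `text = 'pet'` would return nothing here, because none of the rows contains the word "pet".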
Semantic SELECT
Extract and transform columns using semantic understanding and natural language instructions.
python examples/semantic_ops_examples/semantic_select.py
Semantic JOIN
Join tables based on semantic relationships rather than exact key matches.
python examples/semantic_ops_examples/semantic_join.py
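The difference from a key-based join can be sketched as a nested-loop join whose match condition is a judgment over a pair of values. This is an illustration under stated assumptions, not SABER's implementation: `semantic_join` is a hypothetical helper, and `toy_match` replaces the language-model call with a small hand-written relation.

```python
from itertools import product

def toy_match(skill: str, title: str) -> bool:
    """Hypothetical stand-in for an LLM judgment: does the skill fit the job?"""
    related = {("python", "backend engineer"), ("figma", "ux designer")}
    return (skill.lower(), title.lower()) in related

def semantic_join(left_rows, right_rows, left_col, right_col, match=toy_match):
    """Nested-loop join: emit a merged row for every pair the matcher accepts."""
    return [
        {**left, **right}
        for left, right in product(left_rows, right_rows)
        if match(left[left_col], right[right_col])
    ]

candidates = [{"name": "Ada", "skill": "Python"}, {"name": "Bo", "skill": "Figma"}]
jobs = [{"title": "UX Designer"}, {"title": "Backend Engineer"}]
pairs = semantic_join(candidates, jobs, "skill", "title")
print([(p["name"], p["title"]) for p in pairs])
# [('Ada', 'Backend Engineer'), ('Bo', 'UX Designer')]
```

The two tables share no key column; the pairing comes entirely from the matcher. A naive nested loop like this one calls the matcher once per pair of rows, which is why the cost of semantic joins grows quickly with table size.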
Semantic GROUP BY
Group records by semantic similarity or conceptual categories.
python examples/semantic_ops_examples/semantic_group_by.py
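One way to picture grouping by conceptual category is to assign each row a label and bucket rows by that label. The sketch below assumes this label-then-bucket approach; SABER and its backends may group differently. `toy_label` is a hypothetical keyword classifier used in place of a language model.

```python
from collections import defaultdict

def toy_label(text: str) -> str:
    """Hypothetical stand-in for an LLM classifier (first matching cue wins)."""
    cues = {
        "billing": ("refund", "invoice", "charge"),
        "login": ("password", "sign in"),
    }
    lowered = text.lower()
    for label, words in cues.items():
        if any(word in lowered for word in words):
            return label
    return "other"

def semantic_group_by(rows, column, label=toy_label):
    """Bucket rows by the category assigned to `column`."""
    groups = defaultdict(list)
    for row in rows:
        groups[label(row[column])].append(row)
    return dict(groups)

tickets = [
    {"id": 1, "text": "I was charged twice, please refund me."},
    {"id": 2, "text": "Cannot sign in after password reset."},
    {"id": 3, "text": "Where is my invoice for March?"},
]
groups = semantic_group_by(tickets, "text")
print({label: [row["id"] for row in rows] for label, rows in groups.items()})
# {'billing': [1, 3], 'login': [2]}
```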
Semantic AGGREGATION
Perform aggregations with semantic understanding of the data.
python examples/semantic_ops_examples/semantic_aggregation.py
Semantic ORDER BY
Sort results based on semantic criteria like relevance, similarity, or conceptual ordering.
python examples/semantic_ops_examples/semantic_order_by.py
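As a rough illustration, semantic ordering can be modeled as sorting by a score computed from each row's text. Scoring each row independently is only one possible strategy and is an assumption of this sketch; `toy_urgency` is a hypothetical cue-word counter standing in for a model-produced score.

```python
def toy_urgency(text: str) -> int:
    """Hypothetical stand-in for an LLM relevance score: counts urgency cue words."""
    cues = ("outage", "urgent", "down", "immediately")
    lowered = text.lower()
    return sum(cue in lowered for cue in cues)

def semantic_order_by(rows, column, score=toy_urgency, descending=True):
    """Sort rows by the score of `column`; ties keep their input order."""
    return sorted(rows, key=lambda row: score(row[column]), reverse=descending)

messages = [
    {"id": 1, "text": "Please update the docs when convenient."},
    {"id": 2, "text": "Production is down, urgent fix needed immediately."},
    {"id": 3, "text": "Partial outage reported in one region."},
]
ranked = semantic_order_by(messages, "text")
print([row["id"] for row in ranked])  # [2, 3, 1]
```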
Semantic DISTINCT
Remove duplicates based on semantic similarity rather than exact matches.
python examples/semantic_ops_examples/semantic_distinct.py
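The following sketch shows the general shape of similarity-based deduplication: a greedy pass that keeps a value only if nothing already kept is similar to it. The function names and the 0.6 threshold are assumptions made for illustration, and `word_overlap` is a crude lexical proxy for the semantic similarity a real backend would compute.

```python
def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets; a crude proxy for semantic similarity."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 1.0

def semantic_distinct(values, similarity=word_overlap, threshold=0.6):
    """Greedy pass: keep a value only if it is not similar to any value already kept."""
    kept = []
    for value in values:
        if all(similarity(value, seen) < threshold for seen in kept):
            kept.append(value)
    return kept

cities = ["New York City", "new york city NY", "San Francisco", "San Francisco CA"]
unique_cities = semantic_distinct(cities)
print(unique_cities)  # ['New York City', 'San Francisco']
```

Because the first representative of each cluster is the one kept, the result of a greedy pass like this depends on input order.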
Semantic INTERSECT (ALL) and EXCEPT (ALL)
Perform set operations (INTERSECT, EXCEPT, and their ALL variants) based on semantic relationships rather than exact row equality.
# Semantic INTERSECT - Find semantically overlapping records
python examples/semantic_ops_examples/semantic_intersect.py
python examples/semantic_ops_examples/semantic_intersect_all.py
# Semantic EXCEPT - Find records in the first input with no semantic match in the second
python examples/semantic_ops_examples/semantic_except.py
python examples/semantic_ops_examples/semantic_except_all.py
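The distinction between the plain and ALL variants can be sketched by carrying SQL's set and bag semantics over to a semantic equality test. This is one plausible reading, stated as an assumption; SABER's exact semantics are defined in the paper and the example scripts. All names below are hypothetical, and `toy_same` uses an alias table in place of a language-model judgment.

```python
ALIASES = {"nyc": "new york", "sf": "san francisco"}  # toy knowledge base

def toy_same(a: str, b: str) -> bool:
    """Hypothetical stand-in for an LLM 'same entity?' judgment."""
    return ALIASES.get(a.lower(), a.lower()) == ALIASES.get(b.lower(), b.lower())

def _dedupe(items, same):
    """Keep the first representative of each group of equivalent items."""
    kept = []
    for item in items:
        if not any(same(item, seen) for seen in kept):
            kept.append(item)
    return kept

def semantic_intersect(left, right, same=toy_same):
    """Set semantics: distinct left items that have some match in right."""
    return _dedupe([x for x in left if any(same(x, y) for y in right)], same)

def semantic_except(left, right, same=toy_same):
    """Set semantics: distinct left items that have no match in right."""
    return _dedupe([x for x in left if not any(same(x, y) for y in right)], same)

def _consume_matches(left, right, same):
    """Pair each left item with at most one unused right item."""
    unused = list(right)
    matched, unmatched = [], []
    for item in left:
        hit = next((i for i, other in enumerate(unused) if same(item, other)), None)
        if hit is None:
            unmatched.append(item)
        else:
            del unused[hit]
            matched.append(item)
    return matched, unmatched

def semantic_intersect_all(left, right, same=toy_same):
    """Bag semantics: duplicates survive up to the number of matches in right."""
    return _consume_matches(left, right, same)[0]

def semantic_except_all(left, right, same=toy_same):
    """Bag semantics: each right item cancels at most one left item."""
    return _consume_matches(left, right, same)[1]

left = ["NYC", "New York", "New York", "SF", "Boston"]
right = ["new york", "NYC", "San Francisco"]
print(semantic_intersect(left, right))      # ['NYC', 'SF']
print(semantic_intersect_all(left, right))  # ['NYC', 'New York', 'SF']
print(semantic_except(left, right))         # ['Boston']
print(semantic_except_all(left, right))     # ['New York', 'Boston']
```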
Unified Query Examples
Backend-Agnostic Semantic Query Rewriting
Demonstrates how SABER automatically rewrites a backend-free semantic query into the form each supported semantic data processing system (LOTUS, DocETL, Palimpzest) expects, so users do not need to modify their code when switching backends.
python examples/unified_query_examples/unified_query.py
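The general idea of backend-agnostic rewriting can be illustrated with a small dispatch table: one backend-free description of an operation, and one renderer per target. Everything in this sketch is hypothetical, including the backend names and both placeholder syntaxes; it does not reproduce SABER's rewriter or the actual syntax of LOTUS, DocETL, or Palimpzest.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SemanticFilter:
    """Backend-free description of a semantic WHERE clause."""
    column: str
    condition: str

def to_brace_style(op: SemanticFilter) -> str:
    """Render for an imaginary backend that marks columns as {column}."""
    return f"{{{op.column}}} {op.condition}"

def to_template_style(op: SemanticFilter) -> str:
    """Render for an imaginary backend that marks columns as {{ input.column }}."""
    return f"{{{{ input.{op.column} }}}} {op.condition}"

# Hypothetical backend names, chosen only to show the dispatch pattern.
REWRITERS = {
    "brace_backend": to_brace_style,
    "template_backend": to_template_style,
}

def rewrite(op: SemanticFilter, backend: str) -> str:
    """Translate one backend-free operation into a backend-specific prompt string."""
    if backend not in REWRITERS:
        raise ValueError(f"unsupported backend: {backend}")
    return REWRITERS[backend](op)

op = SemanticFilter(column="review", condition="expresses a positive opinion")
for name in REWRITERS:
    print(name, "->", rewrite(op, name))
# brace_backend -> {review} expresses a positive opinion
# template_backend -> {{ input.review }} expresses a positive opinion
```

The user-facing object (`op`) stays the same across targets; only the renderer selected by the backend name changes.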
Citation
If you find this code useful, please consider citing our paper:
@misc{lee2025sabersqlcompatiblesemanticdocument,
  title={SABER: A SQL-Compatible Semantic Document Processing System Based on Extended Relational Algebra},
  author={Changjae Lee and Zhuoyue Zhao and Jinjun Xiong},
  year={2025},
  eprint={2509.00277},
  archivePrefix={arXiv},
  primaryClass={cs.DB},
  url={https://arxiv.org/abs/2509.00277},
}
Download files
File details
Details for the file saber_query-0.3.0.tar.gz.
File metadata
- Download URL: saber_query-0.3.0.tar.gz
- Upload date:
- Size: 36.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d15fc3da0daf10c0cbf5e855d9c27cadbb8c5cbb11778fe8260b084be44554dd |
| MD5 | 17215179728d689e51330b910203a4d1 |
| BLAKE2b-256 | 1347842b7322ed89fc69b0dddbad2aa03fafdc34c26899d0fcc3630d39875630 |
File details
Details for the file saber_query-0.3.0-py3-none-any.whl.
File metadata
- Download URL: saber_query-0.3.0-py3-none-any.whl
- Upload date:
- Size: 36.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4bae89a043fa46b3ca03d95162e574143c1699e1d87fb23775d3853dad2b5908 |
| MD5 | 41a37e9a24f36cefce4e9fa5c397e3d6 |
| BLAKE2b-256 | 044dea951866fa76df49008791741101b4598d09c4ba42ebee89bfa895ff923b |