Semantic compression for tabular data and images using vector quantization
Sempress
Sempress achieves 5-15x better compression than gzip on numeric-heavy datasets by learning per-column codebooks with K-Means vector quantization. String and ID columns are preserved losslessly; precision-critical columns can store exact residuals.
Installation
pip install sempress
Optional extras:
pip install sempress[image] # PSNR/SSIM metrics (scikit-image, scipy)
pip install sempress[audio] # Audio compression (librosa, soundfile)
pip install sempress[api] # FastAPI server (fastapi, uvicorn)
pip install sempress[all] # Everything
CLI Usage
# Compress CSV to .smp format
sempress encode --in data.csv --out data.smp --lock-cols id,timestamp --k 64
# Decompress back to CSV
sempress decode --in data.smp --out restored.csv
# Evaluate reconstruction quality
sempress eval --original data.csv --recon restored.csv --lock-cols id,timestamp
Options:
- --lock-cols: Columns preserved losslessly (strings, IDs, timestamps)
- --residual-cols: Columns with exact error stored (financial, scientific)
- --k: Codebook size per column (default: 64, range: 16-256)
- --uncert-thresh: Flag cells with relative error above threshold (default: 0.2)
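To make the `--uncert-thresh` option concrete, here is a minimal sketch of flagging cells by relative error. The README does not spell out how Sempress defines relative error, so the formula `|original - reconstructed| / |original|`, the `eps` guard against division by zero, and the function name `flag_uncertain` are all assumptions made for illustration.

```python
def flag_uncertain(original, reconstructed, thresh=0.2, eps=1e-12):
    """Return positions whose relative error exceeds thresh.

    Assumed definition: |x - y| / max(|x|, eps). This is an illustration,
    not the formula Sempress is confirmed to use.
    """
    flagged = []
    for i, (x, y) in enumerate(zip(original, reconstructed)):
        rel_err = abs(x - y) / max(abs(x), eps)
        if rel_err > thresh:
            flagged.append(i)
    return flagged


original = [10.0, 20.0, 0.5, 100.0]
reconstructed = [10.5, 19.0, 0.8, 100.0]

# Relative errors are roughly 0.05, 0.05, 0.6 and 0.0,
# so only position 2 is above the default threshold of 0.2.
print(flag_uncertain(original, reconstructed))  # [2]
```

Small values near zero dominate a relative-error check, which is one reason such columns may be better candidates for `--residual-cols`.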
Python API
from sempress import encode_csv, decode_to_csv
from sempress.table_encoder import EncodeConfig
config = EncodeConfig(
    lock_cols=["id", "timestamp"],
    residual_cols=["amount"],
    k=64,
    uncertainty_thresh=0.2,
)
# Compress
blob = encode_csv("data.csv", config)
with open("data.smp", "wb") as f:
    f.write(blob)
# Decompress
decode_to_csv(blob, "restored.csv")
How It Works
- Column analysis - auto-detects numeric vs categorical columns
- Learn codebooks - K-Means learns k centroids per numeric column
- Encode to indices - replaces values with nearest centroid index (uint16)
- Add residuals (optional) - stores exact errors for high-precision columns
- Package - msgpack + zstd container (.smp / SEMZ1 format) with schema and metadata
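The codebook, index, and residual steps above can be sketched for a single numeric column using only the standard library. This is an illustrative sketch, not Sempress's implementation: the function names (`learn_codebook`, `nearest`, `encode_column`, `decode_column`), the deterministic initialization, and the fixed iteration count are assumptions, and the msgpack + zstd packaging step is not shown.

```python
from array import array


def nearest(value, codebook):
    """Index of the centroid closest to value."""
    return min(range(len(codebook)), key=lambda j: abs(value - codebook[j]))


def learn_codebook(values, k, iters=20):
    """1-D K-Means (Lloyd's algorithm); returns at most k centroids, sorted."""
    distinct = sorted(set(values))
    m = min(k, len(distinct))
    # Deterministic init (an assumption): spread starting centroids
    # evenly across the sorted distinct values.
    centroids = [distinct[(2 * j + 1) * len(distinct) // (2 * m)] for j in range(m)]
    for _ in range(iters):
        # Assignment step: group each value with its nearest centroid.
        buckets = [[] for _ in centroids]
        for v in values:
            buckets[nearest(v, centroids)].append(v)
        # Update step: move each centroid to its bucket mean (keep it if empty).
        centroids = [sum(b) / len(b) if b else c for b, c in zip(buckets, centroids)]
    return sorted(centroids)


def encode_column(values, codebook, keep_residuals=False):
    """Replace values with centroid indices; optionally keep per-cell errors."""
    indices = [nearest(v, codebook) for v in values]
    residuals = None
    if keep_residuals:
        residuals = [v - codebook[i] for v, i in zip(values, indices)]
    # "H" is an unsigned short, matching the uint16 indices mentioned above.
    return array("H", indices), residuals


def decode_column(indices, codebook, residuals=None):
    """Look up centroids; add residuals back when they were stored."""
    approx = [codebook[i] for i in indices]
    if residuals is None:
        return approx
    # With float residuals the original is restored up to rounding error.
    return [a + r for a, r in zip(approx, residuals)]


temps = [20.1, 20.3, 19.9, 35.2, 35.0, 34.8, 50.5, 50.1]
codebook = learn_codebook(temps, k=3)
indices, residuals = encode_column(temps, codebook, keep_residuals=True)

lossy = decode_column(indices, codebook)                 # centroid values only
restored = decode_column(indices, codebook, residuals)   # centroids + residuals
print(list(indices))  # [0, 0, 0, 1, 1, 1, 2, 2]
```

Without residuals, every cell collapses to one of at most `k` centroid values, which is where the size reduction comes from. Storing residuals trades some of that saving for precision.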
Benchmarks
Tested on 10,000 rows of IoT sensor data (1.4 MB):
| Metric | Sempress | gzip | Improvement |
|---|---|---|---|
| Compression Ratio | 15.72x | 2.48x | +533% |
| Final Size | 93 KB | 603 KB | 84% smaller |
| Data Fidelity | 97.5% | 100% (lossless) | Configurable |
Sempress excels on numeric-heavy data (IoT, ML features, financial). For text-heavy or very small tables, gzip may be simpler.
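As a quick check on how the Improvement column relates to the other figures, the arithmetic can be reproduced from the table values alone. The variable names are illustrative, and the remark about truncation is an inference from the numbers, not something the table states.

```python
# Figures copied from the benchmark table above.
sempress_ratio = 15.72
gzip_ratio = 2.48
sempress_kb = 93
gzip_kb = 603

# How many times better Sempress's ratio is than gzip's.
relative_gain = sempress_ratio / gzip_ratio              # about 6.34x

# The same gain expressed as a percentage increase.
improvement_pct = (relative_gain - 1) * 100              # about 533.9%

# Reduction in final size relative to gzip's output.
size_reduction_pct = (1 - sempress_kb / gzip_kb) * 100   # about 84.6%

print(f"{relative_gain:.2f}x, +{improvement_pct:.1f}%, {size_reduction_pct:.1f}% smaller")
```

The computed values sit just above the table's +533% and 84%, which suggests the table truncates instead of rounding. The roughly 6.3x gain also falls inside the 5-15x range quoted in the introduction.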
Git LFS Integration
For automatic compression in Git repositories, see the companion plugin: git-lfs-sempress
Research Paper
@article{sempress2025,
title={Sempress: Semantic Compression for Numeric Tabular Data via Learned Vector Quantization},
author={Anderson, Keaton},
year={2025},
url={https://sempress.net}
}
License
MIT License - see LICENSE for details.