Semantic compression for tabular data and images using vector quantization

Sempress

Sempress achieves 5-15x better compression than gzip on numeric-heavy datasets by learning a per-column codebook with K-Means vector quantization. Numeric columns are therefore stored lossily, as indices into their codebook. String and ID columns are preserved losslessly, and precision-critical columns can store exact residuals.

Installation

pip install sempress

Optional extras:

pip install sempress[image]   # PSNR/SSIM metrics (scikit-image, scipy)
pip install sempress[audio]   # Audio compression (librosa, soundfile)
pip install sempress[api]     # FastAPI server (fastapi, uvicorn)
pip install sempress[all]     # Everything

CLI Usage

# Compress CSV to .smp format
sempress encode --in data.csv --out data.smp --lock-cols id,timestamp --k 64

# Decompress back to CSV
sempress decode --in data.smp --out restored.csv

# Evaluate reconstruction quality
sempress eval --original data.csv --recon restored.csv --lock-cols id,timestamp

Options:

  • --lock-cols: Columns preserved losslessly (strings, IDs, timestamps)
  • --residual-cols: Columns whose exact quantization error (residual) is stored alongside the codes (financial, scientific)
  • --k: Codebook size per column (default: 64, range: 16-256)
  • --uncert-thresh: Flag cells with relative error above threshold (default: 0.2)
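
To make the --uncert-thresh option concrete, here is a minimal sketch of flagging cells by relative error. It is not Sempress's implementation: the function name `flag_uncertain`, the `eps` guard, and the definition of relative error as |x - x_hat| / max(|x|, eps) are assumptions made for illustration.

```python
def flag_uncertain(original, reconstructed, thresh=0.2, eps=1e-12):
    """Return the positions of cells whose relative error exceeds `thresh`.

    Assumption: relative error is |x - x_hat| / max(|x|, eps). How Sempress
    itself treats zero-valued cells is not documented in this README.
    """
    flagged = []
    for i, (x, x_hat) in enumerate(zip(original, reconstructed)):
        rel_err = abs(x - x_hat) / max(abs(x), eps)
        if rel_err > thresh:
            flagged.append(i)
    return flagged


original = [10.0, 20.0, 0.5, 100.0]
reconstructed = [10.5, 19.0, 0.8, 100.0]

# Relative errors are 0.05, 0.05, 0.6 and 0.0, so only position 2 is flagged.
print(flag_uncertain(original, reconstructed))  # [2]
```

Small values are flagged most easily under this definition, because the same absolute error is a larger fraction of the original value.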

Python API

from sempress import encode_csv, decode_to_csv
from sempress.table_encoder import EncodeConfig

config = EncodeConfig(
    lock_cols=["id", "timestamp"],
    residual_cols=["amount"],
    k=64,
    uncertainty_thresh=0.2,
)

# Compress
blob = encode_csv("data.csv", config)
with open("data.smp", "wb") as f:
    f.write(blob)

# Decompress
decode_to_csv(blob, "restored.csv")

How It Works

  1. Column analysis - auto-detects numeric vs categorical columns
  2. Learn codebooks - K-Means learns k centroids per numeric column
  3. Encode to indices - replaces values with nearest centroid index (uint16)
  4. Add residuals (optional) - stores exact errors for high-precision columns
  5. Package - msgpack + zstd container (.smp / SEMZ1 format) with schema and metadata
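
Steps 2 to 4 can be sketched for a single numeric column using only the Python standard library. This is an illustrative sketch, not Sempress's code: the helper names (`learn_codebook`, `nearest`, `encode_column`, `decode_column`) are hypothetical, a plain 1-D Lloyd's iteration stands in for whatever K-Means implementation the package uses, and the msgpack + zstd packaging step is left out.

```python
import random
from array import array


def nearest(centroids, v):
    """Index of the centroid closest to v."""
    return min(range(len(centroids)), key=lambda i: abs(centroids[i] - v))


def learn_codebook(values, k, iters=20, seed=0):
    """1-D Lloyd's k-means: return at most k centroids for one column."""
    rng = random.Random(seed)
    distinct = sorted(set(values))
    centroids = rng.sample(distinct, min(k, len(distinct)))
    for _ in range(iters):
        buckets = [[] for _ in centroids]
        for v in values:
            buckets[nearest(centroids, v)].append(v)
        # Move each centroid to its bucket mean; keep it if the bucket is empty.
        centroids = [sum(b) / len(b) if b else c
                     for b, c in zip(buckets, centroids)]
    return centroids


def encode_column(values, k, keep_residuals=False):
    """Replace each value with the index of its nearest centroid."""
    codebook = learn_codebook(values, k)
    # Typecode "H" is an unsigned 16-bit integer on common platforms.
    indices = array("H", (nearest(codebook, v) for v in values))
    residuals = None
    if keep_residuals:
        residuals = [v - codebook[i] for v, i in zip(values, indices)]
    return codebook, indices, residuals


def decode_column(codebook, indices, residuals=None):
    """Look up centroids; add residuals back when they were stored."""
    if residuals is None:
        return [codebook[i] for i in indices]
    # With float residuals the sum matches the original up to rounding.
    return [codebook[i] + r for i, r in zip(indices, residuals)]


rng = random.Random(42)
temps = [20.0 + rng.gauss(0, 2) for _ in range(1000)]  # synthetic sensor column

codebook, idx, _ = encode_column(temps, k=16)
approx = decode_column(codebook, idx)
max_err = max(abs(a - b) for a, b in zip(temps, approx))

print(f"codebook size: {len(codebook)}, max abs error: {max_err:.3f}")
print(f"float64 bytes: {len(temps) * 8}, index bytes: {len(idx) * idx.itemsize}")
```

In this sketch the size reduction before any entropy coding comes from replacing 8-byte floats with 2-byte indices plus a small codebook. Storing residuals for a column gives that saving back in exchange for fidelity.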

Benchmarks

Tested on 10,000 rows of IoT sensor data (1.4 MB):

Metric             Sempress  gzip             Improvement
Compression Ratio  15.72x    2.48x            +533%
Final Size         93 KB     603 KB           84% smaller
Data Fidelity      97.5%     100% (lossless)  Configurable
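
The Improvement column can be reproduced from the other figures in the table. The short sketch below shows the arithmetic; the function names are illustrative only, and the inputs are the rounded values printed above, so results agree with the table only up to rounding.

```python
def improvement_pct(ratio, baseline_ratio):
    """Relative gain in compression ratio over a baseline, in percent."""
    return (ratio / baseline_ratio - 1) * 100


def size_reduction_pct(size, baseline_size):
    """How much smaller `size` is than `baseline_size`, in percent."""
    return (1 - size / baseline_size) * 100


# Compression ratios: Sempress 15.72x vs gzip 2.48x
print(f"{improvement_pct(15.72, 2.48):.1f}%")   # 533.9%

# Final sizes in KB: Sempress 93 vs gzip 603
print(f"{size_reduction_pct(93, 603):.1f}%")    # 84.6%
```

The table's +533% and 84% match these values when truncated rather than rounded.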

Sempress excels on numeric-heavy data (IoT, ML features, financial). For text-heavy or very small tables, plain gzip may be the simpler choice.

Git LFS Integration

For automatic compression in Git repositories, see the companion plugin: git-lfs-sempress

Research Paper

sempress.net/paper.pdf

@article{sempress2025,
  title={Sempress: Semantic Compression for Numeric Tabular Data via Learned Vector Quantization},
  author={Anderson, Keaton},
  year={2025},
  url={https://sempress.net}
}

License

MIT License - see LICENSE for details.
