Sempress

Semantic compression for tabular data and images using vector quantization

Sempress achieves 5-15x better compression than gzip on numeric-heavy datasets by learning per-column codebooks with K-Means vector quantization. String and ID columns are preserved losslessly; precision-critical columns can store exact residuals.

Installation

pip install sempress

Optional extras:

pip install sempress[image]   # PSNR/SSIM metrics (scikit-image, scipy)
pip install sempress[audio]   # Audio compression (librosa, soundfile)
pip install sempress[api]     # FastAPI server (fastapi, uvicorn)
pip install sempress[all]     # Everything

CLI Usage

# Compress CSV to .smp format
sempress encode --in data.csv --out data.smp --lock-cols id,timestamp --k 64

# Decompress back to CSV
sempress decode --in data.smp --out restored.csv

# Evaluate reconstruction quality
sempress eval --original data.csv --recon restored.csv --lock-cols id,timestamp

Options:

  • --lock-cols: Columns preserved losslessly (strings, IDs, timestamps)
  • --residual-cols: Columns whose exact quantization error (residual) is stored alongside the codebook index (financial, scientific)
  • --k: Codebook size per column (default: 64, range: 16-256)
  • --uncert-thresh: Flag cells with relative error above threshold (default: 0.2)
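
To make the --uncert-thresh option concrete, here is a minimal sketch of flagging cells by relative error. It assumes relative error means |original - reconstructed| / |original|; the README does not give Sempress's exact definition, and flag_uncertain and eps are hypothetical names used only for illustration.

```python
def flag_uncertain(original, reconstructed, thresh=0.2, eps=1e-12):
    """Return one boolean per cell: True where the relative error exceeds thresh.

    Assumption: relative error = |x - y| / |x|. The eps guard avoids
    division by zero and is not taken from Sempress.
    """
    flags = []
    for x, y in zip(original, reconstructed):
        rel_err = abs(x - y) / max(abs(x), eps)
        flags.append(rel_err > thresh)
    return flags


original = [100.0, 2.0, 50.0]
reconstructed = [98.0, 3.0, 50.0]
flags = flag_uncertain(original, reconstructed)  # only the middle cell is flagged
```

With the default threshold of 0.2, a cell reconstructed as 3.0 instead of 2.0 (50% relative error) is flagged, while 98.0 instead of 100.0 (2%) is not.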

Python API

from sempress import encode_csv, decode_to_csv
from sempress.table_encoder import EncodeConfig

config = EncodeConfig(
    lock_cols=["id", "timestamp"],
    residual_cols=["amount"],
    k=64,
    uncertainty_thresh=0.2,
)

# Compress
blob = encode_csv("data.csv", config)
with open("data.smp", "wb") as f:
    f.write(blob)

# Decompress
decode_to_csv(blob, "restored.csv")
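
Since lock_cols are described as lossless, one quick sanity check after decoding is to compare those columns between the original and restored files. The sketch below uses only the standard csv module; locked_columns_match is a hypothetical helper, not part of the Sempress API, and it assumes row order is preserved and that values are compared as raw strings.

```python
import csv


def locked_columns_match(original_path, restored_path, lock_cols):
    """Return True if every locked column is string-identical in both CSVs.

    Assumptions: both files have a header row and rows are in the same order.
    """
    with open(original_path, newline="") as f_orig, \
         open(restored_path, newline="") as f_rest:
        orig_rows = list(csv.DictReader(f_orig))
        rest_rows = list(csv.DictReader(f_rest))
    if len(orig_rows) != len(rest_rows):
        return False
    return all(
        o[col] == r.get(col)
        for o, r in zip(orig_rows, rest_rows)
        for col in lock_cols
    )
```

For example, locked_columns_match("data.csv", "restored.csv", ["id", "timestamp"]) would be expected to return True after the round trip above. For numeric fidelity, the sempress eval command remains the documented route.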

How It Works

  1. Column analysis - auto-detects numeric vs categorical columns
  2. Learn codebooks - K-Means learns k centroids per numeric column
  3. Encode to indices - replaces values with nearest centroid index (uint16)
  4. Add residuals (optional) - stores exact errors for high-precision columns
  5. Package - msgpack + zstd container (.smp / SEMZ1 format) with schema and metadata
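
Steps 2 to 4 can be illustrated on a single numeric column. This is a minimal pure-Python sketch of scalar K-Means quantization with optional residuals, not Sempress's actual implementation: the function names are hypothetical, and the evenly spaced centroid initialization is a simplification chosen to keep the example deterministic.

```python
def kmeans_1d(values, k, iters=20):
    """Lloyd's algorithm on scalars; returns a list of at most k centroids."""
    distinct = sorted(set(values))
    k = min(k, len(distinct))
    n = len(distinct)
    # Simplified init (an assumption): spread centroids over the sorted values.
    centroids = [distinct[int((j + 0.5) * n / k)] for j in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in centroids]
        for v in values:
            buckets[nearest(v, centroids)].append(v)
        # An empty bucket keeps its previous centroid.
        updated = [sum(b) / len(b) if b else c for b, c in zip(buckets, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def nearest(value, centroids):
    """Index of the centroid closest to value."""
    return min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))


def quantize(values, centroids):
    """Step 3: replace each value with the index of its nearest centroid."""
    return [nearest(v, centroids) for v in values]


def dequantize(indices, centroids, residuals=None):
    """Look up centroids; add residuals back when they were stored (step 4)."""
    if residuals is None:
        return [centroids[i] for i in indices]
    return [centroids[i] + r for i, r in zip(indices, residuals)]


column = [20.1, 20.3, 19.9, 35.2, 35.0, 34.8, 50.5, 50.7]
codebook = kmeans_1d(column, k=3)
indices = quantize(column, codebook)
residuals = [v - codebook[i] for v, i in zip(column, indices)]

lossy = dequantize(indices, codebook)             # centroid values only
exact = dequantize(indices, codebook, residuals)  # matches up to float rounding
```

The lossy path stores only the small integer indices plus the codebook, which is where the size reduction comes from. Keeping residuals recovers the original values at the cost of storing one extra number per cell.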

Benchmarks

Tested on 10,000 rows of IoT sensor data (1.4 MB):

Metric             Sempress  gzip             Improvement
Compression Ratio  15.72x    2.48x            +533%
Final Size         93 KB     603 KB           84% smaller
Data Fidelity      97.5%     100% (lossless)  Configurable

Sempress excels on numeric-heavy data (IoT, ML features, financial). For text-heavy or very small tables, gzip may be simpler.
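
To reproduce a gzip baseline on your own data and see how the Improvement column relates to the two ratios, a small sketch using only the standard library is shown below. The helper names are hypothetical, and the formulas are the usual definitions (ratio = original size / compressed size), which the README does not spell out.

```python
import gzip


def compression_ratio(original_size, compressed_size):
    """Original size divided by compressed size."""
    return original_size / compressed_size


def gzip_ratio(raw: bytes) -> float:
    """Compression ratio achieved by gzip at its default level."""
    return compression_ratio(len(raw), len(gzip.compress(raw)))


def improvement_pct(ratio, baseline_ratio):
    """Relative gain of one compression ratio over a baseline, in percent."""
    return (ratio / baseline_ratio - 1) * 100


def size_reduction_pct(size, baseline_size):
    """How much smaller one output is than a baseline output, in percent."""
    return (1 - size / baseline_size) * 100


# Figures taken from the benchmark table above.
improvement = improvement_pct(15.72, 2.48)
reduction = size_reduction_pct(93, 603)
```

Applied to the table's figures, these formulas give roughly 534% and 85%, consistent with the Improvement column up to rounding. To get a baseline for your own file, pass its bytes to gzip_ratio, for example gzip_ratio(open("data.csv", "rb").read()).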

Git LFS Integration

For automatic compression in Git repositories, see the companion plugin: git-lfs-sempress

Research Paper

sempress.net/paper.pdf

@article{sempress2025,
  title={Sempress: Semantic Compression for Numeric Tabular Data via Learned Vector Quantization},
  author={Anderson, Keaton},
  year={2025},
  url={https://sempress.net}
}

License

MIT License - see LICENSE for details.
