
Semantic compression for tabular data and images using vector quantization


Sempress

Semantic Compression API - Reduce Cloud Storage Costs by 90%


Sempress is a compression API service that achieves 5-15× better compression than gzip on numeric-heavy datasets through learned vector quantization. Perfect for IoT telemetry, ML feature stores, and financial data.

Proven Results: 15.72× compression ratio on real IoT data (vs gzip's 2.48×) = 533% improvement


🚀 Quick Start

Python Client (Recommended)

# Install
pip install sempress-client

# Compress
from sempress import SempressClient

client = SempressClient(api_key="sk_live_...")
result = client.compress_file("data.csv")

print(f"Compression: {result.ratio}×")
print(f"Saved: {result.space_saved_pct}%")
print(f"AWS Cost Savings: ${result.monthly_savings}/mo")

# Download compressed file
result.save("data.smp")

REST API (Any Language)

# Compress
curl -X POST https://api.sempress.net/v1/compress \
  -H "Authorization: Bearer sk_live_..." \
  -F "file=@data.csv"

# Response
{
  "job_id": "job_abc123",
  "compression_ratio": 12.5,
  "space_saved_pct": 92.0,
  "aws_savings_monthly": 45.50
}

💰 Pricing

Free Tier

  • 100 MB/month compression
  • Full API access
  • Web interface
  • Community support

Pro Tier - $29/month

  • 10 GB/month (100× more)
  • Priority processing (2× faster)
  • Email support
  • Advanced features

Enterprise - Custom Pricing

  • Unlimited usage
  • S3 direct integration
  • On-premise deployment
  • 24/7 support

→ Sign Up Free


📊 Performance

Tested on 10,000 rows of IoT data (1.4 MB):

Metric            | Sempress | gzip   | Improvement
Compression Ratio | 15.72×   | 2.48×  | +533%
Final Size        | 93 KB    | 603 KB | 84% smaller
Space Saved       | 93.64%   | 59.73% | +57%
Data Fidelity     | 97.5%    | N/A    | Configurable
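
The derived figures in the table follow from the two compression ratios. The sketch below shows that arithmetic; the function names are illustrative and not part of the Sempress API.

```python
def space_saved_pct(ratio: float) -> float:
    """Percent of the original size removed at a given compression ratio."""
    return (1.0 - 1.0 / ratio) * 100.0


def relative_improvement_pct(ratio: float, baseline_ratio: float) -> float:
    """How much larger `ratio` is than `baseline_ratio`, in percent."""
    return (ratio / baseline_ratio - 1.0) * 100.0


sempress_ratio = 15.72  # compression ratio reported for Sempress
gzip_ratio = 2.48       # compression ratio reported for gzip

print(f"Sempress space saved: {space_saved_pct(sempress_ratio):.2f}%")
print(f"gzip space saved:     {space_saved_pct(gzip_ratio):.2f}%")
print(f"Ratio improvement:    {relative_improvement_pct(sempress_ratio, gzip_ratio):.2f}%")
```

Because the published ratios are rounded, values computed this way can differ from the table in the last digit; the table presumably derives from the measured byte sizes.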

🎯 Use Cases

IoT & Telemetry

Compress sensor data streams by 90%+. Perfect for:

  • Industrial IoT monitoring
  • Smart city deployments
  • Fleet management systems

ML Feature Stores

Reduce S3 costs for training data:

  • High-dimensional feature vectors
  • Time-series embeddings
  • Model training datasets

Financial Data

Archive tick data with precision:

  • High-frequency trading data
  • Market microstructure
  • Historical financial time series

🔧 Core Library (Open Source)

This repository contains sempress-core, the open-source compression library.

Installation

pip install git+https://github.com/jalyper/sempress-core.git

CLI Usage

# Compress
sempress encode --in data.csv --out data.smp --k 64

# Decompress  
sempress decode --in data.smp --out restored.csv

Python API

from sempress import encode_csv, decode_to_csv
from sempress.table_encoder import EncodeConfig

# Compress
config = EncodeConfig(
    lock_cols=["id", "timestamp"],  # Lossless
    residual_cols=["price"],         # Perfect precision
    k=64                             # Codebook size
)
compressed = encode_csv("data.csv", config)

# Save
with open("data.smp", "wb") as f:
    f.write(compressed)

# Decompress
decode_to_csv(compressed, "restored.csv")

🌐 Sempress.net Service

Live Service: https://sempress.net

The commercial Sempress service provides:

  • ✅ REST API for any language
  • ✅ Python, JavaScript, Go clients
  • ✅ Web interface with analytics
  • ✅ Job tracking & metrics
  • ✅ Usage-based pricing
  • ✅ Enterprise features (S3 integration, batch processing)

Service Code: See /vercel-deploy/ for the production deployment


📖 Documentation


🛠️ Development

For Commercial Service (sempress.net)

See /vercel-deploy/ directory for:

  • Production website code
  • API backend implementation
  • Authentication system
  • Payment integration

For Core Library Development

# Clone
git clone https://github.com/jalyper/sempress-core.git
cd sempress-core

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Run benchmarks
python scripts/run_benchmarks.py

🤝 Contributing

We welcome contributions to the core library!

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

See CONTRIBUTING.md for details.


📄 Research Paper

Sempress: Semantic Compression for Tabular Data via Learned Vector Quantization

  • PDF: sempress.net/paper.pdf
  • Published: January 2025
  • Version: 0.2.0
  • Author: Keaton Anderson
  • License: MIT (Open Source)

🗺️ Roadmap

Core Library (Open Source)

  • CSV compression with vector quantization
  • CLI tool
  • Python API
  • Git LFS plugin
  • Image compression (experimental)
  • Parquet support
  • Arrow support
  • Streaming compression

Commercial Service (sempress.net)

  • REST API with authentication
  • Web interface with analytics
  • Free & Pro tiers
  • Python client library (sempress-client)
  • JavaScript client library
  • S3 direct integration
  • Batch processing API
  • Enterprise on-premise deployment

See CUSTOMER_LIFECYCLE_STRATEGY.md for detailed roadmap.


📧 Contact


📜 License

MIT License - See LICENSE for details.

Note: The core compression library is open source. The commercial API service at sempress.net is a hosted offering with additional features.


🙏 Citation

If you use Sempress in your research, please cite:

@software{sempress2025,
  title={Sempress: Semantic Compression for Tabular Data},
  author={Anderson, Keaton},
  year={2025},
  url={https://sempress.net}
}

Built with ❤️ for the data science community


💡 Key Features

  • Semantic Compression: Learns column-wise patterns using K-Means vector quantization
  • Lossless Locked Columns: Automatically preserves strings, categoricals, and IDs with 100% fidelity
  • Optional Residuals: Achieve near-zero error on precision-critical columns (financial, scientific)
  • Uncertainty Tracking: Flags cells with high quantization error for quality monitoring
  • Fast Decode: Competitive with gzip+CSV parse (0.9-1.5× the time of gzip decompression plus CSV parsing)
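
To illustrate the Optional Residuals feature, here is a minimal sketch of how storing the per-cell quantization error lets a column be rebuilt almost exactly. It assumes a codebook is already available; the helper names and the sample codebook are hypothetical and do not come from sempress-core.

```python
def nearest_index(value, centroids):
    """Index of the centroid closest to `value`."""
    return min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))


def encode_with_residuals(values, centroids):
    """Return (indices, residuals), where residual = value - chosen centroid."""
    indices = [nearest_index(v, centroids) for v in values]
    residuals = [v - centroids[i] for v, i in zip(values, indices)]
    return indices, residuals


def decode(indices, centroids, residuals=None):
    """Rebuild values from centroid indices, adding residuals when present."""
    if residuals is None:
        return [centroids[i] for i in indices]
    return [centroids[i] + r for i, r in zip(indices, residuals)]


centroids = [100.0, 101.5, 103.0]            # hypothetical codebook for a price column
prices = [100.02, 101.47, 102.96, 100.11]

indices, residuals = encode_with_residuals(prices, centroids)
lossy = decode(indices, centroids)             # centroids only
exact = decode(indices, centroids, residuals)  # centroids plus stored errors

print("max error without residuals:", max(abs(a - b) for a, b in zip(prices, lossy)))
print("max error with residuals:   ", max(abs(a - b) for a, b in zip(prices, exact)))
```

The trade-off is size: residuals are full-precision numbers, so a residual column compresses far less than an index-only column.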

📖 How It Works

Sempress applies per-column K-Means vector quantization to numeric data:

  1. Column Analysis: Auto-detects numeric vs categorical columns
  2. Learn Codebooks: K-Means learns k=64 centroids per numeric column
  3. Encode to Indices: Replace values with nearest centroid index (uint16)
  4. Add Residuals (optional): Store exact errors for high-precision columns
  5. Package: Msgpack + Zstd container with schema and metadata

Result: Exploit semantic patterns in numeric data instead of treating tables as byte streams.
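
Steps 2 and 3 above can be sketched in a few lines. This is a dependency-free illustration of one-dimensional K-Means (Lloyd's algorithm) on a single column, not the library's implementation, which per the requirements relies on scikit-learn. The function names, the quantile-based initialization, and the use of the standard-library `array` module in place of the Msgpack + Zstd container are all assumptions made for the example.

```python
from array import array


def init_centroids(values, k):
    """Deterministic start: evenly spaced quantiles of the sorted column."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[min(n - 1, int((i + 0.5) * n / k))] for i in range(k)]


def nearest(value, centroids):
    """Index of the centroid closest to `value`."""
    return min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))


def learn_codebook(values, k, iterations=25):
    """Plain Lloyd's algorithm on a single numeric column."""
    centroids = init_centroids(values, k)
    for _ in range(iterations):
        buckets = [[] for _ in centroids]
        for v in values:
            buckets[nearest(v, centroids)].append(v)
        # An empty cluster keeps its previous centroid.
        centroids = [sum(b) / len(b) if b else c for b, c in zip(buckets, centroids)]
    return centroids


# Toy column with three obvious clusters; k=3 here instead of the default 64.
temperatures = [20.0, 20.1, 19.9, 50.0, 50.2, 49.8, 80.0, 80.1, 79.9]

codebook = learn_codebook(temperatures, k=3)             # step 2: learn centroids
indices = [nearest(v, codebook) for v in temperatures]   # step 3: values -> indices
packed = array("H", indices).tobytes()                   # unsigned short, 2 bytes each on common platforms
restored = [codebook[i] for i in indices]                # decoding is a table lookup

print("codebook:", [round(c, 2) for c in codebook])
print("indices: ", indices)
print("max abs error:", max(abs(a - b) for a, b in zip(temperatures, restored)))
```

Each value shrinks to a small integer index plus a shared codebook, which is where the compression comes from; the reconstruction error is bounded by how far a value sits from its nearest centroid.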


🛠️ Installation

Requirements

  • Python 3.10+
  • pandas, numpy, scikit-learn, msgpack, zstandard

Install from Source

git clone https://github.com/jalyper/sempress.git
cd sempress
pip install -e .

Dependencies

pip install pandas numpy scikit-learn msgpack zstandard

Option 3: Interactive Web Platform

Run the full web platform locally with file upload, baseline comparisons, and downloads:

# Clone and start
git clone https://github.com/jalyper/sempress.git
cd sempress
./scripts/start_dev_server.sh

Then open: http://localhost:3000

Features:

  • 📤 Upload CSV files (up to 50MB)
  • 📊 Real-time compression with Sempress
  • ⚖️ Compare against GZIP, BZ2, LZMA, ZSTD
  • 💾 Download .smp and reconstructed CSV
  • 📈 Quality analysis with per-column metrics

Deployment Options:


📚 Usage Guide

Basic Compression

# Encode CSV to .smp format
sempress encode \
  --in data.csv \
  --out data.smp \
  --lock-cols user_id,timestamp \
  --k 64

Options:

  • --lock-cols: Columns to preserve losslessly (comma-separated)
  • --residual-cols: High-precision columns (store exact errors)
  • --k: Codebook size (default: 64, range: 16-256)
  • --uncert-thresh: Flag cells with >X relative error (default: 0.2)
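
As a rough picture of what --uncert-thresh controls, the sketch below flags cells whose relative reconstruction error exceeds a threshold. The function is hypothetical, and the handling of zero-valued originals is an assumption; the source does not say how Sempress treats that case.

```python
def flag_uncertain(original, reconstructed, threshold=0.2):
    """Return one boolean per cell: True when relative error exceeds threshold."""
    flags = []
    for orig, recon in zip(original, reconstructed):
        if orig == 0:
            # Assumption: with a zero original, any nonzero reconstruction is flagged.
            flags.append(recon != 0)
        else:
            flags.append(abs(recon - orig) / abs(orig) > threshold)
    return flags


original = [10.0, 10.0, 0.5, 0.0]
reconstructed = [10.5, 13.0, 0.8, 0.0]

flags = flag_uncertain(original, reconstructed)
print(flags)
print(f"{100 * sum(flags) / len(flags):.0f}% of cells flagged")
```

A lower threshold flags more cells; raising --k is the other lever, since a larger codebook reduces the error in the first place.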

Decompression

# Decode .smp back to CSV
sempress decode \
  --in data.smp \
  --out data_reconstructed.csv

Quality Evaluation

# Compare original vs reconstructed
sempress eval \
  --original data.csv \
  --recon data_reconstructed.csv \
  --lock-cols user_id,timestamp

Metrics:

  • Locked columns: Exact match rate (should be 100%)
  • Numeric columns: RMSE, MAPE, KS-distance
  • Uncertainty: % of cells flagged
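
For readers who want to check these metrics independently, here is a minimal standard-library sketch of RMSE, MAPE, and the two-sample KS distance. It is not the code behind `sempress eval`; the function names and the choice to skip zero-valued originals in MAPE are assumptions.

```python
import math
from bisect import bisect_right


def rmse(original, recon):
    """Root mean squared error between paired values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(original, recon)) / len(original))


def mape(original, recon):
    """Mean absolute percentage error, skipping zero originals (an assumption)."""
    terms = [abs((a - b) / a) for a, b in zip(original, recon) if a != 0]
    return 100.0 * sum(terms) / len(terms) if terms else 0.0


def ks_distance(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:  # the largest gap always occurs at an observed value
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


original = [1.0, 2.0, 3.0, 4.0]
recon = [1.0, 2.0, 2.0, 4.0]

print("RMSE:", rmse(original, recon))
print("MAPE:", round(mape(original, recon), 2), "%")
print("KS:  ", ks_distance(original, recon))
```

RMSE and MAPE measure per-cell error, while the KS distance checks whether the overall distribution of a column survived quantization, which matters for downstream statistics and model training.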

🐍 Python API

from sempress import encode_csv, decode_to_csv
from sempress.table_encoder import EncodeConfig

# Configure encoder
config = EncodeConfig(
    lock_cols=['user_id', 'timestamp'],
    residual_cols=['amount'],
    k=64,
    uncertainty_thresh=0.2
)

# Encode
compressed_blob = encode_csv('data.csv', config)

# Save to file
with open('data.smp', 'wb') as f:
    f.write(compressed_blob)

# Decode
decode_to_csv(compressed_blob, 'reconstructed.csv')


📊 Benchmarking

Run comprehensive benchmarks on your data:

# Generate synthetic datasets
python scripts/generate_datasets.py --rows 100000

# Run benchmarks
python scripts/comprehensive_benchmark.py --out results.json

# Generate figures
python scripts/generate_figures.py

Included datasets:

  • IoT Telemetry (sensor readings)
  • ML Features (user behavior)
  • Financial (stock market OHLC)
  • Sensor Physics (accelerometer, magnetometer)
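
To put a Sempress result next to general-purpose codecs without the benchmark scripts, a baseline can be computed with the Python standard library alone. This is a sketch under stated assumptions: the helper name and the inline synthetic CSV are made up for the example, and Zstd is omitted because it needs the third-party zstandard package.

```python
import bz2
import gzip
import lzma


def baseline_ratios(raw: bytes) -> dict:
    """Compression ratio (original size / compressed size) per stdlib codec."""
    codecs = {"gzip": gzip.compress, "bz2": bz2.compress, "lzma": lzma.compress}
    return {name: len(raw) / len(compress(raw)) for name, compress in codecs.items()}


# Small synthetic telemetry CSV, generated inline so the sketch is self-contained.
rows = ["sensor_id,temperature,pressure"]
rows += [
    f"{i % 8},{20 + (i % 50) * 0.1:.1f},{1013 + (i % 30) * 0.5:.1f}"
    for i in range(5000)
]
raw = "\n".join(rows).encode("utf-8")

for name, ratio in baseline_ratios(raw).items():
    print(f"{name}: {ratio:.2f}x")
```

Synthetic, highly regular data like this flatters every codec, so ratios measured here say little about real sensor data; run the same function on your own files for a meaningful baseline.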

🔗 Integrations

Git LFS Plugin

Automatic compression for Git repositories - Perfect for ML teams!

  • Repository: github.com/jalyper/git-lfs-sempress
  • Features:
    • Zero workflow changes (works with git add/commit)
    • 8-12× compression on CSV files
    • Intelligent quality monitoring
    • 15 automated tests, all passing
  • Installation: pip install git+https://github.com/jalyper/git-lfs-sempress.git

Use Cases:

  • ML training datasets in Git repos
  • Data science notebooks with large CSVs
  • IoT data collection repositories
  • Collaborative data projects

🎯 When to Use Sempress

✅ Sempress Excels On:

  • High numeric density (>60% numeric columns)
  • IoT/sensor data (temperature, pressure, acceleration)
  • ML feature stores (continuous features for training)
  • Financial data (tick data, OHLC prices)
  • Large datasets (>10K rows)

⚠️ Use Gzip Instead For:

  • Text-heavy tables (<50% numeric)
  • Small tables (<5K rows)
  • Real-time streaming (Sempress has higher encode overhead)
  • High categorical cardinality
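
The thresholds above can be turned into a quick pre-check. The sketch below is a hypothetical helper, not part of Sempress: it counts a column as numeric only if every cell parses as a float, assumes a rectangular CSV with a header row, and ignores the streaming and cardinality criteria.

```python
import csv
import io


def _is_number(text: str) -> bool:
    """True when the cell parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def recommend(csv_text: str) -> str:
    """Apply the numeric-density and row-count rules of thumb listed above."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    numeric_cols = sum(
        1
        for col in range(len(header))
        if body and all(_is_number(row[col]) for row in body)
    )
    numeric_share = numeric_cols / len(header)
    if numeric_share > 0.6 and len(body) > 10_000:
        return "sempress"
    if numeric_share < 0.5 or len(body) < 5_000:
        return "gzip"
    return "benchmark both"  # the guidance above leaves this middle ground open


sample = "reading,voltage,current,label\n" + "\n".join(
    f"{i},{i * 0.5},{i % 7},x{i}" for i in range(12_000)
)
print(recommend(sample))
```

Treat the answer as a starting point; the benchmark scripts give a real measurement on your data.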

📄 Research Paper

Full paper: https://sempress.net/paper.pdf

Citation:

@article{sempress2025,
  title={Sempress: Semantic Compression for Numeric Tabular Data via Learned Vector Quantization},
  author={Anderson, Keaton},
  year={2025},
  note={Independent research with implementation assistance from AI coding agents},
  url={https://sempress.net}
}


🤝 Contributing

We welcome contributions! Areas for improvement:

  • Streaming ingestion (chunked encoding for >100GB files)
  • Learned entropy coding (autoregressive priors on index sequences)
  • Time-series VQ (segment-wise codebooks for temporal data)
  • Database integrations (PostgreSQL extension, ClickHouse codec)
  • Text compression (LLM-based semantic tokens for mixed data)

How to contribute:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📁 Repository Structure

sempress/
├── src/sempress/           # Core library
│   ├── table_encoder.py    # K-Means VQ encoder
│   ├── table_decoder.py    # Decoder with uncertainty
│   ├── container.py        # Msgpack + Zstd packaging
│   └── cli.py              # Command-line interface
├── scripts/                # Benchmarking & datasets
│   ├── generate_datasets.py
│   ├── comprehensive_benchmark.py
│   └── generate_figures.py
├── data/                   # Sample datasets
├── tests/                  # Unit tests
├── docs/                   # Documentation & paper
└── README.md               # This file


🧪 Running Tests

# Install test dependencies
pip install pytest

# Run tests
pytest tests/

# With coverage
pytest --cov=sempress tests/


📊 Reproducing Paper Results

# Generate datasets
python scripts/generate_datasets.py

# Run all benchmarks (takes ~10 minutes)
python scripts/comprehensive_benchmark.py

# Generate paper figures
python scripts/generate_figures.py

# Results saved to logs/ and docs/assets/


📈 Performance Benchmarks

Encode time (100K rows):

  • Telemetry: 5.83s
  • ML Features: 11.20s
  • Financial: 9.08s

Decode time (100K rows):

  • Telemetry: 0.28s (1.47× gzip+parse)
  • ML Features: 0.55s (1.28× gzip+parse)
  • Financial: 0.28s (1.17× gzip+parse)

Memory usage:

  • Peak during encode: 2-3× original file size
  • Peak during decode: 1.5-2× original file size

🐛 Known Issues

  • In-memory processing: Files must fit in RAM (working on streaming)
  • Fixed k per column: No adaptive sizing yet
  • CSV-only: Parquet/Arrow support coming soon

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.


🌟 Star History

If you find Sempress useful, please star the repository! ⭐


📞 Contact


🙏 Acknowledgments

Independent research (no external funding).

Built with: Python, pandas, numpy, scikit-learn, msgpack, zstandard


Made with ❤️ for the data compression community
