Semantic compression for tabular data and images using vector quantization
Sempress
Semantic Compression API - Reduce Cloud Storage Costs by 90%
Sempress is a compression API service that achieves 5-15× better compression than gzip on numeric-heavy datasets through learned vector quantization. Perfect for IoT telemetry, ML feature stores, and financial data.
Proven Results: 15.72× compression ratio on real IoT data (vs gzip's 2.48×) = 533% improvement
Quick Start
Python Client (Recommended)
# Install
pip install sempress-client
# Compress
from sempress import SempressClient
client = SempressClient(api_key="sk_live_...")
result = client.compress_file("data.csv")
print(f"Compression: {result.ratio}×")
print(f"Saved: {result.space_saved_pct}%")
print(f"AWS Cost Savings: ${result.monthly_savings}/mo")
# Download compressed file
result.save("data.smp")
REST API (Any Language)
# Compress
curl -X POST https://api.sempress.net/v1/compress \
  -H "Authorization: Bearer sk_live_..." \
  -F "file=@data.csv"
# Response
{
  "job_id": "job_abc123",
  "compression_ratio": 12.5,
  "space_saved_pct": 92.0,
  "aws_savings_monthly": 45.50
}
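The two headline fields in the example response describe the same quantity from two angles: a file compressed by a ratio r keeps 1/r of its bytes, so the space saved is 1 - 1/r. The sketch below parses the example response shown above and checks that the two fields agree. The field names come from that example; the helper name `ratio_from_saved_pct` is hypothetical and not part of any Sempress client.

```python
import json

# Example response body copied from the README section above.
response_body = """
{
  "job_id": "job_abc123",
  "compression_ratio": 12.5,
  "space_saved_pct": 92.0,
  "aws_savings_monthly": 45.50
}
"""


def ratio_from_saved_pct(saved_pct: float) -> float:
    """Convert 'percent of bytes saved' into a compression ratio."""
    return 1.0 / (1.0 - saved_pct / 100.0)


result = json.loads(response_body)
implied_ratio = ratio_from_saved_pct(result["space_saved_pct"])

# The implied ratio should match the reported one up to rounding.
print(f"reported: {result['compression_ratio']}x, implied: {implied_ratio:.2f}x")
```

A check like this is a cheap sanity test when logging job results, since a mismatch between the two fields would point to a parsing or unit error.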
Pricing
Free Tier
- 100 MB/month compression
- Full API access
- Web interface
- Community support
Pro Tier - $29/month
- 10 GB/month (100× more)
- Priority processing (2× faster)
- Email support
- Advanced features
Enterprise - Custom Pricing
- Unlimited usage
- S3 direct integration
- On-premise deployment
- 24/7 support
Performance
Tested on 10,000 rows of IoT data (1.4 MB):
| Metric | Sempress | gzip | Improvement |
|---|---|---|---|
| Compression Ratio | 15.72× | 2.48× | +533% |
| Final Size | 93 KB | 603 KB | 84% smaller |
| Space Saved | 93.64% | 59.73% | +57% |
| Data Fidelity | 97.5% | N/A | Configurable |
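The rows of this table are related by simple arithmetic, which the following sketch reproduces from the rounded figures in the table. The function names are illustrative only. Because the inputs are rounded, the results can differ slightly from the table in the last digit.

```python
def space_saved_pct(ratio: float) -> float:
    """Percent of the original bytes removed at a given compression ratio."""
    return (1.0 - 1.0 / ratio) * 100.0


def relative_gain_pct(new: float, baseline: float) -> float:
    """Percent by which `new` exceeds `baseline`."""
    return (new / baseline - 1.0) * 100.0


sempress_ratio, gzip_ratio = 15.72, 2.48  # "Compression Ratio" row
sempress_kb, gzip_kb = 93, 603            # "Final Size" row

print(f"Sempress space saved:  {space_saved_pct(sempress_ratio):.2f}%")
print(f"gzip space saved:      {space_saved_pct(gzip_ratio):.2f}%")
print(f"Ratio improvement:     {relative_gain_pct(sempress_ratio, gzip_ratio):.1f}%")
print(f"Output size reduction: {(1 - sempress_kb / gzip_kb) * 100:.1f}%")
```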
Use Cases
IoT & Telemetry
Compress sensor data streams by 90%+. Perfect for:
- Industrial IoT monitoring
- Smart city deployments
- Fleet management systems
ML Feature Stores
Reduce S3 costs for training data:
- High-dimensional feature vectors
- Time-series embeddings
- Model training datasets
Financial Data
Archive tick data with precision:
- High-frequency trading data
- Market microstructure
- Historical financial time series
Core Library (Open Source)
This repository contains sempress-core, the open-source compression library.
Installation
pip install git+https://github.com/jalyper/sempress-core.git
CLI Usage
# Compress
sempress encode --in data.csv --out data.smp --k 64
# Decompress
sempress decode --in data.smp --out restored.csv
Python API
from sempress import encode_csv, decode_to_csv
from sempress.table_encoder import EncodeConfig
# Compress
config = EncodeConfig(
    lock_cols=["id", "timestamp"],  # Lossless
    residual_cols=["price"],        # Perfect precision
    k=64                            # Codebook size
)
compressed = encode_csv("data.csv", config)
# Save
with open("data.smp", "wb") as f:
    f.write(compressed)
# Decompress
decode_to_csv(compressed, "restored.csv")
Sempress.net Service
Live Service: https://sempress.net
The commercial Sempress service provides:
- REST API for any language
- Python, JavaScript, Go clients
- Web interface with analytics
- Job tracking & metrics
- Usage-based pricing
- Enterprise features (S3 integration, batch processing)
Service Code: See /vercel-deploy/ for the production deployment
Documentation
- Customer Lifecycle Strategy - Complete product roadmap
- Deployment Guide - Launch & scaling plan
- Research Paper - Technical details
- Image Compression - Image compression features (experimental)
Development
For Commercial Service (sempress.net)
See /vercel-deploy/ directory for:
- Production website code
- API backend implementation
- Authentication system
- Payment integration
For Core Library Development
# Clone
git clone https://github.com/jalyper/sempress-core.git
cd sempress-core
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/
# Run benchmarks
python scripts/run_benchmarks.py
Contributing
We welcome contributions to the core library!
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
See CONTRIBUTING.md for details.
Research Paper
Sempress: Semantic Compression for Tabular Data via Learned Vector Quantization
- PDF: sempress.net/paper.pdf
- Published: January 2025
- Version: 0.2.0
- Author: Keaton Anderson
- License: MIT (Open Source)
Roadmap
Core Library (Open Source)
- CSV compression with vector quantization
- CLI tool
- Python API
- Git LFS plugin
- Image compression (experimental)
- Parquet support
- Arrow support
- Streaming compression
Commercial Service (sempress.net)
- REST API with authentication
- Web interface with analytics
- Free & Pro tiers
- Python client library (sempress-client)
- JavaScript client library
- S3 direct integration
- Batch processing API
- Enterprise on-premise deployment
See CUSTOMER_LIFECYCLE_STRATEGY.md for detailed roadmap.
Contact
- Website: sempress.net
- Email: hello@sempress.net
- GitHub: @jalyper
- Issues: GitHub Issues
License
MIT License - See LICENSE for details.
Note: The core compression library is open source. The commercial API service at sempress.net is a hosted offering with additional features.
Citation
If you use Sempress in your research, please cite:
@software{sempress2025,
  title={Sempress: Semantic Compression for Tabular Data},
  author={Anderson, Keaton},
  year={2025},
  url={https://sempress.net}
}
Built with ❤️ for the data science community
Key Features
- Semantic Compression: Learns column-wise patterns using K-Means vector quantization
- Lossless Locked Columns: Automatically preserves strings, categoricals, and IDs with 100% fidelity
- Optional Residuals: Achieve near-zero error on precision-critical columns (financial, scientific)
- Uncertainty Tracking: Flags cells with high quantization error for quality monitoring
- Fast Decode: Competitive with gzip+CSV parse (0.9-1.5× overhead)
How It Works
Sempress applies per-column K-Means vector quantization to numeric data:
- Column Analysis: Auto-detects numeric vs categorical columns
- Learn Codebooks: K-Means learns k=64 centroids per numeric column
- Encode to Indices: Replace values with nearest centroid index (uint16)
- Add Residuals (optional): Store exact errors for high-precision columns
- Package: Msgpack + Zstd container with schema and metadata
Result: Exploit semantic patterns in numeric data instead of treating tables as byte streams.
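The encode and decode steps above can be sketched for a single numeric column as follows. This is a minimal illustration under stated assumptions, not the library's implementation: it swaps in plain 1-D Lloyd iterations written in pure Python in place of scikit-learn's K-Means, and all function names (`fit_codebook`, `encode_column`, `decode_column`) are hypothetical.

```python
import random


def nearest(centroids, v):
    """Index of the centroid closest to v."""
    return min(range(len(centroids)), key=lambda j: abs(centroids[j] - v))


def fit_codebook(values, k=64, iters=10):
    """Learn k centroids for one numeric column with 1-D Lloyd iterations."""
    ordered = sorted(values)
    n = len(ordered)
    # Quantile initialisation: spread the starting centroids over the data.
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        sums, counts = [0.0] * k, [0] * k
        for v in values:
            j = nearest(centroids, v)
            sums[j] += v
            counts[j] += 1
        # Move each centroid to the mean of its members; leave empty ones alone.
        centroids = [sums[j] / counts[j] if counts[j] else centroids[j]
                     for j in range(k)]
    return centroids


def encode_column(values, k=64, keep_residuals=False):
    """Replace each value by the index of its nearest centroid."""
    centroids = fit_codebook(values, k)
    indices = [nearest(centroids, v) for v in values]
    residuals = None
    if keep_residuals:
        # Optional residuals: the exact error left after quantization.
        residuals = [v - centroids[i] for v, i in zip(values, indices)]
    return centroids, indices, residuals


def decode_column(centroids, indices, residuals=None):
    """Look up centroids and, if present, add the stored residuals back."""
    approx = [centroids[i] for i in indices]
    if residuals is None:
        return approx
    return [a + r for a, r in zip(approx, residuals)]


random.seed(0)
temps = [20.0 + random.gauss(0, 2) for _ in range(1000)]  # synthetic column

centroids, indices, residuals = encode_column(temps, k=64, keep_residuals=True)
lossy = decode_column(centroids, indices)
exact = decode_column(centroids, indices, residuals)

mean_err = sum(abs(a - b) for a, b in zip(temps, lossy)) / len(temps)
max_exact_err = max(abs(a - b) for a, b in zip(temps, exact))
print(f"mean abs error without residuals: {mean_err:.4f}")
print(f"max abs error with residuals:     {max_exact_err:.2e}")
```

Without residuals, each value costs only a small integer index plus a share of the codebook, which is where the size reduction comes from. With residuals, the reconstruction is exact up to floating-point rounding, at the cost of storing one extra number per cell.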
Installation
Requirements
- Python 3.10+
- pandas, numpy, scikit-learn, msgpack, zstandard
Install from Source
git clone https://github.com/jalyper/sempress.git
cd sempress
pip install -e .
Dependencies
pip install pandas numpy scikit-learn msgpack zstandard
Option 3: Interactive Web Platform
Run the full web platform locally with file upload, baseline comparisons, and downloads:
# Clone and start
git clone https://github.com/jalyper/sempress.git
cd sempress
./scripts/start_dev_server.sh
Then open: http://localhost:3000
Features:
- Upload CSV files (up to 50MB)
- Real-time compression with Sempress
- Compare against GZIP, BZ2, LZMA, ZSTD
- Download .smp and reconstructed CSV
- Quality analysis with per-column metrics
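For readers who want a quick baseline before starting the web platform, the comparison against general-purpose codecs can be approximated with the Python standard library. The sketch below is an assumption about what such a comparison looks like, not the platform's code. It covers gzip, bz2, and lzma only; ZSTD is left out to keep the example standard-library only. The synthetic CSV and the function name `baseline_ratios` are illustrative.

```python
import bz2
import gzip
import lzma
import random


def baseline_ratios(raw: bytes) -> dict:
    """Compression ratio (original size / compressed size) per stdlib codec."""
    codecs = {
        "gzip": gzip.compress,
        "bz2": bz2.compress,
        "lzma": lzma.compress,
    }
    return {name: len(raw) / len(fn(raw)) for name, fn in codecs.items()}


# Synthetic stand-in for a numeric CSV; swap in open("data.csv", "rb").read().
random.seed(0)
rows = ["sensor_id,temperature,pressure"]
rows += [f"{i % 10},{20 + random.gauss(0, 2):.3f},{1013 + random.gauss(0, 5):.2f}"
         for i in range(5000)]
raw = "\n".join(rows).encode("utf-8")

ratios = baseline_ratios(raw)
for name, ratio in ratios.items():
    print(f"{name:>5}: {ratio:.2f}x")
```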
Deployment Options:
- Local (development)
- Railway/Render (production)
- Docker/Docker Compose
- See docs/deployment_guide.md for details
Usage Guide
Basic Compression
# Encode CSV to .smp format
sempress encode \
  --in data.csv \
  --out data.smp \
  --lock-cols user_id,timestamp \
  --k 64
Options:
- --lock-cols: Columns to preserve losslessly (comma-separated)
- --residual-cols: High-precision columns (store exact errors)
- --k: Codebook size (default: 64, range: 16-256)
- --uncert-thresh: Flag cells with >X relative error (default: 0.2)
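To make the uncertainty threshold concrete, the sketch below flags cells whose relative error exceeds a threshold. It assumes relative error means |x - x̂| / |x| and guards against division by zero with a small epsilon; how Sempress defines the error and handles zeros is not documented here, so treat both as assumptions. The function name `flag_uncertain` is hypothetical.

```python
def flag_uncertain(original, reconstructed, thresh=0.2, eps=1e-12):
    """Return one boolean per cell: True where relative error exceeds thresh."""
    flags = []
    for x, x_hat in zip(original, reconstructed):
        rel_err = abs(x - x_hat) / max(abs(x), eps)
        flags.append(rel_err > thresh)
    return flags


original = [10.0, 0.5, 100.0, 2.0]
reconstructed = [10.5, 0.8, 99.0, 2.1]

flags = flag_uncertain(original, reconstructed)
print(flags)  # only the second cell (0.5 -> 0.8, 60% off) is flagged
print(f"{100 * sum(flags) / len(flags):.0f}% of cells flagged")
```

Small values are flagged more readily than large ones under a relative threshold, which is worth keeping in mind for columns that cross zero.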
Decompression
# Decode .smp back to CSV
sempress decode \
  --in data.smp \
  --out data_reconstructed.csv
Quality Evaluation
# Compare original vs reconstructed
sempress eval \
  --original data.csv \
  --recon data_reconstructed.csv \
  --lock-cols user_id,timestamp
Metrics:
- Locked columns: Exact match rate (should be 100%)
- Numeric columns: RMSE, MAPE, KS-distance
- Uncertainty: % of cells flagged
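The three numeric metrics can be computed with a few lines of standard-library Python. The sketch below uses the textbook definitions (root mean squared error, mean absolute percentage error, and the two-sample Kolmogorov-Smirnov statistic). Whether `sempress eval` uses exactly these definitions, for instance how it treats zeros in MAPE, is an assumption.

```python
import math
from bisect import bisect_right


def rmse(orig, recon):
    """Root mean squared error between two equal-length columns."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(orig, recon)) / len(orig))


def mape(orig, recon):
    """Mean absolute percentage error; rows where the original is 0 are skipped."""
    terms = [abs((a - b) / a) for a, b in zip(orig, recon) if a != 0]
    if not terms:
        return float("nan")
    return 100.0 * sum(terms) / len(terms)


def ks_distance(orig, recon):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(orig), sorted(recon)
    gap = 0.0
    # The largest gap between two step functions occurs at a sample point.
    for v in xs + ys:
        cdf_x = bisect_right(xs, v) / len(xs)
        cdf_y = bisect_right(ys, v) / len(ys)
        gap = max(gap, abs(cdf_x - cdf_y))
    return gap


original = [1.0, 2.0, 3.0, 4.0]
reconstructed = [1.1, 1.9, 3.2, 3.8]

print(f"RMSE: {rmse(original, reconstructed):.4f}")
print(f"MAPE: {mape(original, reconstructed):.2f}%")
print(f"KS:   {ks_distance(original, reconstructed):.2f}")
```

RMSE and MAPE measure per-cell error, while the KS distance measures how far the reconstructed column's overall distribution has drifted, so a column can score well on one and poorly on the other.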
Python API
from sempress import encode_csv, decode_to_csv
from sempress.table_encoder import EncodeConfig
# Configure encoder
config = EncodeConfig(
    lock_cols=['user_id', 'timestamp'],
    residual_cols=['amount'],
    k=64,
    uncertainty_thresh=0.2
)
# Encode
compressed_blob = encode_csv('data.csv', config)
# Save to file
with open('data.smp', 'wb') as f:
    f.write(compressed_blob)
# Decode
decode_to_csv(compressed_blob, 'reconstructed.csv')
Benchmarking
Run comprehensive benchmarks on your data:
# Generate synthetic datasets
python scripts/generate_datasets.py --rows 100000
# Run benchmarks
python scripts/comprehensive_benchmark.py --out results.json
# Generate figures
python scripts/generate_figures.py
Included datasets:
- IoT Telemetry (sensor readings)
- ML Features (user behavior)
- Financial (stock market OHLC)
- Sensor Physics (accelerometer, magnetometer)
Integrations
Git LFS Plugin
Automatic compression for Git repositories - Perfect for ML teams!
- Repository: github.com/jalyper/git-lfs-sempress
- Features:
  - Zero workflow changes (works with git add/commit)
  - 8-12× compression on CSV files
  - Intelligent quality monitoring
  - 15 automated tests, all passing
- Installation:
  pip install git+https://github.com/jalyper/git-lfs-sempress.git
Use Cases:
- ML training datasets in Git repos
- Data science notebooks with large CSVs
- IoT data collection repositories
- Collaborative data projects
When to Use Sempress
Sempress Excels On:
- High numeric density (>60% numeric columns)
- IoT/sensor data (temperature, pressure, acceleration)
- ML feature stores (continuous features for training)
- Financial data (tick data, OHLC prices)
- Large datasets (>10K rows)
Use Gzip Instead For:
- Text-heavy tables (<50% numeric)
- Small tables (<5K rows)
- Real-time streaming (Sempress has higher encode overhead)
- High categorical cardinality
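The thresholds above can be turned into a rough pre-check. The sketch below is a hypothetical helper, not part of Sempress: it counts a column as numeric when every cell parses as a float (so numeric-looking IDs are counted too), and it returns "benchmark both" for tables that fall between the two lists.

```python
import csv
import io


def numeric_fraction(csv_text: str):
    """Return (fraction of columns that parse as float, number of data rows)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(reader)

    def is_numeric(col: int) -> bool:
        try:
            for row in rows:
                float(row[col])
            return True
        except (ValueError, IndexError):
            return False

    numeric_cols = sum(is_numeric(c) for c in range(len(header)))
    return numeric_cols / len(header), len(rows)


def recommend(frac: float, n_rows: int) -> str:
    """Apply the README's rules of thumb for numeric density and row count."""
    if frac > 0.60 and n_rows > 10_000:
        return "sempress"
    if frac < 0.50 or n_rows < 5_000:
        return "gzip"
    return "benchmark both"


# Synthetic table: one string column and three numeric columns, 12,000 rows.
lines = ["device,temperature,pressure,humidity"]
lines += [f"dev-{i % 5},{20 + i % 7},{1000 + i % 13},{40 + i % 11}"
          for i in range(12_000)]

frac, n_rows = numeric_fraction("\n".join(lines))
print(f"{frac:.0%} numeric columns, {n_rows} rows -> {recommend(frac, n_rows)}")
```

The check ignores the other two criteria in the list (streaming workloads and categorical cardinality), which still need a judgment call.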
Research Paper
Full paper: https://sempress.net/paper.pdf
Citation:
@article{sempress2025,
  title={Sempress: Semantic Compression for Numeric Tabular Data via Learned Vector Quantization},
  author={Anderson, Keaton},
  year={2025},
  note={Independent research with implementation assistance from AI coding agents},
  url={https://sempress.net}
}
Contributing
We welcome contributions! Areas for improvement:
- Streaming ingestion (chunked encoding for >100GB files)
- Learned entropy coding (autoregressive priors on index sequences)
- Time-series VQ (segment-wise codebooks for temporal data)
- Database integrations (PostgreSQL extension, ClickHouse codec)
- Text compression (LLM-based semantic tokens for mixed data)
How to contribute:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to branch (git push origin feature/amazing-feature)
- Open a Pull Request
Repository Structure
sempress/
├── src/sempress/                  # Core library
│   ├── table_encoder.py           # K-Means VQ encoder
│   ├── table_decoder.py           # Decoder with uncertainty
│   ├── container.py               # Msgpack + Zstd packaging
│   └── cli.py                     # Command-line interface
├── scripts/                       # Benchmarking & datasets
│   ├── generate_datasets.py
│   ├── comprehensive_benchmark.py
│   └── generate_figures.py
├── data/                          # Sample datasets
├── tests/                         # Unit tests
├── docs/                          # Documentation & paper
└── README.md                      # This file
Running Tests
# Install test dependencies
pip install pytest
# Run tests
pytest tests/
# With coverage
pytest --cov=sempress tests/
Reproducing Paper Results
# Generate datasets
python scripts/generate_datasets.py
# Run all benchmarks (takes ~10 minutes)
python scripts/comprehensive_benchmark.py
# Generate paper figures
python scripts/generate_figures.py
# Results saved to logs/ and docs/assets/
Performance Benchmarks
Encode time (100K rows):
- Telemetry: 5.83s
- ML Features: 11.20s
- Financial: 9.08s
Decode time (100K rows):
- Telemetry: 0.28s (1.47× gzip+parse)
- ML Features: 0.55s (1.28× gzip+parse)
- Financial: 0.28s (1.17× gzip+parse)
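The decode figures are stated relative to a gzip+parse baseline. The sketch below reads "1.47× gzip+parse" as "decode takes 1.47 times as long as gunzipping and parsing the same CSV", which is an interpretation rather than something the benchmark scripts confirm. It shows one way to time that baseline with the standard library and derives the baseline times implied by the reported numbers. The synthetic data and function names are illustrative.

```python
import csv
import gzip
import io
import random
import time


def gzip_parse_seconds(blob: bytes):
    """Time gunzip + CSV parse of one blob; return (seconds, row count)."""
    start = time.perf_counter()
    text = gzip.decompress(blob).decode("utf-8")
    rows = list(csv.reader(io.StringIO(text)))
    return time.perf_counter() - start, len(rows)


def implied_baseline(decode_seconds: float, overhead: float) -> float:
    """Baseline time implied by a decode time and its overhead factor."""
    return decode_seconds / overhead


# Synthetic telemetry-like CSV (hypothetical columns), 20,000 data rows.
random.seed(0)
lines = ["ts,temperature,pressure"]
lines += [f"{i},{20 + random.gauss(0, 2):.3f},{1013 + random.gauss(0, 5):.2f}"
          for i in range(20_000)]
blob = gzip.compress("\n".join(lines).encode("utf-8"))

seconds, n_rows = gzip_parse_seconds(blob)
print(f"gzip+parse on this machine: {seconds:.3f}s for {n_rows} rows")

# Baselines implied by the figures reported above (decode time / overhead).
for name, decode_s, factor in [("Telemetry", 0.28, 1.47),
                               ("ML Features", 0.55, 1.28),
                               ("Financial", 0.28, 1.17)]:
    print(f"{name}: ~{implied_baseline(decode_s, factor):.2f}s gzip+parse")
```

Timings vary with hardware and CSV parser, so a local baseline is only comparable to a local Sempress decode measured on the same machine and file.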
Memory usage:
- Peak during encode: 2-3ร original file size
- Peak during decode: 1.5-2ร original file size
Known Issues
- In-memory processing: Files must fit in RAM (working on streaming)
- Fixed k per column: No adaptive sizing yet
- CSV-only: Parquet/Arrow support coming soon
License
This project is licensed under the MIT License - see the LICENSE file for details.
Star History
If you find Sempress useful, please star the repository! ⭐
Contact
- Website: https://sempress.net
- Paper: https://sempress.net/paper.pdf
- Issues: GitHub Issues
- Email: research@sempress.net
Acknowledgments
Independent research (no external funding).
Built with: Python, pandas, numpy, scikit-learn, msgpack, zstandard
Made with ❤️ for the data compression community
File details
Details for the file sempress-0.3.0.tar.gz.
File metadata
- Download URL: sempress-0.3.0.tar.gz
- Upload date:
- Size: 57.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | afeaabc5b40f42e6b57aa33b7db95d9dca01b0ac8b76cb497888b720beb78d97 |
| MD5 | 36f7f6fa1d3ffaeaca1acde62b82fe4f |
| BLAKE2b-256 | 40a09500ca51d2495c23931dac8924c2de63a8485812d0a817d5f1171e98bc80 |
File details
Details for the file sempress-0.3.0-py3-none-any.whl.
File metadata
- Download URL: sempress-0.3.0-py3-none-any.whl
- Upload date:
- Size: 51.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e8d91837c02a43128f0d5f3e96aa10e5fa70397e3447c1742b4d18ab9204c1a9 |
| MD5 | e0c920175e43f69435e7a57dff2b703a |
| BLAKE2b-256 | c7f4aead565118a1caad1a22e9a50cb23372c4fd96125d1981a7c37f2a9001be |