Semantic compression for tabular data and images using vector quantization

Sempress

Semantic Compression API - Reduce Cloud Storage Costs by 90%

Sempress is a compression API service that achieves 5-15× better compression than gzip on numeric-heavy datasets through learned vector quantization. Perfect for IoT telemetry, ML feature stores, and financial data.

Proven Results: 15.72× compression ratio on real IoT data (vs gzip's 2.48×) = 533% improvement

🚀 Quick Start

Python Client (Recommended)

# Install
pip install sempress-client

# Compress
from sempress import SempressClient

client = SempressClient(api_key="sk_live_...")
result = client.compress_file("data.csv")

print(f"Compression: {result.ratio}×")
print(f"Saved: {result.space_saved_pct}%")
print(f"AWS Cost Savings: ${result.monthly_savings}/mo")

# Download compressed file
result.save("data.smp")

REST API (Any Language)

# Compress
curl -X POST https://api.sempress.net/v1/compress \
  -H "Authorization: Bearer sk_live_..." \
  -F "file=@data.csv"

# Response
{
  "job_id": "job_abc123",
  "compression_ratio": 12.5,
  "space_saved_pct": 92.0,
  "aws_savings_monthly": 45.50
}
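
A minimal sketch of consuming that response from Python, assuming the body is exactly the JSON shown above. `CompressionJob` and `parse_response` are illustrative names for this example only; they are not part of any Sempress client.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CompressionJob:
    """Typed view of the response fields shown above (hypothetical helper)."""
    job_id: str
    compression_ratio: float
    space_saved_pct: float
    aws_savings_monthly: float


def parse_response(body: str) -> CompressionJob:
    """Parse the JSON body; raises KeyError if an expected field is missing."""
    data = json.loads(body)
    return CompressionJob(
        job_id=data["job_id"],
        compression_ratio=float(data["compression_ratio"]),
        space_saved_pct=float(data["space_saved_pct"]),
        aws_savings_monthly=float(data["aws_savings_monthly"]),
    )


# The example body copied from the response above.
EXAMPLE_BODY = """
{
  "job_id": "job_abc123",
  "compression_ratio": 12.5,
  "space_saved_pct": 92.0,
  "aws_savings_monthly": 45.50
}
"""

job = parse_response(EXAMPLE_BODY)
print(job.job_id, job.compression_ratio, job.space_saved_pct)
```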

💰 Pricing

Free Tier

100 MB/month compression
Full API access
Web interface
Community support

Pro Tier - $29/month

10 GB/month (100× more)
Priority processing (2× faster)
Email support
Advanced features

Enterprise - Custom Pricing

Unlimited usage
S3 direct integration
On-premise deployment
24/7 support

→ Sign Up Free

📊 Performance

Tested on 10,000 rows of IoT data (1.4 MB):

Metric	Sempress	gzip	Improvement
Compression Ratio	15.72×	2.48×	+533%
Final Size	93 KB	603 KB	84% smaller
Space Saved	93.64%	59.73%	+33.9 points (+57% relative)
Data Fidelity	97.5% (configurable)	100% (lossless)	N/A
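
The table's derived figures follow from the two compression ratios. The sketch below shows the arithmetic; the function names are made up for this example. Because it starts from the rounded ratios rather than exact byte counts, the results can differ from the table in the last digit.

```python
def space_saved_pct(ratio: float) -> float:
    """Percent of the original size removed at a given compression ratio."""
    return (1.0 - 1.0 / ratio) * 100.0


def relative_improvement_pct(ratio: float, baseline_ratio: float) -> float:
    """How much larger `ratio` is than `baseline_ratio`, in percent."""
    return (ratio / baseline_ratio - 1.0) * 100.0


sempress_ratio = 15.72  # from the table above
gzip_ratio = 2.48       # from the table above

print(f"Sempress space saved: {space_saved_pct(sempress_ratio):.2f}%")
print(f"gzip space saved:     {space_saved_pct(gzip_ratio):.2f}%")
print(f"Ratio improvement:    {relative_improvement_pct(sempress_ratio, gzip_ratio):.1f}%")
```

With these inputs the Sempress figure comes out at 93.64%, the gzip figure at about 59.7%, and the ratio improvement at just under 534%, which the table reports as +533%.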

🎯 Use Cases

IoT & Telemetry

Compress sensor data streams by 90%+. Perfect for:

Industrial IoT monitoring
Smart city deployments
Fleet management systems

ML Feature Stores

Reduce S3 costs for training data:

High-dimensional feature vectors
Time-series embeddings
Model training datasets

Financial Data

Archive tick data with precision:

High-frequency trading data
Market microstructure
Historical financial time series

🔧 Core Library (Open Source)

This repository contains sempress-core, the open-source compression library.

Installation

pip install git+https://github.com/jalyper/sempress-core.git

CLI Usage

# Compress
sempress encode --in data.csv --out data.smp --k 64

# Decompress  
sempress decode --in data.smp --out restored.csv

Python API

from sempress import encode_csv, decode_to_csv
from sempress.table_encoder import EncodeConfig

# Compress
config = EncodeConfig(
    lock_cols=["id", "timestamp"],  # Lossless
    residual_cols=["price"],         # Residuals stored (near-zero error)
    k=64                             # Codebook size
)
compressed = encode_csv("data.csv", config)

# Save
with open("data.smp", "wb") as f:
    f.write(compressed)

# Decompress
decode_to_csv(compressed, "restored.csv")

🌐 Sempress.net Service

Live Service: https://sempress.net

The commercial Sempress service provides:

✅ REST API for any language
✅ Python, JavaScript, Go clients
✅ Web interface with analytics
✅ Job tracking & metrics
✅ Usage-based pricing
✅ Enterprise features (S3 integration, batch processing)

Service Code: See /vercel-deploy/ for the production deployment

📖 Documentation

Customer Lifecycle Strategy - Complete product roadmap
Deployment Guide - Launch & scaling plan
Research Paper - Technical details
Image Compression - Image compression features (experimental)

🛠️ Development

For Commercial Service (sempress.net)

See /vercel-deploy/ directory for:

Production website code
API backend implementation
Authentication system
Payment integration

For Core Library Development

# Clone
git clone https://github.com/jalyper/sempress-core.git
cd sempress-core

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Run benchmarks
python scripts/run_benchmarks.py

🤝 Contributing

We welcome contributions to the core library!

Fork the repository
Create a feature branch
Make your changes
Add tests
Submit a pull request

See CONTRIBUTING.md for details.

📄 Research Paper

Sempress: Semantic Compression for Tabular Data via Learned Vector Quantization

PDF: sempress.net/paper.pdf
Published: January 2025
Version: 0.2.0
Author: Keaton Anderson
License: MIT (Open Source)

🗺️ Roadmap

Core Library (Open Source)

CSV compression with vector quantization
CLI tool
Python API
Git LFS plugin
Image compression (experimental)
Parquet support
Arrow support
Streaming compression

Commercial Service (sempress.net)

REST API with authentication
Web interface with analytics
Free & Pro tiers
Python client library (sempress-client)
JavaScript client library
S3 direct integration
Batch processing API
Enterprise on-premise deployment

See CUSTOMER_LIFECYCLE_STRATEGY.md for detailed roadmap.

📧 Contact

Website: sempress.net
Email: hello@sempress.net
GitHub: @jalyper
Issues: GitHub Issues

📜 License

MIT License - See LICENSE for details.

Note: The core compression library is open source. The commercial API service at sempress.net is a hosted offering with additional features.

🙏 Citation

If you use Sempress in your research, please cite:

@software{sempress2025,
  title={Sempress: Semantic Compression for Tabular Data},
  author={Anderson, Keaton},
  year={2025},
  url={https://sempress.net}
}

Built with ❤️ for the data science community

💡 Key Features

Semantic Compression: Learns column-wise patterns using K-Means vector quantization
Lossless Locked Columns: Automatically preserves strings, categoricals, and IDs with 100% fidelity
Optional Residuals: Achieve near-zero error on precision-critical columns (financial, scientific)
Uncertainty Tracking: Flags cells with high quantization error for quality monitoring
Fast Decode: Competitive with gzip+CSV parse (0.9-1.5× overhead)
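
To illustrate the optional residuals feature, here is a minimal sketch that assumes a residual is simply the original value minus its nearest centroid, stored alongside the index. The function names and the three-entry codebook are invented for the example and are not the library's internals.

```python
import math


def quantize_with_residual(value: float, codebook: list[float]) -> tuple[int, float]:
    """Return (index of the nearest centroid, residual = value - centroid)."""
    index = min(range(len(codebook)), key=lambda i: abs(value - codebook[i]))
    return index, value - codebook[index]


def reconstruct(index: int, codebook: list[float], residual: float = 0.0) -> float:
    """Centroid lookup, plus the residual when one was stored."""
    return codebook[index] + residual


codebook = [100.0, 101.5, 103.0]  # made-up centroids for a price column
price = 101.37

index, residual = quantize_with_residual(price, codebook)
lossy = reconstruct(index, codebook)               # index only
restored = reconstruct(index, codebook, residual)  # index + residual

print(f"lossy={lossy}, restored={restored}, "
      f"close={math.isclose(restored, price, rel_tol=1e-12)}")
```

Without the residual the value snaps to the centroid. With it, the original is recovered up to floating-point rounding, which is why the feature is described as near-zero error.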

📖 How It Works

Sempress applies per-column K-Means vector quantization to numeric data:

Column Analysis: Auto-detects numeric vs categorical columns
Learn Codebooks: K-Means learns k centroids per numeric column (k=64 by default)
Encode to Indices: Replace values with nearest centroid index (uint16)
Add Residuals (optional): Store exact errors for high-precision columns
Package: Msgpack + Zstd container with schema and metadata

Result: Exploit semantic patterns in numeric data instead of treating tables as byte streams.
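
The codebook and index steps above can be sketched for a single numeric column as follows. This is a dependency-free illustration that swaps in a hand-written 1-D Lloyd's K-Means with quantile initialisation. It is not the library's encoder, and all names here are assumptions made for the example.

```python
import random


def nearest(value: float, centroids: list[float]) -> int:
    """Index of the centroid closest to `value`."""
    return min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))


def learn_codebook(values: list[float], k: int, iters: int = 20) -> list[float]:
    """1-D Lloyd's K-Means with quantile initialisation; returns k centroids."""
    ordered = sorted(values)
    n = len(ordered)
    # Start from evenly spaced quantiles so every dense region gets a centroid.
    centroids = [ordered[min(n - 1, int((i + 0.5) * n / k))] for i in range(k)]
    for _ in range(iters):
        buckets: list[list[float]] = [[] for _ in range(k)]
        for v in values:
            buckets[nearest(v, centroids)].append(v)
        # Move each centroid to its bucket mean; keep it if the bucket is empty.
        centroids = [sum(b) / len(b) if b else c for b, c in zip(buckets, centroids)]
    return centroids


def encode(values: list[float], codebook: list[float]) -> list[int]:
    """Replace each value with the index of its nearest centroid."""
    return [nearest(v, codebook) for v in values]


def decode(indices: list[int], codebook: list[float]) -> list[float]:
    """Look each index back up in the codebook (lossy reconstruction)."""
    return [codebook[i] for i in indices]


# Synthetic "sensor" column: readings clustered around three operating points.
rng = random.Random(42)
column = [rng.gauss(mu, 0.2) for mu in (20.0, 25.0, 30.0) for _ in range(200)]

codebook = learn_codebook(column, k=8)
indices = encode(column, codebook)
restored = decode(indices, codebook)

max_err = max(abs(a - b) for a, b in zip(column, restored))
print(f"{len(codebook)} centroids, max abs error {max_err:.3f}")
```

Each value is reduced to a small integer index plus a shared codebook, so the column becomes highly repetitive before it reaches the general-purpose compressor.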

🛠️ Installation

Requirements

Python 3.10+
pandas, numpy, scikit-learn, msgpack, zstandard

Install from Source

git clone https://github.com/jalyper/sempress.git
cd sempress
pip install -e .

Dependencies

pip install pandas numpy scikit-learn msgpack zstandard

Option 3: Interactive Web Platform

Run the full web platform locally with file upload, baseline comparisons, and downloads:

# Clone and start
git clone https://github.com/jalyper/sempress.git
cd sempress
./scripts/start_dev_server.sh

Then open: http://localhost:3000

Features:

📤 Upload CSV files (up to 50MB)
📊 Real-time compression with Sempress
⚖️ Compare against GZIP, BZ2, LZMA, ZSTD
💾 Download .smp and reconstructed CSV
📈 Quality analysis with per-column metrics

Deployment Options:

Local (development)
Railway/Render (production)
Docker/Docker Compose
See docs/deployment_guide.md for details

📚 Usage Guide

Basic Compression

# Encode CSV to .smp format
sempress encode \
  --in data.csv \
  --out data.smp \
  --lock-cols user_id,timestamp \
  --k 64

Options:

--lock-cols: Columns to preserve losslessly (comma-separated)
--residual-cols: High-precision columns (store exact errors)
--k: Codebook size (default: 64, range: 16-256)
--uncert-thresh: Flag cells with >X relative error (default: 0.2)
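
As an illustration of what the uncertainty threshold means, the sketch below flags cells whose relative reconstruction error exceeds the threshold. It assumes relative error is |original - reconstructed| / |original|, with a small floor to avoid dividing by zero. The CLI may define it differently, and `flag_uncertain` is a hypothetical helper.

```python
def flag_uncertain(
    original: list[float],
    reconstructed: list[float],
    thresh: float = 0.2,
    eps: float = 1e-12,
) -> list[bool]:
    """True where |orig - recon| / max(|orig|, eps) exceeds `thresh`."""
    return [
        abs(o - r) / max(abs(o), eps) > thresh
        for o, r in zip(original, reconstructed)
    ]


original = [10.0, 0.5, 200.0, 3.0]
reconstructed = [10.5, 0.8, 198.0, 3.0]

flags = flag_uncertain(original, reconstructed)
flagged_pct = 100.0 * sum(flags) / len(flags)
print(flags, f"{flagged_pct:.0f}% flagged")
```

Only the second cell is flagged here: an absolute error of 0.3 on a value of 0.5 is a 60% relative error, while the larger absolute error on 200.0 is only 1%.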

Decompression

# Decode .smp back to CSV
sempress decode \
  --in data.smp \
  --out data_reconstructed.csv

Quality Evaluation

# Compare original vs reconstructed
sempress eval \
  --original data.csv \
  --recon data_reconstructed.csv \
  --lock-cols user_id,timestamp

Metrics:

Locked columns: Exact match rate (should be 100%)
Numeric columns: RMSE, MAPE, KS-distance
Uncertainty: % of cells flagged
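
For reference, the numeric metrics can be computed as below using their standard textbook definitions. Whether `sempress eval` uses exactly these formulas (for example, how it treats zeros in MAPE) is an assumption.

```python
import math
from bisect import bisect_right


def rmse(original: list[float], reconstructed: list[float]) -> float:
    """Root mean squared error."""
    total = sum((o - r) ** 2 for o, r in zip(original, reconstructed))
    return math.sqrt(total / len(original))


def mape(original: list[float], reconstructed: list[float]) -> float:
    """Mean absolute percentage error, skipping cells whose original is zero."""
    terms = [abs(o - r) / abs(o) for o, r in zip(original, reconstructed) if o != 0]
    return 100.0 * sum(terms) / len(terms) if terms else 0.0


def ks_distance(a: list[float], b: list[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    return max(
        abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))
        for x in sa + sb
    )


original = [1.0, 2.0, 3.0, 4.0]
reconstructed = [1.0, 2.5, 2.5, 4.0]

print(f"RMSE={rmse(original, reconstructed):.4f}")
print(f"MAPE={mape(original, reconstructed):.2f}%")
print(f"KS={ks_distance(original, reconstructed):.2f}")
```

RMSE and MAPE measure per-cell error, while the KS distance compares the overall value distributions, so a column can score well on one and poorly on the other.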

🐍 Python API

from sempress import encode_csv, decode_to_csv
from sempress.table_encoder import EncodeConfig

# Configure encoder
config = EncodeConfig(
    lock_cols=['user_id', 'timestamp'],
    residual_cols=['amount'],
    k=64,
    uncertainty_thresh=0.2
)

# Encode
compressed_blob = encode_csv('data.csv', config)

# Save to file
with open('data.smp', 'wb') as f:
    f.write(compressed_blob)

# Decode
decode_to_csv(compressed_blob, 'reconstructed.csv')

📊 Benchmarking

Run comprehensive benchmarks on your data:

# Generate synthetic datasets
python scripts/generate_datasets.py --rows 100000

# Run benchmarks
python scripts/comprehensive_benchmark.py --out results.json

# Generate figures
python scripts/generate_figures.py

Included datasets:

IoT Telemetry (sensor readings)
ML Features (user behavior)
Financial (stock market OHLC)
Sensor Physics (accelerometer, magnetometer)

🔗 Integrations

Git LFS Plugin

Automatic compression for Git repositories - Perfect for ML teams!

Repository: github.com/jalyper/git-lfs-sempress
Installation: pip install git+https://github.com/jalyper/git-lfs-sempress.git

Features:

Zero workflow changes (works with git add/commit)
8-12× compression on CSV files
Intelligent quality monitoring
15 automated tests, all passing

Use Cases:

ML training datasets in Git repos
Data science notebooks with large CSVs
IoT data collection repositories
Collaborative data projects

🎯 When to Use Sempress

✅ Sempress Excels On:

High numeric density (>60% numeric columns)
IoT/sensor data (temperature, pressure, acceleration)
ML feature stores (continuous features for training)
Financial data (tick data, OHLC prices)
Large datasets (>10K rows)

⚠️ Use Gzip Instead For:

Text-heavy tables (<50% numeric)
Small tables (<5K rows)
Real-time streaming (Sempress has higher encode overhead)
High categorical cardinality
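
The row-count and numeric-density thresholds above can be turned into a rough pre-check. The sketch below is a heuristic built only on those two numbers. `profile_csv` and `suggest_codec` are hypothetical helpers, and treating a column as numeric when every cell parses as a float is an assumption.

```python
import csv
import io


def is_number(text: str) -> bool:
    """True if the cell parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def profile_csv(handle) -> tuple[int, float]:
    """Return (row count, fraction of columns whose cells all parse as floats)."""
    reader = csv.reader(handle)
    header = next(reader)
    numeric = [True] * len(header)
    rows = 0
    for row in reader:
        rows += 1
        for i, cell in enumerate(row[: len(header)]):
            if numeric[i] and not is_number(cell):
                numeric[i] = False
    return rows, sum(numeric) / len(header)


def suggest_codec(rows: int, numeric_fraction: float) -> str:
    """Apply the thresholds listed above; anything in between is left undecided."""
    if rows < 5_000 or numeric_fraction < 0.5:
        return "gzip"
    if rows > 10_000 and numeric_fraction > 0.6:
        return "sempress"
    return "benchmark both"


# Synthetic table: three numeric columns and one text column, 12,000 rows.
sample = io.StringIO(
    "id,temp,pressure,label\n"
    + "".join(f"{i},{20 + i % 5}.5,{1000 + i % 7},ok\n" for i in range(12_000))
)
rows, numeric_fraction = profile_csv(sample)
print(rows, numeric_fraction, suggest_codec(rows, numeric_fraction))
```

The check ignores categorical cardinality and streaming requirements, so it is only a first filter before running a real benchmark.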

📄 Research Paper

Full paper: https://sempress.net/paper.pdf

Citation:

@article{sempress2025,
  title={Sempress: Semantic Compression for Numeric Tabular Data via Learned Vector Quantization},
  author={Anderson, Keaton},
  year={2025},
  note={Independent research with implementation assistance from AI coding agents},
  url={https://sempress.net}
}

🤝 Contributing

We welcome contributions! Areas for improvement:

Streaming ingestion (chunked encoding for >100GB files)
Learned entropy coding (autoregressive priors on index sequences)
Time-series VQ (segment-wise codebooks for temporal data)
Database integrations (PostgreSQL extension, ClickHouse codec)
Text compression (LLM-based semantic tokens for mixed data)

How to contribute:

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to branch (git push origin feature/amazing-feature)
Open a Pull Request

📁 Repository Structure

sempress/
├── src/sempress/           # Core library
│   ├── table_encoder.py    # K-Means VQ encoder
│   ├── table_decoder.py    # Decoder with uncertainty
│   ├── container.py        # Msgpack + Zstd packaging
│   └── cli.py              # Command-line interface
├── scripts/                # Benchmarking & datasets
│   ├── generate_datasets.py
│   ├── comprehensive_benchmark.py
│   └── generate_figures.py
├── data/                   # Sample datasets
├── tests/                  # Unit tests
├── docs/                   # Documentation & paper
└── README.md               # This file

🧪 Running Tests

# Install test dependencies
pip install pytest

# Run tests
pytest tests/

# With coverage
pytest --cov=sempress tests/

📊 Reproducing Paper Results

# Generate datasets
python scripts/generate_datasets.py

# Run all benchmarks (takes ~10 minutes)
python scripts/comprehensive_benchmark.py

# Generate paper figures
python scripts/generate_figures.py

# Results saved to logs/ and docs/assets/

📈 Performance Benchmarks

Encode time (100K rows):

Telemetry: 5.83s
ML Features: 11.20s
Financial: 9.08s

Decode time (100K rows):

Telemetry: 0.28s (1.47× gzip+parse)
ML Features: 0.55s (1.28× gzip+parse)
Financial: 0.28s (1.17× gzip+parse)

Memory usage:

Peak during encode: 2-3× original file size
Peak during decode: 1.5-2× original file size
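
Since files are processed in memory, the multipliers above give a quick way to size a machine. This is a back-of-the-envelope sketch; `peak_memory_estimate` is a hypothetical helper and the 500 MiB input is an arbitrary example.

```python
def peak_memory_estimate(file_bytes: int) -> dict[str, tuple[int, int]]:
    """(low, high) peak-memory estimates in bytes, from the multipliers above."""
    return {
        "encode": (2 * file_bytes, 3 * file_bytes),
        "decode": (int(1.5 * file_bytes), 2 * file_bytes),
    }


size = 500 * 1024 * 1024  # a hypothetical 500 MiB CSV
estimate = peak_memory_estimate(size)

for phase, (low, high) in estimate.items():
    print(f"{phase}: {low / 2**30:.2f}-{high / 2**30:.2f} GiB")
```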

🐛 Known Issues

In-memory processing: Files must fit in RAM (streaming support is in progress)
Fixed k per column: No adaptive sizing yet
CSV-only: Parquet/Arrow support coming soon

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

🌟 Star History

If you find Sempress useful, please star the repository! ⭐

📞 Contact

Website: https://sempress.net
Paper: https://sempress.net/paper.pdf
Issues: GitHub Issues
Email: research@sempress.net

🙏 Acknowledgments

Independent research (no external funding).

Built with: Python, pandas, numpy, scikit-learn, msgpack, zstandard

Made with ❤️ for the data compression community
