Git LFS filter for semantic compression of CSV files
git-lfs-sempress
Automatic semantic compression for CSV files in Git repositories
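A Git LFS clean/smudge filter that compresses CSV files using sempress vector quantization, typically by 6-12x depending on the data (see Compression Results). Numeric columns are reconstructed approximately unless they are listed in residual_cols (see Quality Assurance). No workflow changes -- just git add and git commit as usual.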
Quick Start
# Install Git LFS first
git lfs install
# Install the plugin
pip install git-lfs-sempress
# Initialize in your repo
git lfs-sempress init
# Track CSV files
echo "*.csv filter=lfs-sempress diff=lfs merge=lfs -text" >> .gitattributes
# Use Git normally - compression happens automatically
git add data.csv
git commit -m "Add training data"
How It Works
- git add -- Sempress compresses the CSV to .smp format (clean filter)
- Git LFS -- stores the compressed blob
- git checkout -- Sempress decompresses the blob back to CSV (smudge filter)
- You see -- a regular CSV file in your working tree, reconstructed as described under Quality Assurance
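The clean/smudge round trip above can be sketched in Python. Git hands a clean filter the working-tree content on stdin and stores whatever the filter writes to stdout; a smudge filter does the reverse on checkout. This sketch assumes nothing about sempress internals: zlib stands in for the real codec, and the header bytes and function names are hypothetical.

```python
import zlib
from typing import BinaryIO

# Hypothetical marker for this demo only; it is NOT the real .smp format.
MAGIC = b"SMPDEMO1"


def clean(raw: bytes) -> bytes:
    """Working-tree bytes -> stored blob (the step that runs on git add)."""
    return MAGIC + zlib.compress(raw, 9)


def smudge(blob: bytes) -> bytes:
    """Stored blob -> working-tree bytes (the step that runs on git checkout)."""
    if not blob.startswith(MAGIC):
        # Content that clean() did not produce is passed through unchanged.
        return blob
    return zlib.decompress(blob[len(MAGIC):])


def run_filter(mode: str, src: BinaryIO, dst: BinaryIO) -> None:
    """Read everything from src, apply the chosen direction, write to dst."""
    if mode not in ("clean", "smudge"):
        raise ValueError(f"unknown filter mode: {mode!r}")
    data = src.read()
    dst.write(clean(data) if mode == "clean" else smudge(data))
```

To try it as a real filter, a small script could call run_filter(sys.argv[1], sys.stdin.buffer, sys.stdout.buffer) and be registered through the filter.<name>.clean and filter.<name>.smudge settings in git config. Unlike this zlib stand-in, the sempress codec is not byte-exact for every column.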
Compression Results
$ git add training_data.csv
[sempress] Compressed: 4.0MB -> 471KB (8.5x ratio)
Typical ratios on real data:
- IoT sensor data: 11.8x
- Financial OHLC: 8.5x
- ML feature vectors: 6-10x
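The ratio in the log line is the original size divided by the compressed size. A quick check of the arithmetic, assuming the log uses decimal units (1 MB = 1,000,000 bytes and 1 KB = 1,000 bytes); the helper name is hypothetical:

```python
def compression_ratio(original_bytes: int, compressed_bytes: int) -> float:
    """Return original size divided by compressed size."""
    if compressed_bytes <= 0:
        raise ValueError("compressed size must be positive")
    return original_bytes / compressed_bytes


# 4.0MB -> 471KB, as in the example log line (decimal units assumed).
ratio = compression_ratio(4_000_000, 471_000)
saved = 1 - 1 / ratio  # fraction of the original size that was removed
print(f"{ratio:.1f}x ratio, {saved:.0%} smaller")  # 8.5x ratio, 88% smaller
```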
Configuration
Create .sempress.yml in your repository root:
version: 1
compression:
  k: 64
  uncertainty_threshold: 0.2
  auto_lock: true
  lock_cols:
    - id
    - timestamp
  residual_cols:
    - amount
    - price
thresholds:
  min_size_mb: 1
  min_compression_ratio: 1.5
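One plausible reading of the thresholds block, sketched in Python: files smaller than min_size_mb are stored as-is, and a compressed result is kept only when it reaches min_compression_ratio. The source does not spell out these semantics, so the class, the function, and the decimal-megabyte convention are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Mirrors the thresholds keys from .sempress.yml (assumed meaning)."""
    min_size_mb: float = 1.0
    min_compression_ratio: float = 1.5


def should_keep_compressed(original_bytes: int, compressed_bytes: int,
                           t: Thresholds = Thresholds()) -> bool:
    """Decide whether to store the compressed blob instead of the raw file."""
    # Assumption: small files are not worth compressing at all.
    if original_bytes < t.min_size_mb * 1_000_000:
        return False
    if compressed_bytes <= 0:
        return False
    # Assumption: fall back to the raw file if the gain is too small.
    return original_bytes / compressed_bytes >= t.min_compression_ratio
```

With the defaults shown above, a 4,000,000-byte file compressed to 471,000 bytes would be kept, while a 500,000-byte file would be skipped regardless of how well it compresses.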
Commands
git lfs-sempress init # Set up filter in current repo
git lfs-sempress track "*.csv" # Add tracking pattern
git lfs-sempress analyze # Estimate savings for existing files
git lfs-sempress stats # Show compression stats for repo
git lfs-sempress quality a.csv b.csv # Compare original vs reconstructed
Quality Assurance
- String/ID columns: 100% exact match (automatically locked)
- Numeric columns: < 0.1% relative error by default
- Residual columns: bit-perfect reconstruction
If a column needs higher precision, add it to residual_cols in .sempress.yml.
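To see what the numeric bound means in practice, the sketch below computes the worst-case relative error for one column between an original CSV and its reconstruction. It assumes relative error is defined as |original - reconstructed| / |original|, which the source does not state explicitly; the function name and sample data are hypothetical, and this is not how the quality command is implemented.

```python
import csv
import io


def max_relative_error(original_csv: str, restored_csv: str, column: str) -> float:
    """Largest |a - b| / |a| over all rows of one numeric column."""
    orig_rows = list(csv.DictReader(io.StringIO(original_csv)))
    rest_rows = list(csv.DictReader(io.StringIO(restored_csv)))
    if len(orig_rows) != len(rest_rows):
        raise ValueError("row counts differ")
    worst = 0.0
    for o, r in zip(orig_rows, rest_rows):
        a, b = float(o[column]), float(r[column])
        if a == b:
            continue
        # Error is measured relative to the original value; a deviation
        # from an original value of 0 is treated as infinite error.
        err = abs(a - b) / abs(a) if a != 0 else float("inf")
        worst = max(worst, err)
    return worst


# Made-up sample data for illustration.
original = "id,price\nA1,100.0\nA2,250.0\n"
restored = "id,price\nA1,100.05\nA2,249.9\n"
err = max_relative_error(original, restored, "price")
print(f"max relative error: {err:.4%}")
print("within 0.1% bound:", err < 0.001)
```

A column that fails such a check is a candidate for residual_cols.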
Installation Notes
Windows: If git lfs-sempress isn't recognized, use:
python -m git_lfs_sempress.cli init
Links
- sempress library -- the underlying compression engine
- Research paper -- technical details
License
MIT License - see LICENSE
Download files
File details
Details for the file git_lfs_sempress-0.1.1.tar.gz.
File metadata
- Download URL: git_lfs_sempress-0.1.1.tar.gz
- Upload date:
- Size: 17.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 19f5935529248ca66431768b4f40ea4b305b263ebe2929f78676ad36341f3e93 |
| MD5 | f7128860a29a399491fc5c9fb10b6210 |
| BLAKE2b-256 | 4ccd41144d56dbc7d962e563d7c15d1678170e6ca28946ab3eaf697298b7a234 |
File details
Details for the file git_lfs_sempress-0.1.1-py3-none-any.whl.
File metadata
- Download URL: git_lfs_sempress-0.1.1-py3-none-any.whl
- Upload date:
- Size: 18.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 31cc34e32f271f78ad1657d3e28165b08521b3d4175f5ecfd4bccdb638c0dc78 |
| MD5 | 00f3d484a4dbe27c4f9b741689590d67 |
| BLAKE2b-256 | bfcad38a6348206ac20ff6b54b423f7c70bb715a0c1d91aabcc0d9ec1d3ca62f |