Git LFS filter for semantic compression of CSV files

git-lfs-sempress

Automatic semantic compression for CSV files in Git repositories

License: MIT | Python 3.10+

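A Git LFS clean/smudge filter that compresses CSV files using sempress vector quantization, typically by 6-12x on the data types listed under Compression Results. Compression is lossy for numeric columns that are neither locked nor residual (see Quality Assurance). No workflow changes are needed -- just git add and git commit as usual.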
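For readers unfamiliar with vector quantization, the idea can be sketched in a few lines of Python. This is a generic one-dimensional illustration, not sempress's actual algorithm: build_codebook, encode, and decode are hypothetical names, and the evenly spaced codebook is an assumption made for brevity.

```python
def build_codebook(values, k):
    """Return k evenly spaced levels spanning the data range (illustrative only)."""
    lo, hi = min(values), max(values)
    if k == 1 or lo == hi:
        return [lo]
    step = (hi - lo) / (k - 1)
    return [lo + i * step for i in range(k)]


def encode(values, codebook):
    """Replace each value with the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda i: abs(codebook[i] - v))
        for v in values
    ]


def decode(indices, codebook):
    """Look the indices back up; the result approximates the original values."""
    return [codebook[i] for i in indices]


readings = [20.1, 20.4, 21.0, 22.7, 23.3, 25.0]
codebook = build_codebook(readings, k=4)
indices = encode(readings, codebook)
approx = decode(indices, codebook)
```

Storing small integer indices plus a short codebook is what makes the output smaller. It is also why quantized values come back as approximations rather than exact copies.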

Quick Start

# Install Git LFS first
git lfs install

# Install the plugin
pip install git-lfs-sempress

# Initialize in your repo
git lfs-sempress init

# Track CSV files
echo "*.csv filter=lfs-sempress diff=lfs merge=lfs -text" >> .gitattributes

# Use Git normally - compression happens automatically
git add data.csv
git commit -m "Add training data"

How It Works

  1. git add -- Sempress compresses CSV to .smp format (clean filter)
  2. Git LFS -- stores the compressed blob
  3. git checkout -- Sempress decompresses back to CSV (smudge filter)
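  4. You see -- a reconstructed CSV file: locked and residual columns are exact, other numeric columns are close approximations (see Quality Assurance)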
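The clean/smudge round trip above can be sketched as a small Python script. Git runs a filter command with the file content on stdin and reads the result from stdout. Here zlib stands in for the sempress compressor purely for illustration, and the clean and smudge function names are assumptions, not the plugin's API.

```python
import sys
import zlib


def clean(raw: bytes) -> bytes:
    """Working tree -> stored blob. zlib is a stand-in for the real compressor."""
    return zlib.compress(raw)


def smudge(stored: bytes) -> bytes:
    """Stored blob -> working tree."""
    return zlib.decompress(stored)


def main(mode: str) -> None:
    # Git pipes file content to the filter on stdin and reads the result from stdout.
    data = sys.stdin.buffer.read()
    out = clean(data) if mode == "clean" else smudge(data)
    sys.stdout.buffer.write(out)


# Only act as a filter when invoked as "script.py clean" or "script.py smudge".
if __name__ == "__main__" and sys.argv[1:] in (["clean"], ["smudge"]):
    main(sys.argv[1])
```

Unlike this zlib stand-in, which is lossless, the real filter reconstructs some numeric columns approximately, so a smudge after a clean is not guaranteed to be byte-identical to the original file.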

Compression Results

$ git add training_data.csv
[sempress] Compressed: 4.0MB -> 471KB (8.5x ratio)

Typical ratios on real data:

  • IoT sensor data: 11.8x
  • Financial OHLC: 8.5x
  • ML feature vectors: 6-10x
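The ratio reported in the log line is the original size divided by the compressed size. A quick check of the sample output in Python, assuming decimal units (1 MB = 1000 KB) since the log does not say which convention it uses:

```python
def compression_ratio(original_bytes: int, compressed_bytes: int) -> float:
    """Original size divided by compressed size; higher is better."""
    if compressed_bytes <= 0:
        raise ValueError("compressed size must be positive")
    return original_bytes / compressed_bytes


# Sizes from the sample output above, read as decimal units (an assumption).
ratio = compression_ratio(4_000_000, 471_000)
print(f"{ratio:.1f}x")
```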

Configuration

Create .sempress.yml in your repository root:

version: 1

compression:
  k: 64
  uncertainty_threshold: 0.2
  auto_lock: true
  lock_cols:
    - id
    - timestamp
  residual_cols:
    - amount
    - price

thresholds:
  min_size_mb: 1
  min_compression_ratio: 1.5
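The README does not spell out how the thresholds block is applied. One plausible reading, sketched below in Python, is that files smaller than min_size_mb are left alone and that compressed output is kept only if it reaches min_compression_ratio. The function name should_store_compressed and both interpretations are assumptions, not documented behavior.

```python
def should_store_compressed(original_bytes, compressed_bytes,
                            min_size_mb=1.0, min_compression_ratio=1.5):
    """Hypothetical gate: compress only files that are big enough and shrink enough."""
    if compressed_bytes <= 0:
        return False
    if original_bytes < min_size_mb * 1_000_000:
        return False  # assumed: small files are not worth compressing
    # assumed: fall back to the uncompressed file when the gain is too small
    return original_bytes / compressed_bytes >= min_compression_ratio
```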

Commands

git lfs-sempress init              # Set up filter in current repo
git lfs-sempress track "*.csv"     # Add tracking pattern
git lfs-sempress analyze           # Estimate savings for existing files
git lfs-sempress stats             # Show compression stats for repo
git lfs-sempress quality a.csv b.csv  # Compare original vs reconstructed

Quality Assurance

  • String/ID columns: 100% exact match (automatically locked)
  • Numeric columns: < 0.1% relative error by default
  • Residual columns: bit-perfect reconstruction

If a column needs higher precision, add it to residual_cols in .sempress.yml.
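To decide whether a column needs to move to residual_cols, it helps to measure its error directly. The following Python sketch computes the worst relative error for one numeric column. It is not the implementation behind git lfs-sempress quality; max_relative_error is a hypothetical helper, and it takes CSV text rather than file paths to stay self-contained.

```python
import csv
import io


def max_relative_error(original_csv: str, restored_csv: str, column: str) -> float:
    """Largest |restored - original| / |original| over one numeric column."""
    orig_rows = list(csv.DictReader(io.StringIO(original_csv)))
    rest_rows = list(csv.DictReader(io.StringIO(restored_csv)))
    if len(orig_rows) != len(rest_rows):
        raise ValueError("row counts differ")
    worst = 0.0
    for o, r in zip(orig_rows, rest_rows):
        a, b = float(o[column]), float(r[column])
        if a == 0.0:
            # Relative error is undefined at zero; require an exact match instead.
            err = 0.0 if b == 0.0 else float("inf")
        else:
            err = abs(b - a) / abs(a)
        worst = max(worst, err)
    return worst


# Made-up sample data for illustration only.
original = "id,price\n1,100.0\n2,250.0\n"
restored = "id,price\n1,100.05\n2,249.9\n"
worst = max_relative_error(original, restored, "price")
```

A result below 0.001 corresponds to the 0.1% default quoted above; a column that exceeds what your application tolerates is a candidate for residual_cols.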

Installation Notes

Windows: If git lfs-sempress isn't recognized, use:

python -m git_lfs_sempress.cli init

License

MIT License - see LICENSE
