Git LFS filter for semantic compression of CSV files

git-lfs-sempress

Automatic semantic compression for CSV files in Git repositories

License: MIT. Requires Python 3.10+.

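A Git LFS clean/smudge filter that compresses CSV files using sempress vector quantization, typically 6-12x on the data sets listed under Compression Results. Compression is lossy for numeric columns that are neither locked nor listed as residual columns (see Quality Assurance). Zero workflow changes -- just git add and git commit as usual.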

Quick Start

# Set up Git LFS hooks (the Git LFS binary itself must already be installed)
git lfs install

# Install the plugin
pip install git-lfs-sempress

# Initialize in your repo
git lfs-sempress init

# Track CSV files
echo "*.csv filter=lfs-sempress diff=lfs merge=lfs -text" >> .gitattributes

# Use Git normally - compression happens automatically
git add data.csv
git commit -m "Add training data"

How It Works

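  1. git add -- the clean filter runs Sempress to compress the CSV to .smp format
  2. Git LFS -- stores the compressed blob
  3. git checkout -- the smudge filter runs Sempress to decompress the blob back to CSV
  4. You see -- a regular CSV file in your working tree; locked and residual columns are exact, other numeric columns stay within the error bound given under Quality Assurance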
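
The clean/smudge round trip above can be sketched in a few lines of Python. This is a minimal illustration, not the plugin's implementation: zlib stands in for the sempress codec (zlib is lossless, sempress is not for every column), and the function names clean and smudge are hypothetical. Git itself runs filter commands by piping file content through their stdin and stdout; the sketch uses plain functions to keep it short.

```python
import zlib

# Stand-in codec: zlib is NOT the sempress algorithm. It only shows the
# shape of a clean/smudge pair (bytes in, bytes out, in both directions).


def clean(working_tree_bytes: bytes) -> bytes:
    """Runs on `git add`: working-tree content in, stored content out."""
    return zlib.compress(working_tree_bytes)


def smudge(stored_bytes: bytes) -> bytes:
    """Runs on `git checkout`: stored content in, working-tree content out."""
    return zlib.decompress(stored_bytes)


# Build a small synthetic CSV (made-up values, for illustration only).
csv_bytes = b"id,price\n" + b"".join(
    b"%d,%d.50\n" % (i, 100 + i % 7) for i in range(1000)
)

stored = clean(csv_bytes)    # what would be handed to Git LFS
restored = smudge(stored)    # what would land in the working tree

print(len(csv_bytes), "->", len(stored), "bytes")
```

With a lossless stand-in, restored equals csv_bytes byte for byte. With a lossy codec such as the one described here, that equality would hold only for the columns that are locked or stored as residuals.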

Compression Results

$ git add training_data.csv
[sempress] Compressed: 4.0MB -> 471KB (8.5x ratio)

Typical ratios on real data:

  • IoT sensor data: 11.8x
  • Financial OHLC: 8.5x
  • ML feature vectors: 6-10x
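
For reference, the ratio in the log line above is simply the original size divided by the compressed size. The short sketch below reproduces the 8.5x figure, assuming decimal units (1 MB = 1,000,000 bytes, 1 KB = 1,000 bytes); the log does not say which unit convention it uses, and compression_ratio is a hypothetical helper, not part of the plugin.

```python
def compression_ratio(original_bytes: int, compressed_bytes: int) -> float:
    """Return original size divided by compressed size."""
    if compressed_bytes <= 0:
        raise ValueError("compressed size must be positive")
    return original_bytes / compressed_bytes


# Sizes from the example log line, under the decimal-unit assumption.
ratio = compression_ratio(4_000_000, 471_000)
print(f"{ratio:.1f}x")  # prints "8.5x"
```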

Configuration

Create .sempress.yml in your repository root:

version: 1

compression:
  k: 64
  uncertainty_threshold: 0.2
  auto_lock: true
  lock_cols:
    - id
    - timestamp
  residual_cols:
    - amount
    - price

thresholds:
  min_size_mb: 1
  min_compression_ratio: 1.5
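
One plausible reading of the thresholds section is sketched below. The semantics are an assumption, not documented behavior: the sketch treats min_size_mb as "skip files smaller than this" and min_compression_ratio as "skip files that would not shrink by at least this factor". The function should_compress and the decimal megabyte constant are hypothetical.

```python
MB = 1_000_000  # assumption: decimal megabytes


def should_compress(
    size_bytes: int,
    estimated_ratio: float,
    min_size_mb: float = 1.0,
    min_compression_ratio: float = 1.5,
) -> bool:
    """Assumed gating logic for the `thresholds` keys in .sempress.yml."""
    if size_bytes < min_size_mb * MB:
        return False  # too small to be worth compressing
    return estimated_ratio >= min_compression_ratio


print(should_compress(4_000_000, 8.5))  # large file, good ratio
print(should_compress(200_000, 8.5))    # below the size threshold
print(should_compress(4_000_000, 1.2))  # ratio too low
```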

Commands

git lfs-sempress init              # Set up filter in current repo
git lfs-sempress track "*.csv"     # Add tracking pattern
git lfs-sempress analyze           # Estimate savings for existing files
git lfs-sempress stats             # Show compression stats for repo
git lfs-sempress quality a.csv b.csv  # Compare original vs reconstructed

Quality Assurance

  • String/ID columns: 100% exact match (automatically locked)
  • Numeric columns: < 0.1% relative error by default
  • Residual columns: bit-perfect reconstruction

If a column needs higher precision, add it to residual_cols in .sempress.yml.
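
To decide whether a column belongs in residual_cols, it can help to measure the relative error yourself. The sketch below is an independent check written with the standard library only; it is not the implementation of git lfs-sempress quality, and the helper name max_relative_error and the sample values are made up for illustration.

```python
import csv
import io


def max_relative_error(original_csv: str, reconstructed_csv: str, column: str) -> float:
    """Largest |reconstructed - original| / |original| over one numeric column."""
    orig_rows = list(csv.DictReader(io.StringIO(original_csv)))
    recon_rows = list(csv.DictReader(io.StringIO(reconstructed_csv)))
    if len(orig_rows) != len(recon_rows):
        raise ValueError("row counts differ")
    worst = 0.0
    for orig, recon in zip(orig_rows, recon_rows):
        a, b = float(orig[column]), float(recon[column])
        if a == 0.0:
            # Relative error is undefined at zero, so require an exact match.
            err = 0.0 if b == 0.0 else float("inf")
        else:
            err = abs(b - a) / abs(a)
        worst = max(worst, err)
    return worst


# Made-up sample data: the "price" column differs slightly after a round trip.
original = "id,price\nA1,100.00\nA2,250.00\n"
reconstructed = "id,price\nA1,100.05\nA2,249.90\n"

err = max_relative_error(original, reconstructed, "price")
print(f"max relative error: {err:.4%}")
print("within 0.1% bound:", err < 0.001)
```

If the measured error for a column is too large for your use case, listing that column under residual_cols is the documented way to get exact reconstruction.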

Installation Notes

Windows: If git lfs-sempress isn't recognized, use:

python -m git_lfs_sempress.cli init

License

MIT License - see LICENSE
