Skip to main content

Non-overlapping Lempel-Ziv-Storer-Szymanski factorization (high-performance C++ core with Python bindings)

Project description

noLZSS

Build Wheels Documentation noLZSS Logo

Non-overlapping Lempel–Ziv–Storer–Szymanski factorization

High-performance Python library for text factorization using compressed suffix trees. The library provides efficient algorithms for finding non-overlapping factors in text data, with both in-memory and file-based processing capabilities. Based on a paper by Dominik Köppl - Non-Overlapping LZ77 Factorization and LZ78 Substring Compression Queries with Suffix Trees

Features

  • 🚀 High Performance: Uses compressed suffix trees (SDSL) for optimal factorization speed
  • 💾 Memory Efficient: File-based processing for large datasets without loading everything into memory
  • 🐍 Python Bindings: Easy-to-use Python interface with proper GIL management
  • 📊 Multiple Output Formats: Get factors as lists, counts, or binary files
  • 🔧 Flexible API: Support for both strings and files with optional performance hints
  • 🧬 Genomics Support: Specialized functions for FASTA file processing of nucleotide and protein sequences
  • C++ Extensions: High-performance C++ implementations for memory-intensive operations

Installation

From Source (Development)

git clone https://github.com/OmerKerner/noLZSS.git
cd noLZSS
pip install -e .

Requirements

  • Python 3.8+
  • C++17 compatible compiler
  • CMake 3.20+

Quick Start

Basic Usage

import noLZSS

# Factorize a text string
text = b"abracadabra"
factors = noLZSS.factorize(text)
print(factors)  # [(0, 1, 0), (1, 1, 1), (2, 1, 2), (3, 1, 0), (4, 1, 4), (5, 1, 0), (6, 1, 6), (7, 4, 0)]

Working with Files

# Factorize text from a file
factors = noLZSS.factorize_file("large_text.txt")
print(f"Found {len(factors)} factors")

# Just count factors without storing them (memory efficient)
count = noLZSS.count_factors_file("large_text.txt")
print(f"Total factors: {count}")

# Write factors to binary file for later processing
noLZSS.write_factors_binary_file("input.txt", "factors.bin")

Genomics Applications

import noLZSS.genomics

# Process DNA sequences from FASTA file
results = noLZSS.genomics.read_nucleotide_fasta("sequences.fasta")
for seq_id, factors in results:
    print(f"Sequence {seq_id}: {len(factors)} factors")

Batch Processing

For processing multiple FASTA files, noLZSS provides batch processing scripts:

Local Batch Processing

# Process multiple files locally with parallel downloads and factorization
python -m noLZSS.genomics.batch_factorize \
    --file-list files.txt \
    --output-dir results \
    --mode with_reverse_complement \
    --max-workers 4

LSF Cluster Batch Processing

# Process on LSF cluster with optimal resource allocation
python -m noLZSS.genomics.lsf_batch_factorize \
    --file-list files.txt \
    --output-dir results \
    --max-threads 16 \
    --mode with_reverse_complement

The LSF batch processor:

  • Estimates time, memory, and disk requirements from benchmark data
  • Allocates optimal threads per job based on file size
  • Submits jobs with bsub and tracks completion
  • Provides consolidated failure reports without log spam

See LSF Batch Factorizer Documentation for details.

Algorithm Details

The library implements the Non-overlapping Lempel-Ziv-Storer-Szymanski (LZSS) factorization algorithm using:

  • Compressed Suffix Trees: Built using the SDSL (Succinct Data Structure Library)
  • Range Minimum Queries: For efficient lowest common ancestor computations
  • Sink-based Processing: Memory-efficient processing using callback functions

Performance

  • Time Complexity: 𝒪(𝑛 lgϵ 𝑛) for factorization, where n is input length, and 𝜖 ∈ (0,1]
  • Space Complexity: 𝒪(𝑛lg𝜎) for suffix tree construction, where 𝜎 is the alphabet size
  • Memory Usage: File-based processing uses minimal memory for large files
  • C++ Extensions: Specialized high-performance functions for memory-intensive genomics operations

Documentation

Complete documentation is available at omerkerner.github.io/noLZSS

The documentation includes:

  • Python API Reference: Complete Python API with examples and parameter descriptions
  • C++ API Reference: Auto-generated C++ API documentation from source code
  • Genomics Module: Specialized functions for biological sequence analysis
  • Examples and Tutorials: Comprehensive usage examples and best practices

Development

Building from Source

# Clone repository
git clone https://github.com/OmerKerner/noLZSS.git
cd noLZSS

# Install in development mode
pip install -e .

# Run tests
python -m pytest tests/

License

This project is licensed under the BSD 3-Clause License (see LICENSE).

The repository vendors third-party components (notably SDSL v3). Third-party license texts and attribution are provided in THIRD_PARTY_LICENSES.txt.

Citation

If you use this library in your research, please cite:

@software{noLZSS,
  title = {noLZSS: Non-overlapping Lempel-Ziv-Storer-Szymanski factorization},
  author = {Kerner, Omer},
  url = {https://github.com/OmerKerner/noLZSS},
  year = {2024}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

nolzss-0.3.2.tar.gz (5.5 MB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

nolzss-0.3.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (624.9 kB view details)

Uploaded CPython 3.12manylinux: glibc 2.17+ x86-64

nolzss-0.3.2-cp312-cp312-macosx_11_0_arm64.whl (399.4 kB view details)

Uploaded CPython 3.12macOS 11.0+ ARM64

nolzss-0.3.2-cp312-cp312-macosx_10_15_x86_64.whl (442.8 kB view details)

Uploaded CPython 3.12macOS 10.15+ x86-64

nolzss-0.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (625.5 kB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ x86-64

nolzss-0.3.2-cp311-cp311-macosx_11_0_arm64.whl (398.9 kB view details)

Uploaded CPython 3.11macOS 11.0+ ARM64

nolzss-0.3.2-cp311-cp311-macosx_10_15_x86_64.whl (440.9 kB view details)

Uploaded CPython 3.11macOS 10.15+ x86-64

nolzss-0.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (623.7 kB view details)

Uploaded CPython 3.10manylinux: glibc 2.17+ x86-64

nolzss-0.3.2-cp310-cp310-macosx_11_0_arm64.whl (397.8 kB view details)

Uploaded CPython 3.10macOS 11.0+ ARM64

nolzss-0.3.2-cp310-cp310-macosx_10_15_x86_64.whl (439.6 kB view details)

Uploaded CPython 3.10macOS 10.15+ x86-64

nolzss-0.3.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (624.2 kB view details)

Uploaded CPython 3.9manylinux: glibc 2.17+ x86-64

nolzss-0.3.2-cp39-cp39-macosx_11_0_arm64.whl (397.9 kB view details)

Uploaded CPython 3.9macOS 11.0+ ARM64

nolzss-0.3.2-cp39-cp39-macosx_10_15_x86_64.whl (439.8 kB view details)

Uploaded CPython 3.9macOS 10.15+ x86-64

nolzss-0.3.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (623.7 kB view details)

Uploaded CPython 3.8manylinux: glibc 2.17+ x86-64

nolzss-0.3.2-cp38-cp38-macosx_11_0_arm64.whl (397.7 kB view details)

Uploaded CPython 3.8macOS 11.0+ ARM64

nolzss-0.3.2-cp38-cp38-macosx_10_15_x86_64.whl (439.4 kB view details)

Uploaded CPython 3.8macOS 10.15+ x86-64

File details

Details for the file nolzss-0.3.2.tar.gz.

File metadata

  • Download URL: nolzss-0.3.2.tar.gz
  • Upload date:
  • Size: 5.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for nolzss-0.3.2.tar.gz
Algorithm Hash digest
SHA256 976e80b8abc2c691bb3373c2ee8c5bdce2fcc14b4b297467550c229270a77d78
MD5 994e03f0f1152c15e4bd821c5de2ef1e
BLAKE2b-256 d022039a1057a777b8bd74d14c5b328b93211f39ff91fd73919b03e1b8a6f6f7

See more details on using hashes here.

File details

Details for the file nolzss-0.3.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-0.3.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 1e3ba772daa71c7387e3afde02f0166b6e4497358f6e88bbf62625631cdf1a1e
MD5 f55cd7b5943046ee0e00089006d7fa81
BLAKE2b-256 f52b1aa351468ff365f80573f0fb3a1369fce8bb104cef24369a11a2e692fc62

See more details on using hashes here.

File details

Details for the file nolzss-0.3.2-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for nolzss-0.3.2-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 88ebf04a554965279e46f45e3eaec54493534931676f7065dc4cf7a97259138d
MD5 a57f7e4fcf9cbdaae68bf16c7e3e43d4
BLAKE2b-256 dd7c7dd534bd27e0d11872ee14d82225860ebf94bbd46e0843e0e5809fc000d1

See more details on using hashes here.

File details

Details for the file nolzss-0.3.2-cp312-cp312-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-0.3.2-cp312-cp312-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 02fdefcb91c9f7986f9203eb1deda9a24679a3a68a3f8dcf05701c60ae755064
MD5 996c2e8fdaa6b5b770712b92a9ad688d
BLAKE2b-256 19fdd6d0eb2fc2bdd11bf2db9a937eb398ed9618eeefcb8faad4ef5ef30211e7

See more details on using hashes here.

File details

Details for the file nolzss-0.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-0.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 b5251191b004c5ef0ccc982d13f871af0238e6850ceacdac765f22b3b3fd99c8
MD5 8b28486615a0961f5ad7431f65ec00d0
BLAKE2b-256 4074510535b190242a0ee45b560811574cc64d8914d5139d1d7bf43d2e907f61

See more details on using hashes here.

File details

Details for the file nolzss-0.3.2-cp311-cp311-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for nolzss-0.3.2-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 e5b91afc4266000d0fb6c22baf1d41ea6af6758e6a4c2f2f8ecf2571dbd941d9
MD5 bcb02021cfa3a2c5b204b5bb038daeea
BLAKE2b-256 660c9c2909a4a1f14bb038658ad78bb66c56cdb203244d2f3b260e77ed25316b

See more details on using hashes here.

File details

Details for the file nolzss-0.3.2-cp311-cp311-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-0.3.2-cp311-cp311-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 cb2b567695c1443275f28b78c0031bd8820bcf347b380d405bf6c55ec65c8768
MD5 831a959eabd7eda746862c878590c6ca
BLAKE2b-256 7cdd66b2353a27a6fcfca597f85558a069b999fc616aaf4f668f29e25916334b

See more details on using hashes here.

File details

Details for the file nolzss-0.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-0.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 1efb20b4ed489e103f5b634e48da0c502ab70c9778d35eef57aa6e0fbc89a471
MD5 46dee8c98ce84914cf3b06e345c7dac1
BLAKE2b-256 d5f662425f5ce8086d72695732349e862604eb320f95dfa024325dc2b7a085ce

See more details on using hashes here.

File details

Details for the file nolzss-0.3.2-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for nolzss-0.3.2-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 a978b2e59e305ed7c46935e59446f613ec4221e737c53092ffff6c65b1d8b608
MD5 6542a8462ef3c5f75ac582a405a40a60
BLAKE2b-256 2e0059cc819390d0c0dce453eab5ffcac0cf61a01fb8505c26269a881018bfd5

See more details on using hashes here.

File details

Details for the file nolzss-0.3.2-cp310-cp310-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-0.3.2-cp310-cp310-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 4c7d81679aa51ebe86733950179b9522ec54d04c3deadcc8640273b5b2e0f5e6
MD5 26db74b999a88462b314883c1a56f259
BLAKE2b-256 d386b3318f92d34841f25f5059dbd2ed544fa7d10159f86dbd3c7184f65c9324

See more details on using hashes here.

File details

Details for the file nolzss-0.3.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-0.3.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 a83beec74656c5df7994537ccbca7e3f464184c6bc802e2b759b385d59418ed8
MD5 5645e3a5f2247cfaa2515cecc688f327
BLAKE2b-256 fbbd922509fe66f8ab89ef5cc4a3c3fee15cba9fc8b51e57afe48302d6362a95

See more details on using hashes here.

File details

Details for the file nolzss-0.3.2-cp39-cp39-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for nolzss-0.3.2-cp39-cp39-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 a133ab5e5b93018a5aebecdd85f96884dd58713a0738646a6f975a68c74ef0f0
MD5 cea6bf7e5efe6e404e02c3eff834e92a
BLAKE2b-256 8a472e66b01b5b68650113a33cd1152ac46d8efd2b07b3671391ad997a33f87f

See more details on using hashes here.

File details

Details for the file nolzss-0.3.2-cp39-cp39-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-0.3.2-cp39-cp39-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 ff510124003f55c2106a10c8d87d151272f8f03ca25376d3bacf6bb2c898ea2f
MD5 a7e701fef1110228253c2a9015780baa
BLAKE2b-256 07d6e793ff9734a2e7df0285240f52a6cd2c4792fcd4033134aed6c7df119aa9

See more details on using hashes here.

File details

Details for the file nolzss-0.3.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-0.3.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 7aaebb02e072edc2e695482a23fa5e35dbc6c2f0faae31fe19865357d1c98d97
MD5 1b1d547eb5ea341ab94e6a78f7133791
BLAKE2b-256 e5a4a4aaaa219b3e72c60866c178c301473c2011fbb11b70da3cd03a37a8e831

See more details on using hashes here.

File details

Details for the file nolzss-0.3.2-cp38-cp38-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for nolzss-0.3.2-cp38-cp38-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 637be0396a0ad8788e67156bb7228709e4757011104284396ef3d81be1a2b158
MD5 ce7e1e6d0e73ee5c408ce3ddcc14a871
BLAKE2b-256 a3e434567f7c5b3f0d6ccec3a8867eead2729e67f5e19a71f871b072d353f3f0

See more details on using hashes here.

File details

Details for the file nolzss-0.3.2-cp38-cp38-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-0.3.2-cp38-cp38-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 d0b53241f0da0514b5a20496aac29a9cbd0f2f9003d7133652a907dca4d0ce23
MD5 e14706d243e38db539eff792179cafd7
BLAKE2b-256 37aeea59e3909a7d9532ffecc6dfb62675f1c83ceb963439ae78d2ec90718d7d

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page