Skip to main content

Non-overlapping Lempel-Ziv-Storer-Szymanski factorization (high-performance C++ core with Python bindings)

Project description

noLZSS

Build Wheels Documentation noLZSS Logo

Non-overlapping Lempel–Ziv–Storer–Szymanski factorization

High-performance Python library for text factorization using compressed suffix trees. The library provides efficient algorithms for finding non-overlapping factors in text data, with both in-memory and file-based processing capabilities. Based on a paper by Dominik Köppl - Non-Overlapping LZ77 Factorization and LZ78 Substring Compression Queries with Suffix Trees

Features

  • 🚀 High Performance: Uses compressed suffix trees (SDSL) for optimal factorization speed
  • 💾 Memory Efficient: File-based processing for large datasets without loading everything into memory
  • 🐍 Python Bindings: Easy-to-use Python interface with proper GIL management
  • 📊 Multiple Output Formats: Get factors as lists, counts, or binary files
  • 🔧 Flexible API: Support for both strings and files with optional performance hints
  • 🧬 Genomics Support: Specialized functions for FASTA file processing of nucleotide and protein sequences
  • C++ Extensions: High-performance C++ implementations for memory-intensive operations

Installation

From Source (Development)

git clone https://github.com/OmerKerner/noLZSS.git
cd noLZSS
pip install -e .

Requirements

  • Python 3.8+
  • C++17 compatible compiler
  • CMake 3.20+

Quick Start

Basic Usage

import noLZSS

# Factorize a text string
text = b"abracadabra"
factors = noLZSS.factorize(text)
print(factors)  # [(0, 1, 0), (1, 1, 1), (2, 1, 2), (3, 1, 0), (4, 1, 4), (5, 1, 0), (6, 1, 6), (7, 4, 0)]

Working with Files

# Factorize text from a file
factors = noLZSS.factorize_file("large_text.txt")
print(f"Found {len(factors)} factors")

# Just count factors without storing them (memory efficient)
count = noLZSS.count_factors_file("large_text.txt")
print(f"Total factors: {count}")

# Write factors to binary file for later processing
noLZSS.write_factors_binary_file("input.txt", "factors.bin")

Genomics Applications

import noLZSS.genomics

# Process DNA sequences from FASTA file
results = noLZSS.genomics.read_nucleotide_fasta("sequences.fasta")
for seq_id, factors in results:
    print(f"Sequence {seq_id}: {len(factors)} factors")

Batch Processing

For processing multiple FASTA files, noLZSS provides batch processing scripts:

Local Batch Processing

# Process multiple files locally with parallel downloads and factorization
python -m noLZSS.genomics.batch_factorize \
    --file-list files.txt \
    --output-dir results \
    --mode with_reverse_complement \
    --max-workers 4

LSF Cluster Batch Processing

# Process on LSF cluster with optimal resource allocation
python -m noLZSS.genomics.lsf_batch_factorize \
    --file-list files.txt \
    --output-dir results \
    --max-threads 16 \
    --mode with_reverse_complement

The LSF batch processor:

  • Estimates time, memory, and disk requirements from benchmark data
  • Allocates optimal threads per job based on file size
  • Submits jobs with bsub and tracks completion
  • Provides consolidated failure reports without log spam

See LSF Batch Factorizer Documentation for details.

Algorithm Details

The library implements the Non-overlapping Lempel-Ziv-Storer-Szymanski (LZSS) factorization algorithm using:

  • Compressed Suffix Trees: Built using the SDSL (Succinct Data Structure Library)
  • Range Minimum Queries: For efficient lowest common ancestor computations
  • Sink-based Processing: Memory-efficient processing using callback functions

Tie-Breaking for DNA Factorization

When factorizing DNA sequences with reverse complement awareness, if both a forward match and a reverse complement match have the same length, the forward match is preferred. Among candidates of the same type, the one with the earliest position wins.

Performance

  • Time Complexity: 𝒪(𝑛 lgϵ 𝑛) for factorization, where n is input length, and 𝜖 ∈ (0,1]
  • Space Complexity: 𝒪(𝑛lg𝜎) for suffix tree construction, where 𝜎 is the alphabet size
  • Memory Usage: File-based processing uses minimal memory for large files
  • C++ Extensions: Specialized high-performance functions for memory-intensive genomics operations

Documentation

Complete documentation is available at omerkerner.github.io/noLZSS

The documentation includes:

  • Python API Reference: Complete Python API with examples and parameter descriptions
  • C++ API Reference: Auto-generated C++ API documentation from source code
  • Genomics Module: Specialized functions for biological sequence analysis
  • Examples and Tutorials: Comprehensive usage examples and best practices

Development

Building from Source

# Clone repository
git clone https://github.com/OmerKerner/noLZSS.git
cd noLZSS

# Install in development mode
pip install -e .

# Run tests
python -m pytest tests/

License

This project is licensed under the BSD 3-Clause License (see LICENSE).

The repository vendors third-party components (notably SDSL v3). Third-party license texts and attribution are provided in THIRD_PARTY_LICENSES.txt.

Citation

If you use this library in your research, please cite:

@software{noLZSS,
  title = {noLZSS: Non-overlapping Lempel-Ziv-Storer-Szymanski factorization},
  author = {Kerner, Omer},
  url = {https://github.com/OmerKerner/noLZSS},
  year = {2024}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

nolzss-1.1.0.tar.gz (5.5 MB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

nolzss-1.1.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (623.7 kB view details)

Uploaded CPython 3.12manylinux: glibc 2.17+ x86-64

nolzss-1.1.0-cp312-cp312-macosx_11_0_arm64.whl (399.1 kB view details)

Uploaded CPython 3.12macOS 11.0+ ARM64

nolzss-1.1.0-cp312-cp312-macosx_10_15_x86_64.whl (443.0 kB view details)

Uploaded CPython 3.12macOS 10.15+ x86-64

nolzss-1.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (623.5 kB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ x86-64

nolzss-1.1.0-cp311-cp311-macosx_11_0_arm64.whl (398.5 kB view details)

Uploaded CPython 3.11macOS 11.0+ ARM64

nolzss-1.1.0-cp311-cp311-macosx_10_15_x86_64.whl (440.9 kB view details)

Uploaded CPython 3.11macOS 10.15+ x86-64

nolzss-1.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (622.2 kB view details)

Uploaded CPython 3.10manylinux: glibc 2.17+ x86-64

nolzss-1.1.0-cp310-cp310-macosx_11_0_arm64.whl (397.0 kB view details)

Uploaded CPython 3.10macOS 11.0+ ARM64

nolzss-1.1.0-cp310-cp310-macosx_10_15_x86_64.whl (439.6 kB view details)

Uploaded CPython 3.10macOS 10.15+ x86-64

nolzss-1.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (623.2 kB view details)

Uploaded CPython 3.9manylinux: glibc 2.17+ x86-64

nolzss-1.1.0-cp39-cp39-macosx_11_0_arm64.whl (397.2 kB view details)

Uploaded CPython 3.9macOS 11.0+ ARM64

nolzss-1.1.0-cp39-cp39-macosx_10_15_x86_64.whl (439.7 kB view details)

Uploaded CPython 3.9macOS 10.15+ x86-64

nolzss-1.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (622.8 kB view details)

Uploaded CPython 3.8manylinux: glibc 2.17+ x86-64

nolzss-1.1.0-cp38-cp38-macosx_11_0_arm64.whl (397.0 kB view details)

Uploaded CPython 3.8macOS 11.0+ ARM64

nolzss-1.1.0-cp38-cp38-macosx_10_15_x86_64.whl (439.4 kB view details)

Uploaded CPython 3.8macOS 10.15+ x86-64

File details

Details for the file nolzss-1.1.0.tar.gz.

File metadata

  • Download URL: nolzss-1.1.0.tar.gz
  • Upload date:
  • Size: 5.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for nolzss-1.1.0.tar.gz
Algorithm Hash digest
SHA256 27bd293307ab6e7d2c87f743c5068d63376baca9d1288615a753f8212a6d9fb0
MD5 0abeb3cbd7c20b2688e801d11efd05bf
BLAKE2b-256 33b41da7aabce5c710a22c4e06d2ea2de5bf4d749f10adb68ecf73f59928f3ed

See more details on using hashes here.

File details

Details for the file nolzss-1.1.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 baad125405dc28dae1218b36ce7c858a715e1fe51e97cb1bb394bfe59b56b16f
MD5 3d5963e7e16e42b46f42afe4c104bdac
BLAKE2b-256 0db6558d3d5a54a13f8df426171b36989e29a2424c7598ff76caea5369616887

See more details on using hashes here.

File details

Details for the file nolzss-1.1.0-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.0-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 29ea4b2e2e26673a87910f66675737cb982a487abf505129b2e93d22f1f6f5bf
MD5 4eb4b64ad6248b51a67a0050602b05d8
BLAKE2b-256 1bf475926e0aac4af9275ba1524321b4c2e43c1ef4cd768737691ac3d355e08d

See more details on using hashes here.

File details

Details for the file nolzss-1.1.0-cp312-cp312-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.0-cp312-cp312-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 73901f7c4723ec60a58f9c7e0b117c3608ec9ca428356b610a7bc7e868e88318
MD5 c554f0f4a679013a7ffe2a77c27047e1
BLAKE2b-256 b05cb0f2a36b57d6a9cc1ad622af75bcc70af38a7c3219cd12e52fca88d3543f

See more details on using hashes here.

File details

Details for the file nolzss-1.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 6934945c2fbf17d63c9182043353d44f08108231ead2f764878593f9f42ab3a7
MD5 5b0ec76ab76e3857d7f1db47bfbab422
BLAKE2b-256 574362690895be34595ec4f8e72e29de69c0eda9efa5e3c9eda373abe14b69ca

See more details on using hashes here.

File details

Details for the file nolzss-1.1.0-cp311-cp311-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.0-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 6597cabfe3535ad954f75c987efaec9639e799e983e6ba278215ef120e3d4d58
MD5 1491f58485c0e866dd56433da18bc845
BLAKE2b-256 0058f3d88a836d11a579e2c3d4b607e30c559ead8970ccb1f88efcc5d6d52eeb

See more details on using hashes here.

File details

Details for the file nolzss-1.1.0-cp311-cp311-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.0-cp311-cp311-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 a447c70b68930aacca8693bfdeab018a7bc7d4f108ca7169a3b320cd02633283
MD5 b1aa56c76e8c75002926577665cc869b
BLAKE2b-256 824db5b37de3195e668dd6de343f61875ef893cb14821b88acdf0ae83f824e36

See more details on using hashes here.

File details

Details for the file nolzss-1.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 03c6de93acf77f24821b6487b61501e637dc079583e6823ba6555e47278df559
MD5 c4acfade8d215c04cb2e96dd2fc109fb
BLAKE2b-256 24755a761c472074d48b2e34dd1eff246b8b23f4292a9b1639b0643c314db242

See more details on using hashes here.

File details

Details for the file nolzss-1.1.0-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.0-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 cec9d750ec790aca77cc12285bbd0e5ddc8b964ac7b9ecfa9fe8d9c2a2f72842
MD5 0a881d72cb5c38371ffb6a91b4620947
BLAKE2b-256 6d1770d6bf044f0cfba3e9c8046f391710fc0a966a0efbebd7dc91b4ea7ccdeb

See more details on using hashes here.

File details

Details for the file nolzss-1.1.0-cp310-cp310-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.0-cp310-cp310-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 d7cb61ca276db1d486507b50a46a71456ea7697ff51fd6391d99f5840234a79d
MD5 7575bf5ec3393663715e2e709be67e0d
BLAKE2b-256 260ee5582e6e3271506cc7a48cee07dfbab60d26a201d64e3ea368f7e1277ed9

See more details on using hashes here.

File details

Details for the file nolzss-1.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 fa0de41839d3f8ab59371be5ddcaa8674688803b2dd3f21390f0b44d21a4029d
MD5 e66d62c6e39ad070799e84ed4924e5d3
BLAKE2b-256 7955e62fb2e016cb927c490442f06ba7cf3b6ff55a2391cb4866517c212db93e

See more details on using hashes here.

File details

Details for the file nolzss-1.1.0-cp39-cp39-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.0-cp39-cp39-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 88f8767430ce940a935821368900ebd76dc457d4500b9daf72e96902b344ff63
MD5 f1353ecab78c20eaa89ae0663b731067
BLAKE2b-256 a331ecd68251a0a432e76c70d0d5e68d7afd49d3036cdb58e92a69e1da2779a4

See more details on using hashes here.

File details

Details for the file nolzss-1.1.0-cp39-cp39-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.0-cp39-cp39-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 0b143673431ad9a154aca3af122bdb3cb9341d4a1acf29f1867da669008d95aa
MD5 92c8eb56f8dac5a63a596c3f0b5eb446
BLAKE2b-256 6d56ef4b7af8484ecd0f81e471c955bba6f62f1ae8b774492c1f10b33b0403c2

See more details on using hashes here.

File details

Details for the file nolzss-1.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 66c0d978b5d20fc08c8f830b91cfe2e8662f6a5e5c32fbc28679e67c6889715f
MD5 2d01c9d7f9eb323e1b1de4838ea0b886
BLAKE2b-256 63312cbaf84b4b5cfdd379077f1cc19a2db39e87dad000421e844d003d11cb99

See more details on using hashes here.

File details

Details for the file nolzss-1.1.0-cp38-cp38-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.0-cp38-cp38-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 a65b2dd28179b45b05341e037362bad9abe5aa140420cb0c333099576b52d32a
MD5 6c7be6edb4b8dbd59ae6e7399f9dcc2b
BLAKE2b-256 67262e10bee895b867ad5f9f0dacbe10c91cef4d58778faffa9552d49149f2f9

See more details on using hashes here.

File details

Details for the file nolzss-1.1.0-cp38-cp38-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.0-cp38-cp38-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 babedd01aca07aa9b5b70d9f065543d486ec84208d156c535201620ee1eec143
MD5 9e95243f44a1d6bd83e4b6689f6d52af
BLAKE2b-256 200122d7f3f73f213d9d3c895068a4b1d4b4b7339a9f2251b2979768a976da69

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page