Skip to main content

Non-overlapping Lempel-Ziv-Storer-Szymanski factorization (high-performance C++ core with Python bindings)

Project description

noLZSS

Build Wheels Documentation noLZSS Logo

Non-overlapping Lempel–Ziv–Storer–Szymanski factorization

High-performance Python library for text factorization using compressed suffix trees. The library provides efficient algorithms for finding non-overlapping factors in text data, with both in-memory and file-based processing capabilities. Based on a paper by Dominik Köppl - Non-Overlapping LZ77 Factorization and LZ78 Substring Compression Queries with Suffix Trees

Features

  • 🚀 High Performance: Uses compressed suffix trees (SDSL) for optimal factorization speed
  • 💾 Memory Efficient: File-based processing for large datasets without loading everything into memory
  • 🐍 Python Bindings: Easy-to-use Python interface with proper GIL management
  • 📊 Multiple Output Formats: Get factors as lists, counts, or binary files
  • 🔧 Flexible API: Support for both strings and files with optional performance hints
  • 🧬 Genomics Support: Specialized functions for FASTA file processing of nucleotide and protein sequences
  • C++ Extensions: High-performance C++ implementations for memory-intensive operations

Installation

From Source (Development)

git clone https://github.com/OmerKerner/noLZSS.git
cd noLZSS
pip install -e .

Requirements

  • Python 3.8+
  • C++17 compatible compiler
  • CMake 3.20+

Quick Start

Basic Usage

import noLZSS

# Factorize a text string
text = b"abracadabra"
factors = noLZSS.factorize(text)
print(factors)  # [(0, 1, 0), (1, 1, 1), (2, 1, 2), (3, 1, 0), (4, 1, 4), (5, 1, 0), (6, 1, 6), (7, 4, 0)]

Working with Files

# Factorize text from a file
factors = noLZSS.factorize_file("large_text.txt")
print(f"Found {len(factors)} factors")

# Just count factors without storing them (memory efficient)
count = noLZSS.count_factors_file("large_text.txt")
print(f"Total factors: {count}")

# Write factors to binary file for later processing
noLZSS.write_factors_binary_file("input.txt", "factors.bin")

Genomics Applications

import noLZSS.genomics

# Process DNA sequences from FASTA file
results = noLZSS.genomics.read_nucleotide_fasta("sequences.fasta")
for seq_id, factors in results:
    print(f"Sequence {seq_id}: {len(factors)} factors")

Statistical Significance Analysis

Distinguish signal from noise in genomic factorizations by comparing with shuffled genomes:

from noLZSS.genomics import calculate_factor_length_threshold

result = calculate_factor_length_threshold(
    "genome_factors.bin",
    "shuffled_factors.bin",
    tau_expected_fp=1.0
)

print(f"Minimum significant factor length: {result['L_star']}")

See Factor Length Significance Analysis for detailed methodology.

Batch Processing

For processing multiple FASTA files, noLZSS provides batch processing scripts:

Local Batch Processing

# Process multiple files locally with parallel downloads and factorization
python -m noLZSS.genomics.batch_factorize \
    --file-list files.txt \
    --output-dir results \
    --mode with_reverse_complement \
    --max-workers 4

LSF Cluster Batch Processing

# Process on LSF cluster with optimal resource allocation
python -m noLZSS.genomics.lsf_batch_factorize \
    --file-list files.txt \
    --output-dir results \
    --max-threads 16 \
    --mode with_reverse_complement

The LSF batch processor:

  • Estimates time, memory, and disk requirements from benchmark data
  • Allocates optimal threads per job based on file size
  • Submits jobs with bsub and tracks completion
  • Provides consolidated failure reports without log spam

See LSF Batch Factorizer Documentation for details.

Algorithm Details

The library implements the Non-overlapping Lempel-Ziv-Storer-Szymanski (LZSS) factorization algorithm using:

  • Compressed Suffix Trees: Built using the SDSL (Succinct Data Structure Library)
  • Range Minimum Queries: For efficient lowest common ancestor computations
  • Sink-based Processing: Memory-efficient processing using callback functions

Tie-Breaking for DNA Factorization

When factorizing DNA sequences with reverse complement awareness, if both a forward match and a reverse complement match have the same length, the forward match is preferred. Among candidates of the same type, the one with the earliest position wins.

Performance

  • Time Complexity: 𝒪(𝑛 lgϵ 𝑛) for factorization, where n is input length, and 𝜖 ∈ (0,1]
  • Space Complexity: 𝒪(𝑛lg𝜎) for suffix tree construction, where 𝜎 is the alphabet size
  • Memory Usage: File-based processing uses minimal memory for large files
  • C++ Extensions: Specialized high-performance functions for memory-intensive genomics operations

Documentation

Complete documentation is available at omerkerner.github.io/noLZSS

The documentation includes:

  • Python API Reference: Complete Python API with examples and parameter descriptions
  • C++ API Reference: Auto-generated C++ API documentation from source code
  • Genomics Module: Specialized functions for biological sequence analysis
  • Examples and Tutorials: Comprehensive usage examples and best practices

Development

Building from Source

# Clone repository
git clone https://github.com/OmerKerner/noLZSS.git
cd noLZSS

# Install in development mode
pip install -e .

# Run tests
python -m pytest tests/

License

This project is licensed under the BSD 3-Clause License (see LICENSE).

The repository vendors third-party components (notably SDSL v3). Third-party license texts and attribution are provided in THIRD_PARTY_LICENSES.txt.

Citation

If you use this library in your research, please cite:

@software{noLZSS,
  title = {noLZSS: Non-overlapping Lempel-Ziv-Storer-Szymanski factorization},
  author = {Kerner, Omer},
  url = {https://github.com/OmerKerner/noLZSS},
  year = {2024}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

nolzss-1.2.0.tar.gz (5.6 MB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

nolzss-1.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (643.3 kB view details)

Uploaded CPython 3.12manylinux: glibc 2.17+ x86-64

nolzss-1.2.0-cp312-cp312-macosx_11_0_arm64.whl (421.4 kB view details)

Uploaded CPython 3.12macOS 11.0+ ARM64

nolzss-1.2.0-cp312-cp312-macosx_10_15_x86_64.whl (466.0 kB view details)

Uploaded CPython 3.12macOS 10.15+ x86-64

nolzss-1.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (645.7 kB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ x86-64

nolzss-1.2.0-cp311-cp311-macosx_11_0_arm64.whl (421.1 kB view details)

Uploaded CPython 3.11macOS 11.0+ ARM64

nolzss-1.2.0-cp311-cp311-macosx_10_15_x86_64.whl (464.0 kB view details)

Uploaded CPython 3.11macOS 10.15+ x86-64

nolzss-1.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (643.8 kB view details)

Uploaded CPython 3.10manylinux: glibc 2.17+ x86-64

nolzss-1.2.0-cp310-cp310-macosx_11_0_arm64.whl (419.8 kB view details)

Uploaded CPython 3.10macOS 11.0+ ARM64

nolzss-1.2.0-cp310-cp310-macosx_10_15_x86_64.whl (462.9 kB view details)

Uploaded CPython 3.10macOS 10.15+ x86-64

nolzss-1.2.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (644.1 kB view details)

Uploaded CPython 3.9manylinux: glibc 2.17+ x86-64

nolzss-1.2.0-cp39-cp39-macosx_11_0_arm64.whl (420.0 kB view details)

Uploaded CPython 3.9macOS 11.0+ ARM64

nolzss-1.2.0-cp39-cp39-macosx_10_15_x86_64.whl (462.9 kB view details)

Uploaded CPython 3.9macOS 10.15+ x86-64

nolzss-1.2.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (643.5 kB view details)

Uploaded CPython 3.8manylinux: glibc 2.17+ x86-64

nolzss-1.2.0-cp38-cp38-macosx_11_0_arm64.whl (419.7 kB view details)

Uploaded CPython 3.8macOS 11.0+ ARM64

nolzss-1.2.0-cp38-cp38-macosx_10_15_x86_64.whl (462.6 kB view details)

Uploaded CPython 3.8macOS 10.15+ x86-64

File details

Details for the file nolzss-1.2.0.tar.gz.

File metadata

  • Download URL: nolzss-1.2.0.tar.gz
  • Upload date:
  • Size: 5.6 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for nolzss-1.2.0.tar.gz
Algorithm Hash digest
SHA256 faa9271db350b49e372a38160d32785c3460e45621a4a3fcda0be323576b1a62
MD5 69961ce0cc3463d06daa9564d6f7b3dd
BLAKE2b-256 e7836bcab14fbe36be3abd86fbef76c1baa62c95fa4d545fcc025f6cb6c9504f

See more details on using hashes here.

File details

Details for the file nolzss-1.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 672c876be0a1b857c750e03a32cc00e20db3dae28147340eee473622c5678c9c
MD5 55ca6e6e6a479ed2063826c261e96beb
BLAKE2b-256 7ab45cac88fba5184bcc39bc3e2a86c31fe103b132f8fd193018ac833a8651aa

See more details on using hashes here.

File details

Details for the file nolzss-1.2.0-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for nolzss-1.2.0-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 7e42a1663ccc6d415714d6eaccfd776d40d70643379d6ce155ffa1151a074f1a
MD5 ea437de5d6ef5c187d47b1b54abf9b67
BLAKE2b-256 e6b9139df39596b12b1cda54514e3af30bd30f320cc358af42586d92d6a5ff84

See more details on using hashes here.

File details

Details for the file nolzss-1.2.0-cp312-cp312-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.2.0-cp312-cp312-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 4b4581c60738d49cd5ff44e420c7185e96ddb22a674578b56b8b9f6b8adbcac8
MD5 b9f8610ec1b2f921136a34520c3ee276
BLAKE2b-256 9d3543761b79fa1a0dd392e9f59b0b7073d5f16f84d5ee37a4e26f60b8dec23f

See more details on using hashes here.

File details

Details for the file nolzss-1.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 75641020f915a07f0d2742f64430b707cd0c2980511d3893c713965bfe6ff30d
MD5 3dfb9e92fdd2aef4119e26b973e1bd11
BLAKE2b-256 c367d8ea72e71a3ef11e93648a9fd55e8efc5cfc3733f3583fb9260072ed093a

See more details on using hashes here.

File details

Details for the file nolzss-1.2.0-cp311-cp311-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for nolzss-1.2.0-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 9363fce8f7006d959014a1da5036bbbd159937f9ef9e860a703727e951f00934
MD5 4fdb3935c296def5dbc50867b1a7927d
BLAKE2b-256 2d2b546162b299e195b2d89c34665e72e4c707aa619b7008d13e18c00ecc6db6

See more details on using hashes here.

File details

Details for the file nolzss-1.2.0-cp311-cp311-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.2.0-cp311-cp311-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 7b82dc1029299169db3e52fa943853db26067d016d61f9d0b323d6cc5f9bc864
MD5 60d9281edd5ee09c884d10283a19c9ae
BLAKE2b-256 6b866578c59e8ca03a5350275b49efca3da0dc7dea28cbfc5e0f355a91bfac57

See more details on using hashes here.

File details

Details for the file nolzss-1.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 7352cad68aee813fb4fd037f89b31786363d0d35874fbb700b5df826a8a6a39b
MD5 e49f3d090cdeca9a1fff0ce7eb7af31d
BLAKE2b-256 4c7c4a2dc4b71aa6757a9711f9c0d49117c8ea3ec7d469ba693d39c7163a768e

See more details on using hashes here.

File details

Details for the file nolzss-1.2.0-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for nolzss-1.2.0-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 51821b98031cd1d3b92ddb39f1a61e16fea190dc06a2d0d10186fb29d6e06df5
MD5 fe7d902b84fe921acee2dea5872c49e8
BLAKE2b-256 30220fc8aa5d14e90312c525fb5a9614b1aeda4f0b34e54cd3f7c09680faad8b

See more details on using hashes here.

File details

Details for the file nolzss-1.2.0-cp310-cp310-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.2.0-cp310-cp310-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 fd2b62d18488eeee8f8bdec60ce2c7f167ddbf404b9b05c0acba27078dd61e53
MD5 8b404e98eb631f9e001ee6ff0f087394
BLAKE2b-256 fdedebaccf15ab024fb32ba553cba97c43f5aef52de1774e841c631ab147d70c

See more details on using hashes here.

File details

Details for the file nolzss-1.2.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.2.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 3b4afc218ea9e2b0a3723382ee2a24f8676346813a8f2c99fceff728374751a8
MD5 8d2f1a9f672ee112ebc3ccc0f7bed4cf
BLAKE2b-256 766120687f8d2dfc1a1805732deca72d6e9897323ccf90ec464b052ed6665d85

See more details on using hashes here.

File details

Details for the file nolzss-1.2.0-cp39-cp39-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for nolzss-1.2.0-cp39-cp39-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 abb08772ca35b8cbd650adbd2067235f385a6325e31cd27c5362695a8b80ef60
MD5 eb79b1b9ade4ed7e31ac86e8e64771fc
BLAKE2b-256 b543b4c908c966840cdde57d4f74b7a057009408d7dbdc3e334516305924eb28

See more details on using hashes here.

File details

Details for the file nolzss-1.2.0-cp39-cp39-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.2.0-cp39-cp39-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 27b4c479607b7e795727aa38818c93c6d1bdcdb8a6f61b353d8a6936e390299b
MD5 17b3b129499539ceb348d64b750fe776
BLAKE2b-256 d1457d8290952704e48a6f57f61a3d043eb46544fc290b2326b4a2f71ce60a67

See more details on using hashes here.

File details

Details for the file nolzss-1.2.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.2.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 3acde4403aa05303c20192a5877ace7f4b413d175a9dbb7f696eb7c3b036b356
MD5 911ca852f9eb45202c1e3a2555c98e17
BLAKE2b-256 a5b94c4281910334d573144fbf155eaa5c5f92a7d2feac63694ea4da85788f8a

See more details on using hashes here.

File details

Details for the file nolzss-1.2.0-cp38-cp38-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for nolzss-1.2.0-cp38-cp38-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 8cbf00438e41d2f5662ca95fea0ccacd8f5a9db84521b03ec11101500c6fec2f
MD5 9e392456310d35da9712f454a64837ec
BLAKE2b-256 4c11f1043e6c572ccf8fac76a91d04683b117d83cdeef9930e14d70f021f648f

See more details on using hashes here.

File details

Details for the file nolzss-1.2.0-cp38-cp38-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.2.0-cp38-cp38-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 99553b62dd7ec9eb1a0cebabc40ce6a7348e4fc0fbb33d0d57538db343630380
MD5 8dfdd9b612ab51e0ad6ef1e266097f2c
BLAKE2b-256 26aac5b00253e63af3faf444cfded9cf1d99c8e1f5e84dd52303d60b8372cb20

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page