Skip to main content

Non-overlapping Lempel-Ziv-Storer-Szymanski factorization (high-performance C++ core with Python bindings)

Project description

noLZSS

Build Wheels Documentation noLZSS Logo

Non-overlapping Lempel–Ziv–Storer–Szymanski factorization

High-performance Python library for text factorization using compressed suffix trees. The library provides efficient algorithms for finding non-overlapping factors in text data, with both in-memory and file-based processing capabilities. Based on a paper by Dominik Köppl - Non-Overlapping LZ77 Factorization and LZ78 Substring Compression Queries with Suffix Trees

Features

  • 🚀 High Performance: Uses compressed suffix trees (SDSL) for optimal factorization speed
  • 💾 Memory Efficient: File-based processing for large datasets without loading everything into memory
  • 🐍 Python Bindings: Easy-to-use Python interface with proper GIL management
  • 📊 Multiple Output Formats: Get factors as lists, counts, or binary files
  • 🔧 Flexible API: Support for both strings and files with optional performance hints
  • 🧬 Genomics Support: Specialized functions for FASTA file processing of nucleotide and protein sequences
  • C++ Extensions: High-performance C++ implementations for memory-intensive operations

Installation

From Source (Development)

git clone https://github.com/OmerKerner/noLZSS.git
cd noLZSS
pip install -e .

Requirements

  • Python 3.8+
  • C++17 compatible compiler
  • CMake 3.20+

Quick Start

Basic Usage

import noLZSS

# Factorize a text string
text = b"abracadabra"
factors = noLZSS.factorize(text)
print(factors)  # [(0, 1, 0), (1, 1, 1), (2, 1, 2), (3, 1, 0), (4, 1, 4), (5, 1, 0), (6, 1, 6), (7, 4, 0)]

Working with Files

# Factorize text from a file
factors = noLZSS.factorize_file("large_text.txt")
print(f"Found {len(factors)} factors")

# Just count factors without storing them (memory efficient)
count = noLZSS.count_factors_file("large_text.txt")
print(f"Total factors: {count}")

# Write factors to binary file for later processing
noLZSS.write_factors_binary_file("input.txt", "factors.bin")

Genomics Applications

import noLZSS.genomics

# Process DNA sequences from FASTA file
results = noLZSS.genomics.read_nucleotide_fasta("sequences.fasta")
for seq_id, factors in results:
    print(f"Sequence {seq_id}: {len(factors)} factors")

Batch Processing

For processing multiple FASTA files, noLZSS provides batch processing scripts:

Local Batch Processing

# Process multiple files locally with parallel downloads and factorization
python -m noLZSS.genomics.batch_factorize \
    --file-list files.txt \
    --output-dir results \
    --mode with_reverse_complement \
    --max-workers 4

LSF Cluster Batch Processing

# Process on LSF cluster with optimal resource allocation
python -m noLZSS.genomics.lsf_batch_factorize \
    --file-list files.txt \
    --output-dir results \
    --max-threads 16 \
    --mode with_reverse_complement

The LSF batch processor:

  • Estimates time, memory, and disk requirements from benchmark data
  • Allocates optimal threads per job based on file size
  • Submits jobs with bsub and tracks completion
  • Provides consolidated failure reports without log spam

See LSF Batch Factorizer Documentation for details.

Algorithm Details

The library implements the Non-overlapping Lempel-Ziv-Storer-Szymanski (LZSS) factorization algorithm using:

  • Compressed Suffix Trees: Built using the SDSL (Succinct Data Structure Library)
  • Range Minimum Queries: For efficient lowest common ancestor computations
  • Sink-based Processing: Memory-efficient processing using callback functions

Tie-Breaking for DNA Factorization

When factorizing DNA sequences with reverse complement awareness, if both a forward match and a reverse complement match have the same length, the forward match is preferred. Among candidates of the same type, the one with the earliest position wins.

Performance

  • Time Complexity: 𝒪(𝑛 lgϵ 𝑛) for factorization, where n is input length, and 𝜖 ∈ (0,1]
  • Space Complexity: 𝒪(𝑛lg𝜎) for suffix tree construction, where 𝜎 is the alphabet size
  • Memory Usage: File-based processing uses minimal memory for large files
  • C++ Extensions: Specialized high-performance functions for memory-intensive genomics operations

Documentation

Complete documentation is available at omerkerner.github.io/noLZSS

The documentation includes:

  • Python API Reference: Complete Python API with examples and parameter descriptions
  • C++ API Reference: Auto-generated C++ API documentation from source code
  • Genomics Module: Specialized functions for biological sequence analysis
  • Examples and Tutorials: Comprehensive usage examples and best practices

Development

Building from Source

# Clone repository
git clone https://github.com/OmerKerner/noLZSS.git
cd noLZSS

# Install in development mode
pip install -e .

# Run tests
python -m pytest tests/

License

This project is licensed under the BSD 3-Clause License (see LICENSE).

The repository vendors third-party components (notably SDSL v3). Third-party license texts and attribution are provided in THIRD_PARTY_LICENSES.txt.

Citation

If you use this library in your research, please cite:

@software{noLZSS,
  title = {noLZSS: Non-overlapping Lempel-Ziv-Storer-Szymanski factorization},
  author = {Kerner, Omer},
  url = {https://github.com/OmerKerner/noLZSS},
  year = {2024}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

nolzss-1.1.1.tar.gz (5.5 MB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

nolzss-1.1.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (623.7 kB view details)

Uploaded CPython 3.12manylinux: glibc 2.17+ x86-64

nolzss-1.1.1-cp312-cp312-macosx_11_0_arm64.whl (400.5 kB view details)

Uploaded CPython 3.12macOS 11.0+ ARM64

nolzss-1.1.1-cp312-cp312-macosx_10_15_x86_64.whl (442.4 kB view details)

Uploaded CPython 3.12macOS 10.15+ x86-64

nolzss-1.1.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (623.7 kB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ x86-64

nolzss-1.1.1-cp311-cp311-macosx_11_0_arm64.whl (400.1 kB view details)

Uploaded CPython 3.11macOS 11.0+ ARM64

nolzss-1.1.1-cp311-cp311-macosx_10_15_x86_64.whl (440.3 kB view details)

Uploaded CPython 3.11macOS 10.15+ x86-64

nolzss-1.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (622.4 kB view details)

Uploaded CPython 3.10manylinux: glibc 2.17+ x86-64

nolzss-1.1.1-cp310-cp310-macosx_11_0_arm64.whl (398.7 kB view details)

Uploaded CPython 3.10macOS 11.0+ ARM64

nolzss-1.1.1-cp310-cp310-macosx_10_15_x86_64.whl (439.0 kB view details)

Uploaded CPython 3.10macOS 10.15+ x86-64

nolzss-1.1.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (623.2 kB view details)

Uploaded CPython 3.9manylinux: glibc 2.17+ x86-64

nolzss-1.1.1-cp39-cp39-macosx_11_0_arm64.whl (398.9 kB view details)

Uploaded CPython 3.9macOS 11.0+ ARM64

nolzss-1.1.1-cp39-cp39-macosx_10_15_x86_64.whl (439.0 kB view details)

Uploaded CPython 3.9macOS 10.15+ x86-64

nolzss-1.1.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (622.8 kB view details)

Uploaded CPython 3.8manylinux: glibc 2.17+ x86-64

nolzss-1.1.1-cp38-cp38-macosx_11_0_arm64.whl (398.6 kB view details)

Uploaded CPython 3.8macOS 11.0+ ARM64

nolzss-1.1.1-cp38-cp38-macosx_10_15_x86_64.whl (438.8 kB view details)

Uploaded CPython 3.8macOS 10.15+ x86-64

File details

Details for the file nolzss-1.1.1.tar.gz.

File metadata

  • Download URL: nolzss-1.1.1.tar.gz
  • Upload date:
  • Size: 5.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for nolzss-1.1.1.tar.gz
Algorithm Hash digest
SHA256 105b3b051b274337506a09ff22d4338c1cc71364eb65eca8c5718c2d4f99279e
MD5 0bdbaf8ee41832c0f2242eed6967b5f5
BLAKE2b-256 ec6ed71cdd019486b1141bfa423f37b687a67a1227602367df9910b7a65cf5b9

See more details on using hashes here.

File details

Details for the file nolzss-1.1.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 a988ccfafe0ecda8121cda3a461f930732c1cdc859f1904d5578a17aba50b6dc
MD5 9cc5931a64135460a1f8ad09aa9edd68
BLAKE2b-256 d19c632957147eeca8b75ea04816e3abe132b080a2b13e8bc7a3aac2d5ae7b9c

See more details on using hashes here.

File details

Details for the file nolzss-1.1.1-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.1-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 aa917c17464de5306f125d9258aec08968114aa9525af78a4ae3c909ac4d82b5
MD5 0b3b7bb6367de88c58aba243c6a5d60f
BLAKE2b-256 2b6cc242be3dc572f71c6920f07dc03d8e781926dbca32608cd82f666e34957c

See more details on using hashes here.

File details

Details for the file nolzss-1.1.1-cp312-cp312-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.1-cp312-cp312-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 8afc497c991f95169f2d9d1a15d86f0535ab444478f24e18902c32ee1ca7ddfe
MD5 b3d3142dcfafee8560ad6012d827aae8
BLAKE2b-256 2049d754dd2d3ac2aed325dac2632195442e3a2a6662b3fdb516352a69b080f0

See more details on using hashes here.

File details

Details for the file nolzss-1.1.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 9a14565ffac86148e9c49d576b1691a8ca13ec1448544908db0556f363f14e37
MD5 65d51616656e15424f9ecc52774eeac5
BLAKE2b-256 13f36edaa501273505cd8ed4e127fcb3ace30bf192ca19abbc57e51ff95554ef

See more details on using hashes here.

File details

Details for the file nolzss-1.1.1-cp311-cp311-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.1-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 be7eee9fd20c5fd8cbb0e4109360c35a504db2628572c806d8b587ffea48747b
MD5 cde9fa68dee3b76ae563d90711cd2d71
BLAKE2b-256 18343a4456080760fb3f96b79903b702e79aeba94444f9ddac9cea644966cb0a

See more details on using hashes here.

File details

Details for the file nolzss-1.1.1-cp311-cp311-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.1-cp311-cp311-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 4db8c3b036e2b9a6017b7e8375defb2b2c207772d964d8c7c476663bdafa554e
MD5 bee10c7c8b92a7d8e0ffc86e91383c5a
BLAKE2b-256 cb2298aeb4b8d2998ffe7d90d7bdeda8e2c42aad01fe6964d53430087542791d

See more details on using hashes here.

File details

Details for the file nolzss-1.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 5876ec0754ba5755ea3ba8c4acad14f81bf7945d92834a5564c3b163592d25fc
MD5 d0c237222967878e867d436862a0ea57
BLAKE2b-256 559098c5604a7f7f39eb7450ee60cd6ebe4a63375de37eebbb45500ec33774d0

See more details on using hashes here.

File details

Details for the file nolzss-1.1.1-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.1-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 d697db082e77868b42fccef635c0a78a0115cf4b79deb6c0259cc0c88142e4b9
MD5 15c57cd3c73afc371e0405cf8ceaae3c
BLAKE2b-256 2bbfcc0b8c3bc8f560149e46250d4f6619a190c14350b87217fa98d1e8fb8ccb

See more details on using hashes here.

File details

Details for the file nolzss-1.1.1-cp310-cp310-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.1-cp310-cp310-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 ea847e6ff230bff5ac99cd95137f4264289735bbb61ea4e182906b464fc2b26c
MD5 08759a358464f3a59fb857a228cd5988
BLAKE2b-256 6f79cbdd91f8973831200809ee482da47645ea7a9f24e98cea72818045f757f6

See more details on using hashes here.

File details

Details for the file nolzss-1.1.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 8831b322aed7162fb119cead66b3b165d279b1c782ccf20ff6b0b8f159a11e97
MD5 4a3a246102d79760b6fb870e1c8e350b
BLAKE2b-256 2763d398369f904d40df40f98789c1f479b95c15930e484c0267445151e38e57

See more details on using hashes here.

File details

Details for the file nolzss-1.1.1-cp39-cp39-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.1-cp39-cp39-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 a4d7276e16ca29c8eea675201ed2d0e35d5f878ee722ca831b81c64ad0b878fa
MD5 653837293ba7b0db46e9e1767eae7676
BLAKE2b-256 e4e9e1a1c98c4c4a84f6424543556e89a380c26235de9e9135176567db6020d3

See more details on using hashes here.

File details

Details for the file nolzss-1.1.1-cp39-cp39-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.1-cp39-cp39-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 8e6d5606759e154211f3920525b9f3080bffc372d1fdfc14cef2fc6c0767264a
MD5 0e9f632dd7f46b0902294fe99c30b6a9
BLAKE2b-256 f099fb2640f5159712bfc4e7bf7a231bb7e9ce6aa853856a6a55dc6f405e0a9b

See more details on using hashes here.

File details

Details for the file nolzss-1.1.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 b374d72360fa03a006eba2d748df6d98630731b18b94f4b13b89d54d91830350
MD5 78f03048cd6c59d7627c5400e2c83e11
BLAKE2b-256 3cbe18f5d23a14ecf584a94fc039e139e8018774d473c7a7b3f04ddf8ba4962c

See more details on using hashes here.

File details

Details for the file nolzss-1.1.1-cp38-cp38-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.1-cp38-cp38-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 034bbf75fe1848be53ac7a83e32e2bd84f23b0f568d5ea653773a4b681d5ae76
MD5 1038272b9b27f93a06c569c18f03215d
BLAKE2b-256 ad2005c52b8a0286f94abd249be3eda08d79593181ab65836e8042cd7ebcdd3a

See more details on using hashes here.

File details

Details for the file nolzss-1.1.1-cp38-cp38-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.1-cp38-cp38-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 d936012b59b3991f3f99a244250b5da1c2ecadb3f74f5b4e50090cb3bbe3b8e5
MD5 e0a8c63560dd32f64f4a6accd7dcfe92
BLAKE2b-256 20827b8a17b1abb0206dc4b3c338dfe957a2a97c5631d1e567dc40dc372666c6

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page