Skip to main content

Non-overlapping Lempel-Ziv-Storer-Szymanski factorization (high-performance C++ core with Python bindings)

Project description

noLZSS

Build Wheels Documentation noLZSS Logo

Non-overlapping Lempel–Ziv–Storer–Szymanski factorization

High-performance Python library for text factorization using compressed suffix trees. The library provides efficient algorithms for finding non-overlapping factors in text data, with both in-memory and file-based processing capabilities. Based on a paper by Dominik Köppl - Non-Overlapping LZ77 Factorization and LZ78 Substring Compression Queries with Suffix Trees

Features

  • 🚀 High Performance: Uses compressed suffix trees (SDSL) for optimal factorization speed
  • 💾 Memory Efficient: File-based processing for large datasets without loading everything into memory
  • 🐍 Python Bindings: Easy-to-use Python interface with proper GIL management
  • 📊 Multiple Output Formats: Get factors as lists, counts, or binary files
  • 🔧 Flexible API: Support for both strings and files with optional performance hints
  • 🧬 Genomics Support: Specialized functions for FASTA file processing of nucleotide and protein sequences
  • C++ Extensions: High-performance C++ implementations for memory-intensive operations

Installation

From Source (Development)

git clone https://github.com/OmerKerner/noLZSS.git
cd noLZSS
pip install -e .

Requirements

  • Python 3.8+
  • C++17 compatible compiler
  • CMake 3.20+

Quick Start

Basic Usage

import noLZSS

# Factorize a text string
text = b"abracadabra"
factors = noLZSS.factorize(text)
print(factors)  # [(0, 1, 0), (1, 1, 1), (2, 1, 2), (3, 1, 0), (4, 1, 4), (5, 1, 0), (6, 1, 6), (7, 4, 0)]

Working with Files

# Factorize text from a file
factors = noLZSS.factorize_file("large_text.txt")
print(f"Found {len(factors)} factors")

# Just count factors without storing them (memory efficient)
count = noLZSS.count_factors_file("large_text.txt")
print(f"Total factors: {count}")

# Write factors to binary file for later processing
noLZSS.write_factors_binary_file("input.txt", "factors.bin")

Genomics Applications

import noLZSS.genomics

# Process DNA sequences from FASTA file
results = noLZSS.genomics.read_nucleotide_fasta("sequences.fasta")
for seq_id, factors in results:
    print(f"Sequence {seq_id}: {len(factors)} factors")

Batch Processing

For processing multiple FASTA files, noLZSS provides batch processing scripts:

Local Batch Processing

# Process multiple files locally with parallel downloads and factorization
python -m noLZSS.genomics.batch_factorize \
    --file-list files.txt \
    --output-dir results \
    --mode with_reverse_complement \
    --max-workers 4

LSF Cluster Batch Processing

# Process on LSF cluster with optimal resource allocation
python -m noLZSS.genomics.lsf_batch_factorize \
    --file-list files.txt \
    --output-dir results \
    --max-threads 16 \
    --mode with_reverse_complement

The LSF batch processor:

  • Estimates time, memory, and disk requirements from benchmark data
  • Allocates optimal threads per job based on file size
  • Submits jobs with bsub and tracks completion
  • Provides consolidated failure reports without log spam

See LSF Batch Factorizer Documentation for details.

Algorithm Details

The library implements the Non-overlapping Lempel-Ziv-Storer-Szymanski (LZSS) factorization algorithm using:

  • Compressed Suffix Trees: Built using the SDSL (Succinct Data Structure Library)
  • Range Minimum Queries: For efficient lowest common ancestor computations
  • Sink-based Processing: Memory-efficient processing using callback functions

Tie-Breaking for DNA Factorization

When factorizing DNA sequences with reverse complement awareness, if both a forward match and a reverse complement match have the same length, the forward match is preferred. Among candidates of the same type, the one with the earliest position wins.

Performance

  • Time Complexity: 𝒪(𝑛 lgϵ 𝑛) for factorization, where n is input length, and 𝜖 ∈ (0,1]
  • Space Complexity: 𝒪(𝑛lg𝜎) for suffix tree construction, where 𝜎 is the alphabet size
  • Memory Usage: File-based processing uses minimal memory for large files
  • C++ Extensions: Specialized high-performance functions for memory-intensive genomics operations

Documentation

Complete documentation is available at omerkerner.github.io/noLZSS

The documentation includes:

  • Python API Reference: Complete Python API with examples and parameter descriptions
  • C++ API Reference: Auto-generated C++ API documentation from source code
  • Genomics Module: Specialized functions for biological sequence analysis
  • Examples and Tutorials: Comprehensive usage examples and best practices

Development

Building from Source

# Clone repository
git clone https://github.com/OmerKerner/noLZSS.git
cd noLZSS

# Install in development mode
pip install -e .

# Run tests
python -m pytest tests/

License

This project is licensed under the BSD 3-Clause License (see LICENSE).

The repository vendors third-party components (notably SDSL v3). Third-party license texts and attribution are provided in THIRD_PARTY_LICENSES.txt.

Citation

If you use this library in your research, please cite:

@software{noLZSS,
  title = {noLZSS: Non-overlapping Lempel-Ziv-Storer-Szymanski factorization},
  author = {Kerner, Omer},
  url = {https://github.com/OmerKerner/noLZSS},
  year = {2024}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

nolzss-1.1.2.tar.gz (5.5 MB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

nolzss-1.1.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (626.8 kB view details)

Uploaded CPython 3.12manylinux: glibc 2.17+ x86-64

nolzss-1.1.2-cp312-cp312-macosx_11_0_arm64.whl (403.8 kB view details)

Uploaded CPython 3.12macOS 11.0+ ARM64

nolzss-1.1.2-cp312-cp312-macosx_10_15_x86_64.whl (445.6 kB view details)

Uploaded CPython 3.12macOS 10.15+ x86-64

nolzss-1.1.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (626.7 kB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ x86-64

nolzss-1.1.2-cp311-cp311-macosx_11_0_arm64.whl (403.3 kB view details)

Uploaded CPython 3.11macOS 11.0+ ARM64

nolzss-1.1.2-cp311-cp311-macosx_10_15_x86_64.whl (443.5 kB view details)

Uploaded CPython 3.11macOS 10.15+ x86-64

nolzss-1.1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (625.4 kB view details)

Uploaded CPython 3.10manylinux: glibc 2.17+ x86-64

nolzss-1.1.2-cp310-cp310-macosx_11_0_arm64.whl (402.1 kB view details)

Uploaded CPython 3.10macOS 11.0+ ARM64

nolzss-1.1.2-cp310-cp310-macosx_10_15_x86_64.whl (442.2 kB view details)

Uploaded CPython 3.10macOS 10.15+ x86-64

nolzss-1.1.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (626.3 kB view details)

Uploaded CPython 3.9manylinux: glibc 2.17+ x86-64

nolzss-1.1.2-cp39-cp39-macosx_11_0_arm64.whl (402.3 kB view details)

Uploaded CPython 3.9macOS 11.0+ ARM64

nolzss-1.1.2-cp39-cp39-macosx_10_15_x86_64.whl (442.3 kB view details)

Uploaded CPython 3.9macOS 10.15+ x86-64

nolzss-1.1.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (625.9 kB view details)

Uploaded CPython 3.8manylinux: glibc 2.17+ x86-64

nolzss-1.1.2-cp38-cp38-macosx_11_0_arm64.whl (402.1 kB view details)

Uploaded CPython 3.8macOS 11.0+ ARM64

nolzss-1.1.2-cp38-cp38-macosx_10_15_x86_64.whl (442.0 kB view details)

Uploaded CPython 3.8macOS 10.15+ x86-64

File details

Details for the file nolzss-1.1.2.tar.gz.

File metadata

  • Download URL: nolzss-1.1.2.tar.gz
  • Upload date:
  • Size: 5.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for nolzss-1.1.2.tar.gz
Algorithm Hash digest
SHA256 40843a3f625c2e022ba64dd48419fe81abd70d312c711470c6d9adc55cea80ce
MD5 a1d06786d1f8062fed1e197181c13beb
BLAKE2b-256 c8242be60d3438515b38cffd19b6cfba9a942e067145a6aed1efefe2923a8c06

See more details on using hashes here.

File details

Details for the file nolzss-1.1.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 caf63452eab6f52eac4b59274e91c4be1bc8cf47dc19cb5dda4bc5cf55296fdb
MD5 f73422953b522e6e2834391304840ee7
BLAKE2b-256 3ee5271fdf743c0123bec42ced55b82b9d0c3dfea401d0b84d02a725697c13e7

See more details on using hashes here.

File details

Details for the file nolzss-1.1.2-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.2-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 aab127c4b6f4bfe6af8a27581997655da500651f23fe336f912b2390c623e09b
MD5 a7c553ab6506bbd6226201b2ebc42d94
BLAKE2b-256 dd93c37e7f2bcb804cd3f21fd7c73f0e52fdd826d924d271122f38a01b8d66f1

See more details on using hashes here.

File details

Details for the file nolzss-1.1.2-cp312-cp312-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.2-cp312-cp312-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 54c35d13c5d3ce07097eaa22dd7eff58a6073509d34030b610c6a594b571ab7a
MD5 d397ad7b8ac06c56bdc9a4e8c8bab43a
BLAKE2b-256 71ede1222776047906b33a101bf4b5fd404e96cfb928139a2e211d01c09ace56

See more details on using hashes here.

File details

Details for the file nolzss-1.1.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 b88c47624838f1338835b2038c4ea4516ad3ab48d23b2405eb3b9e71401e327a
MD5 fb13399d69858439b65f289216729240
BLAKE2b-256 83416ef33ed00bdcd6824437eec979f67dc590a3d35b476f1a0aae433a80f002

See more details on using hashes here.

File details

Details for the file nolzss-1.1.2-cp311-cp311-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.2-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 c08ed89ba8368f4e7a670df80af31ef1a814c5f532141e5c71af50282744f4a8
MD5 74adf071b31597ea57194a524aba27ff
BLAKE2b-256 7a8316f58a9501015145286cba10415cad71986d2ce82607e9337850b3edaf1f

See more details on using hashes here.

File details

Details for the file nolzss-1.1.2-cp311-cp311-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.2-cp311-cp311-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 d292baea36c6713741a42ca24f9362de0f298cca77bf4b36b2ce4e5b06c75fce
MD5 b1de86276c62155ad1d128784fb46083
BLAKE2b-256 c4778699173f2734495ecada9eb2d3adc223e98d424acf1314c4dbd5f45f7baf

See more details on using hashes here.

File details

Details for the file nolzss-1.1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 f3dfe843217fa071a8b5856dc0a0fc6b931ab8783b3bfaba57f53f2e8da99c6e
MD5 4c2b366b803d38df7916987920b15aeb
BLAKE2b-256 9bde9240a4415079bc0ea2d954892b4b3dca69889340f0fbb7cfceea8764ef88

See more details on using hashes here.

File details

Details for the file nolzss-1.1.2-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.2-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 2ba0313936f36fcba624874b48259f583aadbbe7191273d37559fee3fe097d24
MD5 1558bec1915016c4ef342ba0e71480c6
BLAKE2b-256 11e91e03fa3cfea736f4ddf8dfc8e4e3ca1afed195d7bd3e06c4732f410547f5

See more details on using hashes here.

File details

Details for the file nolzss-1.1.2-cp310-cp310-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.2-cp310-cp310-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 5ad38de7250ab5b8d348145f0ee1e3547f9f71dba6646414e186a64a2b819e7e
MD5 d4c3b49ac8d81903b06d96ddf2dda622
BLAKE2b-256 657bc03fcc3d7ef2aee01522bf632938faf22066aade381a6fbbbf05bddb9387

See more details on using hashes here.

File details

Details for the file nolzss-1.1.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 18554fc78d09eb9d4f69fe4845880b5f445853afcc9541c08edecef34f74b932
MD5 37ad47bc4f2608a494a68682d865a8fd
BLAKE2b-256 a37fd8778e63406c615d1a830c5b62a3d3299c98863093c9b5e0c7f02acd417c

See more details on using hashes here.

File details

Details for the file nolzss-1.1.2-cp39-cp39-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.2-cp39-cp39-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 a1e374db013f0c3608ed83d35e3e07ee2e8fdd9df1a3ce02a8f9ca11f6198806
MD5 74eefa32b2943510c2005672336b5083
BLAKE2b-256 25129dc3d70d4919003b0c6cb59db16f032e1e1c96735baf20f86dcfcdd9404f

See more details on using hashes here.

File details

Details for the file nolzss-1.1.2-cp39-cp39-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.2-cp39-cp39-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 b0d934a401d7e7354347718739f3ef57ca5c45b209049ae30ba4a4b84062d15c
MD5 8a57951f650bd058b815b612cf5f8309
BLAKE2b-256 5b7dc116292648a21483f61b8c071d72e445bdb29b4502014b72ce9eaf5ae50f

See more details on using hashes here.

File details

Details for the file nolzss-1.1.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 0156b8fea0c8b8f2fa0a268d58719fdc837b39c0daa12a0b10082a3b4acaa5bd
MD5 22a0c82f412cfdddb5b85fe3e517ee5b
BLAKE2b-256 01c718be181bb3cf330bbe1529dc854c9ee17ea62f4652b60aae40f9d25e596d

See more details on using hashes here.

File details

Details for the file nolzss-1.1.2-cp38-cp38-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.2-cp38-cp38-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 1e2035969b193687d889f119e1fd7f007f5808ba78d70f2645b0aaeb9e3d7b44
MD5 1bb0f12b84e28c5f4c3d342392735283
BLAKE2b-256 68333b036d552662520e7bec75c56f42d0329cf7f71611cd1798633a0324602a

See more details on using hashes here.

File details

Details for the file nolzss-1.1.2-cp38-cp38-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for nolzss-1.1.2-cp38-cp38-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 54781b6c73b538133197b6d2ab4d55db0cef6b802ff0e77550e512d69a520c08
MD5 03482a1434bf14d15f5f55a79e625411
BLAKE2b-256 4a00a5b329ffaf29b71cccdfa92bd07ea62a7d16c8e60f333e92ae1f391ace24

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page