Non-overlapping Lempel-Ziv-Storer-Szymanski factorization (high-performance C++ core with Python bindings)
Project description
noLZSS
Non-overlapping Lempel–Ziv–Storer–Szymanski factorization
High-performance Python library for text factorization using compressed suffix trees. The library provides efficient algorithms for finding non-overlapping factors in text data, with both in-memory and file-based processing capabilities. Based on a paper by Dominik Köppl - Non-Overlapping LZ77 Factorization and LZ78 Substring Compression Queries with Suffix Trees
Features
- 🚀 High Performance: Uses compressed suffix trees (SDSL) for optimal factorization speed
- 💾 Memory Efficient: File-based processing for large datasets without loading everything into memory
- 🐍 Python Bindings: Easy-to-use Python interface with proper GIL management
- 📊 Multiple Output Formats: Get factors as lists, counts, or binary files
- 🔧 Flexible API: Support for both strings and files with optional performance hints
- 🧬 Genomics Support: Specialized functions for FASTA file processing of nucleotide and protein sequences
- ⚡ C++ Extensions: High-performance C++ implementations for memory-intensive operations
Installation
From Source (Development)
git clone https://github.com/OmerKerner/noLZSS.git
cd noLZSS
pip install -e .
Requirements
- Python 3.8+
- C++17 compatible compiler
- CMake 3.20+
Quick Start
Basic Usage
import noLZSS
# Factorize a text string
text = b"abracadabra"
factors = noLZSS.factorize(text)
print(factors) # [(0, 1, 0), (1, 1, 1), (2, 1, 2), (3, 1, 0), (4, 1, 4), (5, 1, 0), (6, 1, 6), (7, 4, 0)]
Working with Files
# Factorize text from a file
factors = noLZSS.factorize_file("large_text.txt")
print(f"Found {len(factors)} factors")
# Just count factors without storing them (memory efficient)
count = noLZSS.count_factors_file("large_text.txt")
print(f"Total factors: {count}")
# Write factors to binary file for later processing
noLZSS.write_factors_binary_file("input.txt", "factors.bin")
Genomics Applications
import noLZSS.genomics
# Process DNA sequences from FASTA file
results = noLZSS.genomics.read_nucleotide_fasta("sequences.fasta")
for seq_id, factors in results:
print(f"Sequence {seq_id}: {len(factors)} factors")
Batch Processing
For processing multiple FASTA files, noLZSS provides batch processing scripts:
Local Batch Processing
# Process multiple files locally with parallel downloads and factorization
python -m noLZSS.genomics.batch_factorize \
--file-list files.txt \
--output-dir results \
--mode with_reverse_complement \
--max-workers 4
LSF Cluster Batch Processing
# Process on LSF cluster with optimal resource allocation
python -m noLZSS.genomics.lsf_batch_factorize \
--file-list files.txt \
--output-dir results \
--max-threads 16 \
--mode with_reverse_complement
The LSF batch processor:
- Estimates time, memory, and disk requirements from benchmark data
- Allocates optimal threads per job based on file size
- Submits jobs with
bsuband tracks completion - Provides consolidated failure reports without log spam
See LSF Batch Factorizer Documentation for details.
Algorithm Details
The library implements the Non-overlapping Lempel-Ziv-Storer-Szymanski (LZSS) factorization algorithm using:
- Compressed Suffix Trees: Built using the SDSL (Succinct Data Structure Library)
- Range Minimum Queries: For efficient lowest common ancestor computations
- Sink-based Processing: Memory-efficient processing using callback functions
Performance
- Time Complexity: 𝒪(𝑛 lgϵ 𝑛) for factorization, where n is input length, and 𝜖 ∈ (0,1]
- Space Complexity: 𝒪(𝑛lg𝜎) for suffix tree construction, where 𝜎 is the alphabet size
- Memory Usage: File-based processing uses minimal memory for large files
- C++ Extensions: Specialized high-performance functions for memory-intensive genomics operations
Documentation
Complete documentation is available at omerkerner.github.io/noLZSS
The documentation includes:
- Python API Reference: Complete Python API with examples and parameter descriptions
- C++ API Reference: Auto-generated C++ API documentation from source code
- Genomics Module: Specialized functions for biological sequence analysis
- Examples and Tutorials: Comprehensive usage examples and best practices
Development
Building from Source
# Clone repository
git clone https://github.com/OmerKerner/noLZSS.git
cd noLZSS
# Install in development mode
pip install -e .
# Run tests
python -m pytest tests/
License
This project is licensed under the BSD 3-Clause License (see LICENSE).
The repository vendors third-party components (notably SDSL v3). Third-party license texts and attribution are provided in THIRD_PARTY_LICENSES.txt.
Citation
If you use this library in your research, please cite:
@software{noLZSS,
title = {noLZSS: Non-overlapping Lempel-Ziv-Storer-Szymanski factorization},
author = {Kerner, Omer},
url = {https://github.com/OmerKerner/noLZSS},
year = {2024}
}
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distributions
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file nolzss-0.3.2.tar.gz.
File metadata
- Download URL: nolzss-0.3.2.tar.gz
- Upload date:
- Size: 5.5 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
976e80b8abc2c691bb3373c2ee8c5bdce2fcc14b4b297467550c229270a77d78
|
|
| MD5 |
994e03f0f1152c15e4bd821c5de2ef1e
|
|
| BLAKE2b-256 |
d022039a1057a777b8bd74d14c5b328b93211f39ff91fd73919b03e1b8a6f6f7
|
File details
Details for the file nolzss-0.3.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: nolzss-0.3.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 624.9 kB
- Tags: CPython 3.12, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
1e3ba772daa71c7387e3afde02f0166b6e4497358f6e88bbf62625631cdf1a1e
|
|
| MD5 |
f55cd7b5943046ee0e00089006d7fa81
|
|
| BLAKE2b-256 |
f52b1aa351468ff365f80573f0fb3a1369fce8bb104cef24369a11a2e692fc62
|
File details
Details for the file nolzss-0.3.2-cp312-cp312-macosx_11_0_arm64.whl.
File metadata
- Download URL: nolzss-0.3.2-cp312-cp312-macosx_11_0_arm64.whl
- Upload date:
- Size: 399.4 kB
- Tags: CPython 3.12, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
88ebf04a554965279e46f45e3eaec54493534931676f7065dc4cf7a97259138d
|
|
| MD5 |
a57f7e4fcf9cbdaae68bf16c7e3e43d4
|
|
| BLAKE2b-256 |
dd7c7dd534bd27e0d11872ee14d82225860ebf94bbd46e0843e0e5809fc000d1
|
File details
Details for the file nolzss-0.3.2-cp312-cp312-macosx_10_15_x86_64.whl.
File metadata
- Download URL: nolzss-0.3.2-cp312-cp312-macosx_10_15_x86_64.whl
- Upload date:
- Size: 442.8 kB
- Tags: CPython 3.12, macOS 10.15+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
02fdefcb91c9f7986f9203eb1deda9a24679a3a68a3f8dcf05701c60ae755064
|
|
| MD5 |
996c2e8fdaa6b5b770712b92a9ad688d
|
|
| BLAKE2b-256 |
19fdd6d0eb2fc2bdd11bf2db9a937eb398ed9618eeefcb8faad4ef5ef30211e7
|
File details
Details for the file nolzss-0.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: nolzss-0.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 625.5 kB
- Tags: CPython 3.11, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b5251191b004c5ef0ccc982d13f871af0238e6850ceacdac765f22b3b3fd99c8
|
|
| MD5 |
8b28486615a0961f5ad7431f65ec00d0
|
|
| BLAKE2b-256 |
4074510535b190242a0ee45b560811574cc64d8914d5139d1d7bf43d2e907f61
|
File details
Details for the file nolzss-0.3.2-cp311-cp311-macosx_11_0_arm64.whl.
File metadata
- Download URL: nolzss-0.3.2-cp311-cp311-macosx_11_0_arm64.whl
- Upload date:
- Size: 398.9 kB
- Tags: CPython 3.11, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
e5b91afc4266000d0fb6c22baf1d41ea6af6758e6a4c2f2f8ecf2571dbd941d9
|
|
| MD5 |
bcb02021cfa3a2c5b204b5bb038daeea
|
|
| BLAKE2b-256 |
660c9c2909a4a1f14bb038658ad78bb66c56cdb203244d2f3b260e77ed25316b
|
File details
Details for the file nolzss-0.3.2-cp311-cp311-macosx_10_15_x86_64.whl.
File metadata
- Download URL: nolzss-0.3.2-cp311-cp311-macosx_10_15_x86_64.whl
- Upload date:
- Size: 440.9 kB
- Tags: CPython 3.11, macOS 10.15+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
cb2b567695c1443275f28b78c0031bd8820bcf347b380d405bf6c55ec65c8768
|
|
| MD5 |
831a959eabd7eda746862c878590c6ca
|
|
| BLAKE2b-256 |
7cdd66b2353a27a6fcfca597f85558a069b999fc616aaf4f668f29e25916334b
|
File details
Details for the file nolzss-0.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: nolzss-0.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 623.7 kB
- Tags: CPython 3.10, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
1efb20b4ed489e103f5b634e48da0c502ab70c9778d35eef57aa6e0fbc89a471
|
|
| MD5 |
46dee8c98ce84914cf3b06e345c7dac1
|
|
| BLAKE2b-256 |
d5f662425f5ce8086d72695732349e862604eb320f95dfa024325dc2b7a085ce
|
File details
Details for the file nolzss-0.3.2-cp310-cp310-macosx_11_0_arm64.whl.
File metadata
- Download URL: nolzss-0.3.2-cp310-cp310-macosx_11_0_arm64.whl
- Upload date:
- Size: 397.8 kB
- Tags: CPython 3.10, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a978b2e59e305ed7c46935e59446f613ec4221e737c53092ffff6c65b1d8b608
|
|
| MD5 |
6542a8462ef3c5f75ac582a405a40a60
|
|
| BLAKE2b-256 |
2e0059cc819390d0c0dce453eab5ffcac0cf61a01fb8505c26269a881018bfd5
|
File details
Details for the file nolzss-0.3.2-cp310-cp310-macosx_10_15_x86_64.whl.
File metadata
- Download URL: nolzss-0.3.2-cp310-cp310-macosx_10_15_x86_64.whl
- Upload date:
- Size: 439.6 kB
- Tags: CPython 3.10, macOS 10.15+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
4c7d81679aa51ebe86733950179b9522ec54d04c3deadcc8640273b5b2e0f5e6
|
|
| MD5 |
26db74b999a88462b314883c1a56f259
|
|
| BLAKE2b-256 |
d386b3318f92d34841f25f5059dbd2ed544fa7d10159f86dbd3c7184f65c9324
|
File details
Details for the file nolzss-0.3.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: nolzss-0.3.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 624.2 kB
- Tags: CPython 3.9, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a83beec74656c5df7994537ccbca7e3f464184c6bc802e2b759b385d59418ed8
|
|
| MD5 |
5645e3a5f2247cfaa2515cecc688f327
|
|
| BLAKE2b-256 |
fbbd922509fe66f8ab89ef5cc4a3c3fee15cba9fc8b51e57afe48302d6362a95
|
File details
Details for the file nolzss-0.3.2-cp39-cp39-macosx_11_0_arm64.whl.
File metadata
- Download URL: nolzss-0.3.2-cp39-cp39-macosx_11_0_arm64.whl
- Upload date:
- Size: 397.9 kB
- Tags: CPython 3.9, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a133ab5e5b93018a5aebecdd85f96884dd58713a0738646a6f975a68c74ef0f0
|
|
| MD5 |
cea6bf7e5efe6e404e02c3eff834e92a
|
|
| BLAKE2b-256 |
8a472e66b01b5b68650113a33cd1152ac46d8efd2b07b3671391ad997a33f87f
|
File details
Details for the file nolzss-0.3.2-cp39-cp39-macosx_10_15_x86_64.whl.
File metadata
- Download URL: nolzss-0.3.2-cp39-cp39-macosx_10_15_x86_64.whl
- Upload date:
- Size: 439.8 kB
- Tags: CPython 3.9, macOS 10.15+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ff510124003f55c2106a10c8d87d151272f8f03ca25376d3bacf6bb2c898ea2f
|
|
| MD5 |
a7e701fef1110228253c2a9015780baa
|
|
| BLAKE2b-256 |
07d6e793ff9734a2e7df0285240f52a6cd2c4792fcd4033134aed6c7df119aa9
|
File details
Details for the file nolzss-0.3.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: nolzss-0.3.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 623.7 kB
- Tags: CPython 3.8, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
7aaebb02e072edc2e695482a23fa5e35dbc6c2f0faae31fe19865357d1c98d97
|
|
| MD5 |
1b1d547eb5ea341ab94e6a78f7133791
|
|
| BLAKE2b-256 |
e5a4a4aaaa219b3e72c60866c178c301473c2011fbb11b70da3cd03a37a8e831
|
File details
Details for the file nolzss-0.3.2-cp38-cp38-macosx_11_0_arm64.whl.
File metadata
- Download URL: nolzss-0.3.2-cp38-cp38-macosx_11_0_arm64.whl
- Upload date:
- Size: 397.7 kB
- Tags: CPython 3.8, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
637be0396a0ad8788e67156bb7228709e4757011104284396ef3d81be1a2b158
|
|
| MD5 |
ce7e1e6d0e73ee5c408ce3ddcc14a871
|
|
| BLAKE2b-256 |
a3e434567f7c5b3f0d6ccec3a8867eead2729e67f5e19a71f871b072d353f3f0
|
File details
Details for the file nolzss-0.3.2-cp38-cp38-macosx_10_15_x86_64.whl.
File metadata
- Download URL: nolzss-0.3.2-cp38-cp38-macosx_10_15_x86_64.whl
- Upload date:
- Size: 439.4 kB
- Tags: CPython 3.8, macOS 10.15+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d0b53241f0da0514b5a20496aac29a9cbd0f2f9003d7133652a907dca4d0ce23
|
|
| MD5 |
e14706d243e38db539eff792179cafd7
|
|
| BLAKE2b-256 |
37aeea59e3909a7d9532ffecc6dfb62675f1c83ceb963439ae78d2ec90718d7d
|