A fast, memory-efficient tool for decontaminating sequencing reads using Bloom filters.
KmerDecon
KmerDecon is a fast, memory-efficient tool for decontaminating sequencing reads using Bloom filters. It also generates detailed reports of contaminants in sequencing data.
Authors
- Yujia Feng
- Xiaoyi Chen
- Yuxiang Li
Features
- Automatic Parameter Optimization: Automatically determines the optimal k-mer length and adjusts parameters based on desired memory and false positive rate using tools like HyperLogLog.
- Speed: Utilizes efficient hashing with MurmurHash3 for fast k-mer processing.
- Memory Efficiency: Employs Bloom filters with dynamic sizing to balance memory usage and accuracy, capable of handling billions of k-mers with minimal RAM.
- Scalability: Suitable for large datasets, such as whole-genome sequencing reads and large contamination sources like the human genome.
- Detailed Reporting: Generates comprehensive reports on contamination levels across multiple samples and filters.
- Real-Time Processing (TODO, not yet implemented): Will allow decontamination during data streaming or generation, providing immediate feedback and contaminant removal.
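To make the Bloom filter idea behind these features concrete, here is a minimal sketch of inserting and querying k-mers. It is not KmerDecon's implementation: the `TinyBloom` class and `kmers` helper are hypothetical names, the filter size is arbitrary, and BLAKE2b from the standard library stands in for MurmurHash3 so the example runs without extra packages.

```python
import hashlib


class TinyBloom:
    """Toy Bloom filter backed by a bytearray (illustrative only)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Split one BLAKE2b digest into two 64-bit values and combine them
        # (double hashing) to derive num_hashes bit positions.
        # Assumption: the real tool uses MurmurHash3 here instead.
        digest = hashlib.blake2b(item.encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def kmers(seq: str, k: int):
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


# Index the k-mers of a made-up contaminant sequence.
contaminant = "ACGTACGTTAGCCGATAGGCTTAACG"
bloom = TinyBloom(num_bits=4096, num_hashes=3)
for kmer in kmers(contaminant, 5):
    bloom.add(kmer)

# Every inserted k-mer is reported as present (no false negatives).
print(all(kmer in bloom for kmer in kmers(contaminant, 5)))
```

A Bloom filter never misses an inserted k-mer, but it can report a k-mer that was never inserted. That is why the false positive rate is a tunable parameter rather than zero.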
Installation
Prerequisites:
- Python 3.6 or higher
- pip package manager
Steps:
- Install directly from PyPI:
pip install KmerDecon
- Alternatively, to get the latest version, clone the repository and install from source:
git clone https://github.com/skysky2333/KmerDecon
cd KmerDecon
pip install .
Usage
1. Building the Bloom Filter
Generate a Bloom filter from contamination source sequences. Run kbuild --help for more details.
kbuild -c contamination.fasta -o contamination_filter.bf
Optional Arguments:
- kmer-length: Length of k-mers to generate (e.g., 31). If not provided, the tool determines the optimal k-mer length automatically.
- max-memory: Maximum memory in GB for the Bloom filter. Adjusts parameters to fit within this limit.
- false-positive-rate: Desired false positive rate (default: 0.001).
- expected-elements: Expected number of unique k-mers. If not provided, it is estimated using HyperLogLog.
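The max-memory, false-positive-rate and expected-elements options trade off against each other. The sketch below uses the standard textbook Bloom filter formulas to show that relationship; it is an assumption that KmerDecon sizes its filters this way, and the function names `bloom_parameters` and `best_fpr_for_memory` are hypothetical.

```python
import math


def bloom_parameters(expected_elements: int, false_positive_rate: float):
    """Return (bits, hash_count) from the textbook Bloom filter formulas.

    bits   = -n * ln(p) / (ln 2)^2
    hashes = (bits / n) * ln 2
    """
    ln2 = math.log(2)
    bits = math.ceil(-expected_elements * math.log(false_positive_rate) / ln2 ** 2)
    hashes = max(1, round(bits / expected_elements * ln2))
    return bits, hashes


def best_fpr_for_memory(expected_elements: int, max_bits: int) -> float:
    """Lowest false positive rate reachable within max_bits,
    assuming an optimal number of hash functions."""
    return math.exp(-max_bits * math.log(2) ** 2 / expected_elements)


# One billion unique k-mers at the default false positive rate of 0.001.
bits, hashes = bloom_parameters(1_000_000_000, 0.001)
print(f"{bits / 8 / 1e9:.2f} GB, {hashes} hash functions")
```

Under these formulas, one billion k-mers at a 0.001 false positive rate need roughly 1.8 GB and 10 hash functions. With a tighter memory cap, `best_fpr_for_memory` shows the false positive rate that has to be accepted instead.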
2. Decontaminating Reads
Filter out contaminated reads from your sequencing data. Run kdecon --help for more details.
kdecon -i reads.fastq -b example_filter/hg38.bf -o output
Optional Arguments:
- threshold: Fraction of matching k-mers to consider a read contaminated (default: 0.5).
- kmer-length: Length of k-mers used. If not provided, the k-mer length from the Bloom filter is used.
- mode: Operation mode, either filter (default) or states.
  - filter: Filters reads based on contamination levels.
  - states: Generates a states.csv report with contamination statistics. Columns:
    - {filter}_avgSimilarity: The average fraction of matching k-mers across all reads in that file for each filter.
    - {filter}_percentReadsPassing: The percentage of reads passing the threshold for each filter.
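The threshold option and the two report columns can be illustrated with a small sketch. Several things here are assumptions rather than KmerDecon's actual behavior: a plain Python set stands in for the Bloom filter, a read "passes the threshold" when its matching fraction is greater than or equal to it, and the helper names `match_fraction` and `summarize` are hypothetical.

```python
def kmers(seq, k):
    """Return every substring of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def match_fraction(read, contaminant_kmers, k):
    """Fraction of the read's k-mers found in the contaminant k-mer set."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return 0.0  # read shorter than k: nothing to match
    hits = sum(1 for kmer in read_kmers if kmer in contaminant_kmers)
    return hits / len(read_kmers)


def summarize(reads, contaminant_kmers, k, threshold=0.5):
    """Compute the two per-filter statistics described above."""
    fractions = [match_fraction(r, contaminant_kmers, k) for r in reads]
    # Assumption: ">=" is used to decide that a read reaches the threshold.
    passing = sum(1 for f in fractions if f >= threshold)
    return {
        "avgSimilarity": sum(fractions) / len(fractions),
        "percentReadsPassing": 100.0 * passing / len(fractions),
    }


# A set stands in for the Bloom filter built from a made-up contaminant.
contaminant_kmers = set(kmers("ACGTACGTAC", 4))
reads = ["ACGTACG", "TTTTTTT", "ACGTTTT"]
print([match_fraction(r, contaminant_kmers, 4) for r in reads])
print(summarize(reads, contaminant_kmers, 4))
```

In this toy input the first read matches on all of its k-mers, the second on none, and the third on one of four, so only the first read reaches the default threshold of 0.5.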
Dependencies
bitarray>=2.1.0
biopython>=1.78
mmh3>=2.5.1
hyperloglog>=0.0.12
Install dependencies with:
pip install -r requirements.txt
Contributing
Contributions and PRs are welcome!
License
This project is licensed under the MIT License. See the LICENSE file for details.
Contact
For questions or suggestions, please open an issue.