KmerDecon

KmerDecon is a fast, memory-efficient tool for decontaminating sequencing reads using Bloom filters. It generates detailed reports of contaminants in sequencing data.

Authors

  • Yujia Feng
  • Xiaoyi Chen
  • Yuxiang Li

Features

  • Automatic Parameter Optimization: Automatically determines the optimal k-mer length and adjusts parameters based on desired memory and false positive rate using tools like HyperLogLog.
  • Speed: Utilizes efficient hashing with MurmurHash3 for fast k-mer processing.
  • Memory Efficiency: Employs Bloom filters with dynamic sizing to balance memory usage and accuracy, capable of handling billions of k-mers with minimal RAM.
  • Scalability: Suitable for large datasets, such as whole-genome sequencing reads and large contamination sources like the human genome.
  • Detailed Reporting: Generates comprehensive reports on contamination levels across multiple samples and filters.
  • Real-Time Processing: Allows for decontamination during data streaming or generation, providing immediate feedback and contaminant removal. (TODO)
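
To make the Bloom filter idea concrete, here is a minimal sketch of inserting k-mers into a filter and querying it. It is not KmerDecon's implementation: the class `TinyBloomFilter` and the helper `kmers` are hypothetical names, and the hash is BLAKE2b from the standard library used as a stand-in for MurmurHash3 so the example runs without extra packages.

```python
import hashlib


class TinyBloomFilter:
    """Illustrative Bloom filter; not KmerDecon's actual implementation."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # One BLAKE2b digest (stand-in for MurmurHash3) gives two base hashes;
        # double hashing combines them into num_hashes bit positions.
        digest = hashlib.blake2b(item.encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def kmers(sequence: str, k: int):
    """Yield every overlapping k-mer of a sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


bloom = TinyBloomFilter(num_bits=1024, num_hashes=3)
for kmer in kmers("ACGTACGTGA", k=4):
    bloom.add(kmer)

print("ACGT" in bloom)  # an inserted k-mer is always reported as present
```

A Bloom filter never misses an inserted k-mer, but it can report a k-mer that was never inserted. That is why the false positive rate is a tunable parameter.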

Installation

Prerequisites:

  • Python 3.6 or higher
  • pip package manager

Steps:

  1. Clone the repository:

    git clone https://github.com/skysky2333/KmerDecon
    
  2. Navigate to the project directory:

    cd KmerDecon
    
  3. Install the package:

    pip install .
    

Usage

1. Building the Bloom Filter

Generate a Bloom filter from contamination source sequences.

build-bloom-filter --contamination-fasta contamination.fasta --output-filter contamination_filter.bf

Optional Arguments:

  • kmer-length: Length of k-mers to generate (e.g., 31). If not provided, the tool determines the optimal k-mer length automatically.
  • max-memory: Maximum memory in GB for the Bloom filter. Adjusts parameters to fit within this limit.
  • false-positive-rate: Desired false positive rate (default: 0.001).
  • expected-elements: Expected number of unique k-mers. If not provided, it is estimated using HyperLogLog.
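
The relationship between expected-elements, false-positive-rate, and max-memory can be sketched with the standard Bloom filter sizing formulas. This is an illustration only: the function names are hypothetical, the source does not state how KmerDecon adjusts its parameters, and treating 1 GB as 1024^3 bytes is an assumption.

```python
import math


def bloom_parameters(expected_elements: int, false_positive_rate: float):
    """Return (num_bits, num_hashes) from the textbook Bloom filter formulas."""
    n, p = expected_elements, false_positive_rate
    num_bits = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    num_hashes = max(1, round(num_bits / n * math.log(2)))
    return num_bits, num_hashes


def fit_to_memory(expected_elements: int, false_positive_rate: float,
                  max_memory_gb: float):
    """Cap the filter at max_memory_gb and report the false positive rate it yields."""
    num_bits, num_hashes = bloom_parameters(expected_elements, false_positive_rate)
    max_bits = int(max_memory_gb * 8 * 1024 ** 3)  # assumption: 1 GB = 1024^3 bytes
    if num_bits > max_bits:
        # Shrink the filter and re-pick the hash count for the smaller size.
        num_bits = max_bits
        num_hashes = max(1, round(num_bits / expected_elements * math.log(2)))
    achieved_fpr = (
        1 - math.exp(-num_hashes * expected_elements / num_bits)
    ) ** num_hashes
    return num_bits, num_hashes, achieved_fpr


bits, hashes, fpr = fit_to_memory(3_000_000_000, 0.001, max_memory_gb=4)
print(f"{bits / 8 / 1024 ** 3:.2f} GiB, {hashes} hashes, FPR ~ {fpr:.4f}")
```

When the memory cap is tighter than the requested false positive rate allows, the filter in this sketch is shrunk and the achieved false positive rate rises above the requested one.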

2. Decontaminating Reads

Filter out contaminated reads from your sequencing data.

decontaminate-reads --input-reads reads.fastq --bloom-filter contamination_filter.bf --output-dir output_directory

Both --input-reads and --bloom-filter also accept a directory in place of a single file.

Optional Arguments:

  • threshold: Fraction of matching k-mers to consider a read contaminated (default: 0.5).
  • kmer-length: Length of k-mers used. If not provided, the k-mer length from the Bloom filter is used.
  • mode: Operation mode, either filter (default) or states.
    • filter: Filters reads based on contamination levels.
    • states: Generates a states.csv report with contamination statistics. Columns:
      • {filter}_avgSimilarity: The average fraction of matching k-mers across all reads in that file for each filter.
      • {filter}_percentReadsPassing: The percentage of reads passing the threshold for each filter.
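
The threshold and the two report columns can be illustrated with a short sketch. It makes several assumptions that the text above does not confirm: a plain Python set stands in for the Bloom filter, a read "passes the threshold" when its fraction of matching k-mers is greater than or equal to the threshold, and all function and key names are hypothetical.

```python
def kmers(sequence: str, k: int):
    """Yield every overlapping k-mer of a sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def similarity(read: str, contaminant_kmers, k: int) -> float:
    """Fraction of a read's k-mers that are found in the contaminant filter."""
    read_kmers = list(kmers(read, k))
    if not read_kmers:
        return 0.0  # read shorter than k: nothing to match
    hits = sum(1 for kmer in read_kmers if kmer in contaminant_kmers)
    return hits / len(read_kmers)


def summarize(reads, contaminant_kmers, k: int, threshold: float = 0.5):
    """Compute per-file statistics and the reads that stay below the threshold."""
    scores = [similarity(read, contaminant_kmers, k) for read in reads]
    # Assumption: "passing" means score >= threshold (the read looks contaminated).
    passing = [s >= threshold for s in scores]
    return {
        "avg_similarity": sum(scores) / len(scores) if scores else 0.0,
        "percent_reads_passing": 100 * sum(passing) / len(passing) if passing else 0.0,
        "kept_reads": [r for r, hit in zip(reads, passing) if not hit],
    }


# A set stands in for the Bloom filter built from the contamination source.
contaminant = set(kmers("ACGTACGT", 4))
reads = ["ACGTACGT", "TTTTTTTT", "ACGTTTTT"]
report = summarize(reads, contaminant, k=4, threshold=0.5)
print(report)
```

In this sketch, filter mode would correspond to writing out `kept_reads`, and states mode to writing the two statistics as one row per input file and filter.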

Dependencies

  • bitarray>=2.1.0
  • biopython>=1.78
  • mmh3>=2.5.1
  • hyperloglog>=0.0.12

Install dependencies with:

pip install -r requirements.txt

Contributing

Contributions and PRs are welcome!

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

For questions or suggestions, please open an issue.
