A fast, memory-efficient tool for decontaminating sequencing reads using Bloom filters.

KmerDecon

KmerDecon is a fast, memory-efficient tool for decontaminating sequencing reads using Bloom filters or Cuckoo filters. It also generates detailed reports of contaminants in sequencing data.

Authors

  • Yujia Feng
  • Xiaoyi Chen
  • Yuxiang Li

Installation

Prerequisites:

  • Python 3.6 or higher
  • pip package manager

Steps:

Run the following command inside the project directory:

pip install .

Usage

1. Building the Bloom Filter or Cuckoo Filter

Generate a Bloom filter from contamination source sequences. To generate a Cuckoo filter instead, use -s cuckoo. Run kbuild --help for more detail.

kbuild -c contamination.fasta -s bloom -o contamination_filter.bf
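To illustrate what building a Bloom filter from k-mers involves, here is a minimal sketch. It is not KmerDecon's implementation: the TinyBloom class, the kmers helper, and the use of salted hashlib.blake2b hashes are all assumptions made for this example (the dependency list suggests the tool itself relies on mmh3 and bitarray).

```python
import hashlib


class TinyBloom:
    """Toy Bloom filter (hypothetical; not KmerDecon's implementation)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash with an index.
        for seed in range(self.num_hashes):
            salt = seed.to_bytes(8, "little")
            digest = hashlib.blake2b(item.encode(), salt=salt).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def kmers(sequence, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


# Made-up contaminant sequence, used only for illustration.
contaminant = "ACGTACGTTAGCCGATAGGCTA"
bloom = TinyBloom(num_bits=1024, num_hashes=7)
for kmer in kmers(contaminant, 5):
    bloom.add(kmer)

print("ACGTA" in bloom)  # True: an inserted k-mer is always reported as present
```

A Bloom filter never misses a k-mer that was inserted, but it can report a k-mer that was never inserted; that is the false positive rate the build options control.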

Optional Arguments:

  • kmer-length: Length of k-mers to generate (e.g., 31). If not provided, the tool determines the optimal k-mer length automatically.
  • expected-elements: Expected number of unique k-mers. If not provided, it is estimated using HyperLogLog.
  • exclude-filter: Path to a .bf filter or .cms file. If provided, any k-mers present in the excluded filter are not encoded into the newly built filter.
  • max-memory: Maximum memory in GB for the Bloom filter. Adjusts parameters to fit within this limit.
  • false-positive-rate: Desired false positive rate (default: 0.01).

If you choose to build a Cuckoo filter:

  • capacity-of-cuckoofilter: The capacity of the Cuckoo filter.
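The relationship between expected-elements, false-positive-rate, and max-memory can be sketched with the textbook Bloom filter sizing formulas. This is an illustration only: the function names are hypothetical, 1 GB is assumed to mean 1024^3 bytes, and how kbuild actually adjusts its parameters is not documented here.

```python
import math


def bloom_parameters(expected_elements, false_positive_rate):
    """Return (num_bits, num_hashes) from the standard sizing formulas."""
    # m / n = -ln(p) / (ln 2)^2
    bits_per_element = -math.log(false_positive_rate) / (math.log(2) ** 2)
    num_bits = math.ceil(expected_elements * bits_per_element)
    # k = (m / n) * ln 2, rounded to a whole number of hash functions
    num_hashes = max(1, round((num_bits / expected_elements) * math.log(2)))
    return num_bits, num_hashes


def fpr_for_memory(expected_elements, max_memory_gb):
    """Lowest false positive rate reachable within a memory cap (assumes 1 GB = 1024**3 bytes)."""
    num_bits = max_memory_gb * 8 * 1024 ** 3
    return math.exp(-(num_bits / expected_elements) * math.log(2) ** 2)


num_bits, num_hashes = bloom_parameters(1_000_000, 0.01)
print(num_bits / 1_000_000)  # roughly 9.6 bits per k-mer
print(num_hashes)            # 7
```

At the default false positive rate of 0.01 these formulas give just under 10 bits per k-mer.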

2. Decontaminating Reads

Filter out contaminated reads from your sequencing data. Use kdecon --help for more detail.

Using a Bloom filter:

kdecon -i reads.fastq -d example_filter/hg38.bf -s bloom -o output

Use -s cuckoo for a Cuckoo filter.

Optional Arguments:

  • threshold: Fraction of matching k-mers to consider a read contaminated (default: 0.5).
  • kmer-length: Length of k-mers used. If not provided, the k-mer length from the Bloom filter is used.
  • mode: Operation mode, either filter (default) or states.
    • filter: Filters reads based on contamination levels.
    • states: Generates a states.csv report with contamination statistics. Columns:
      • {filter}_avgSimilarity: The average fraction of matching k-mers across all reads in that file for each filter.
      • {filter}_percentReadsPassing: The percentage of reads passing the threshold for each filter.
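The threshold rule and the two report columns can be sketched as follows. This is a simplified illustration, not the kdecon code: a plain Python set stands in for the Bloom or Cuckoo filter, the function names are hypothetical, and "passing the threshold" is assumed to mean a match fraction greater than or equal to the threshold.

```python
def match_fraction(read, k, contaminant_kmers):
    """Fraction of a read's k-mers found in the contaminant set."""
    total = len(read) - k + 1
    if total <= 0:
        return 0.0  # read shorter than k: no k-mers to test
    hits = sum(read[i:i + k] in contaminant_kmers for i in range(total))
    return hits / total


def summarize(reads, k, contaminant_kmers, threshold=0.5):
    """Return (average match fraction, percent of reads at or above threshold)."""
    fractions = [match_fraction(r, k, contaminant_kmers) for r in reads]
    avg_similarity = sum(fractions) / len(fractions)
    at_threshold = sum(f >= threshold for f in fractions)
    return avg_similarity, 100 * at_threshold / len(fractions)


# Made-up data: a set stands in for the real filter.
contaminant_kmers = {"ACG", "CGT", "GTA"}
reads = ["ACGTA", "TTTTT"]
print(summarize(reads, 3, contaminant_kmers))  # (0.5, 50.0)
```

Here the first read matches on all three of its 3-mers and the second on none, so the average fraction is 0.5 and half of the reads reach the default threshold.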

Performance

Highlights

  • With default parameters, KmerDecon achieves FPR = 0.002% and FNR = 0.05% on a simulated human read decontamination task.
  • KmerDecon is memory efficient and uses 10 bits per k-mer. (The popular tool Kraken2 uses 32 bits per k-mer.)
  • KmerDecon is fast and takes 5 min to filter 1 million reads of 150 bp each (Kraken2 takes ~8 min; both on a single thread).
  • Multi-threaded parallel building is supported.

Full Reports

  • To read the full performance report, please see: Here
  • To recreate the results in the report, please see: Here

Dependencies

  • bitarray>=2.1.0
  • biopython>=1.78
  • mmh3>=2.5.1
  • hyperloglog>=0.0.12

Install dependencies manually with:

pip install -r requirements.txt

Referenced Code

The Python cuckoofilter module is adapted from:

Author: Huy Do

Repository: https://github.com/huydhn/cuckoo-filter/blob/master/cuckoo/filter.py

License: MIT

Contributing

Contributions and PRs are welcome!

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

For questions or suggestions, please open an issue.
