Skip to main content

A program to estimate microbial richness from metagenomic sequencing data by counting k-mers

Project description

ESKRIM: EStimate with K-mers the RIchness in a Microbiome

ESKRIM is a reference-free tool that compares microbial richness in shotgun metagenomic samples by counting k-mers

Installation via pip

pip install eskrim

Usage

Basic usage

In this example, k-mer richness in a sample (sample1) consisting in two paired-end runs (run1 and run2) is computed.
Forward fastq files are taken as input. Results are saved in the file sample1.eskrim_stats.tsv

eskrim -i sample1.run1_1.fastq.gz sample1.run2_1.fastq.gz -n sample1 -s sample1.eskrim_stats.tsv

Quality control (adapters removal, read trimming) and contaminant removal (reads from the host genome) should be performed before using ESKRIM.

Run ESKRIM similarly for each sample to be compared. All TSV output files can be merged manually.

Advanced usage

Adjusting target read count for subsampling

Depending on the sequencing depth, the target number of reads to randomly draw from each sample (default = 10M) can be adjusted with the -r parameter.

eskrim -i sample1.run1_1.fastq.gz sample1.run2_1.fastq.gz -n sample1 -s sample1.eskrim_stats.tsv -r 5000000

Adjusting read length

All reads are trimmed to a given length (default = 80) because read length can vary between samples.
This length can be changed with the -l parameter.

eskrim -i sample1.run1_1.fastq.gz sample1.run2_1.fastq.gz -n sample1 -s sample1.eskrim_stats.tsv -l 100

Reproducibility

ESKRIM ensures reproducibility when using the same random number generator seed (default = 0).
To make read subsampling vary across executions, the parameters --seed can be used.

eskrim -i sample1.run1_1.fastq.gz sample1.run2_1.fastq.gz -n sample1 -s sample1.eskrim_stats.tsv --seed 1234

Interpreting the output file

ESKRIM saves the results in a TSV file consisting in several columns (-s parameter).

  • sample_name : sample name specified with -n parameter.
  • total_num_reads : number of reads in the sample before subsampling.
  • num_Ns_reads_ignored : number of reads with undetermined bases that were discarded.
  • num_too_short_reads_ignored : number of reads with undetermined bases that were discarded.
  • target_num_reads : target number of reads to draw during the subsampling step.
  • num_selected_reads : number of reads actually drawn after subsampling.
  • read_length : length at which reads were trimmed (-l parameter).
  • kmer_length : length of counted k-mers (-k parameter).
  • num_distinct_kmers : number of distinct kmers in subsampled reads.
  • num_solid_kmers : number of kmers seen at least twice.
  • num_mercy_kmers : number of non-solid kmers occuring in a read where all k-mers are not solid.
    From our experience, the sum 'num_solid_kmers + num_mercy_kmers' is an accurate proxy to compare microbial richness between samples.

WARNING: Do not consider results when num_selected_reads is strictly lower than target_num_reads.
In this case, ignore the samples concerned or decrease the number of reads to be drawn randomly (-r parameter).

Authors

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

eskrim-1.0.8.tar.gz (20.1 kB view hashes)

Uploaded Source

Built Distribution

eskrim-1.0.8-py3-none-any.whl (19.2 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page