CroCoDeEL is a tool that detects cross-sample (aka well-to-well) contamination in shotgun metagenomic data

CroCoDeEL: Cross-sample Contamination Detection and Estimation of its Level 🐊

Introduction

CroCoDeEL is a tool that detects cross-sample contamination (aka well-to-well leakage) in shotgun metagenomic data.
It not only identifies contaminated samples but also pinpoints contamination sources and estimates contamination rates.
CroCoDeEL relies only on species abundance tables and requires neither negative controls nor sample positions during processing (i.e. plate maps).

Installation

CroCoDeEL is available on bioconda:

conda create --name crocodeel_env -c conda-forge -c bioconda crocodeel
conda activate crocodeel_env

Alternatively, you can use pip with Python ≥ 3.12:

pip install crocodeel

Docker and Singularity containers are also available on BioContainers.

For relatively small datasets (< 200 samples), CroCoDeEL can also be run directly from the web interface available at https://metagenopolis.github.io/CroCoDeEL_interpreter/#runCroCoDeEL

Installation test

To verify that CroCoDeEL is installed correctly, run the following command:

crocodeel test_install

This command runs CroCoDeEL on a toy dataset and checks whether the generated results match the expected ones.
To inspect the results, you can rerun the command with the --keep-results parameter.

Quick start

Input

CroCoDeEL takes as input a species abundance table in TSV format.
The first column should correspond to species names. The other columns correspond to the abundance of species in each sample.
An example is available here.

species_name    sample1    sample2    sample3    ...
species 1       0          0.05       0.07       ...
species 2       0.1        0.01       0          ...
...             ...        ...        ...        ...

CroCoDeEL works with relative abundances. The table is automatically normalized so that the abundances in each column (sample) sum to 1.
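To make the normalization concrete, here is a minimal sketch of per-sample (per-column) normalization using only the Python standard library. It illustrates what "each column sums to 1" means; it is not CroCoDeEL's own code, and the function name `normalize_columns` and the toy values are assumptions made for this example.

```python
import csv
import io


def normalize_columns(rows):
    """Rescale each sample column so that its abundances sum to 1.

    rows: list of lists; the first row is the header and the first
    column holds species names (layout assumed from the example table).
    """
    header, *body = rows
    n_samples = len(header) - 1
    totals = [0.0] * n_samples
    for row in body:
        for j in range(n_samples):
            totals[j] += float(row[j + 1])
    normalized = [header]
    for row in body:
        values = [
            float(row[j + 1]) / totals[j] if totals[j] > 0 else 0.0
            for j in range(n_samples)
        ]
        normalized.append([row[0], *values])
    return normalized


# Toy table (made-up values) in the same layout as the input example.
example_tsv = (
    "species_name\tsample1\tsample2\n"
    "species 1\t0\t0.05\n"
    "species 2\t0.1\t0.15\n"
)
rows = list(csv.reader(io.StringIO(example_tsv), delimiter="\t"))
table = normalize_columns(rows)
for row in table:
    print("\t".join(str(value) for value in row))
```

Since CroCoDeEL normalizes the table itself, this step is not required before running it; the sketch is only meant to show the expected scale of the values.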

Important: CroCoDeEL relies on accurate estimation of low-abundance (subdominant) species.
We therefore strongly recommend using Meteor to generate the species abundance table.

Alternatively, MetaPhlAn4 or sylph can also be used, although their lower sensitivity for low-abundance species may reduce the detection of low-level contamination events.
Based on our benchmarks, we do not recommend using other taxonomic profilers, as they generally do not provide sufficiently accurate abundance estimates for subdominant species.

Using MetaPhlAn4

When using MetaPhlAn4, profiling should be performed at the SGB level using the option --tax_lev t.
Alternatively, you can manually filter the abundance table to retain only SGB-level entries.
CroCoDeEL should then be run with the --filter-low-ab parameter, as described below.

Using sylph

For sylph, we recommend using GTDB representative genomes as the reference database and generating an MPA-style abundance table with sylph-tax.
The resulting abundance table should then be filtered to retain only species-level entries corresponding to the t__ taxonomic rank.
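The manual filtering mentioned for MetaPhlAn4 and sylph can be sketched as follows. This is an illustrative example, assuming a merged MPA-style table whose first non-comment line is the header and whose first column holds pipe-separated lineages ending in a rank-prefixed name. The function name `keep_sgb_rows` and the clade names are placeholders, not real taxa or part of any tool.

```python
def keep_sgb_rows(lines, rank_prefix="t__"):
    """Keep the header and the rows whose deepest rank starts with rank_prefix."""
    kept = []
    header_seen = False
    for line in lines:
        # Skip blank lines and comment lines (e.g. database version banners).
        if not line.strip() or line.startswith("#"):
            continue
        if not header_seen:
            kept.append(line)  # first non-comment line is assumed to be the header
            header_seen = True
            continue
        clade = line.split("\t", 1)[0]
        # The last pipe-separated field is the deepest taxonomic rank of the entry.
        if clade.split("|")[-1].startswith(rank_prefix):
            kept.append(line)
    return kept


# Placeholder table; lineage names are invented for the example.
mpa_lines = [
    "#example comment line",
    "clade_name\tsample1\tsample2",
    "k__Bacteria\t100.0\t100.0",
    "k__Bacteria|s__Species_A\t60.0\t40.0",
    "k__Bacteria|s__Species_A|t__SGB_A\t60.0\t40.0",
]
filtered = keep_sgb_rows(mpa_lines)
print("\n".join(filtered))
```

Matching on the last lineage field rather than searching for "t__" anywhere in the line avoids keeping rows that merely contain the prefix elsewhere.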

Search for contamination

Run the following command to identify cross-sample contamination:

crocodeel search_conta -s species_abundance.tsv -c contamination_events.tsv

CroCoDeEL will output all detected contamination events in the file contamination_events.tsv.
This TSV file includes the following details for each contamination event:

  • The contamination source
  • The contaminated sample (target)
  • The estimated contamination rate
  • The score (probability) computed by the Random Forest model
  • The species specifically introduced into the target by contamination

An example output file is available here.
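As a rough illustration of working with this output, the sketch below loads a contamination events table and sorts the events by score, for instance to decide which ones to review first. The column names (`source`, `target`, `rate`, `probability`) and the rows are assumptions made for the example; check the header of your own contamination_events.tsv and adjust `score_column` accordingly.

```python
import csv
import io


def rank_events(tsv_text, score_column):
    """Return the events as dicts, sorted by decreasing score."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    events = list(reader)
    return sorted(events, key=lambda e: float(e[score_column]), reverse=True)


# Made-up events; the column names are hypothetical.
toy_events = (
    "source\ttarget\trate\tprobability\n"
    "sampleA\tsampleB\t0.02\t0.91\n"
    "sampleC\tsampleD\t0.001\t0.55\n"
    "sampleA\tsampleE\t0.1\t0.99\n"
)
for event in rank_events(toy_events, score_column="probability"):
    print(event["source"], "->", event["target"], event["probability"])
```

To read a real file, the text can be obtained with `open("contamination_events.tsv").read()` and passed to the same function.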

If you are using MetaPhlAn4, we strongly recommend filtering out low-abundance species to improve CroCoDeEL's sensitivity.
Use the --filter-low-ab option as shown below:

crocodeel search_conta -s species_abundance.tsv --filter-low-ab 20 -c contamination_events.tsv

Visualization of the results

Contamination events can be visually inspected by generating a PDF file of scatterplots.

crocodeel plot_conta -s species_abundance.tsv -c contamination_events.tsv -r contamination_events.pdf

Each scatterplot compares, on a log scale, the species abundance profiles of a contaminated sample (x-axis) and its contamination source (y-axis).
The contamination line (in red) highlights species specifically introduced by contamination.
An example is available here.
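One way to see why contamination-specific species line up in such a plot is a simple mixing model, sketched below. It assumes the contaminated profile is (1 - rate) × target + rate × source, which is a simplification made for illustration and not necessarily the model CroCoDeEL uses. Under this assumption, a species absent from the original target has abundance rate × source abundance, so on a log scale it sits at a constant offset of log10(rate) from the source. All names and values are made up.

```python
import math


def contaminate(target, source, rate):
    """Mix two relative abundance profiles (dicts: species -> abundance)."""
    species = set(target) | set(source)
    return {
        sp: (1 - rate) * target.get(sp, 0.0) + rate * source.get(sp, 0.0)
        for sp in species
    }


# Toy profiles: sp_A and sp_B exist only in the source.
source = {"sp_A": 0.6, "sp_B": 0.3, "sp_C": 0.1}
target = {"sp_C": 0.5, "sp_D": 0.5}
rate = 0.01

mixed = contaminate(target, source, rate)

# For species introduced only by contamination, the log10 offset
# between contaminated sample and source is the same for every species.
offsets = {
    sp: math.log10(mixed[sp]) - math.log10(source[sp])
    for sp in ("sp_A", "sp_B")
}
for sp, offset in offsets.items():
    print(sp, round(offset, 6), "expected about", round(math.log10(rate), 6))
```

Species shared by both samples (here sp_C) do not follow this constant offset, because their abundance in the contaminated sample is dominated by the target's own population.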

Easy workflow

Alternatively, you can search for cross-sample contamination and create the PDF report in one command.

crocodeel easy_wf -s species_abundance.tsv -c contamination_events.tsv -r contamination_events.pdf
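When several abundance tables (for example one per plate) need to be processed, the same command can be scripted. The sketch below builds the easy_wf command line for each table, using only the options shown above, and prints it. Setting `dry_run=False` would actually execute it, which requires crocodeel to be installed and on the PATH. The helper names, the output naming scheme and the input file names are assumptions for the example.

```python
import subprocess
from pathlib import Path


def easy_wf_command(table):
    """Build the 'crocodeel easy_wf' command for one abundance table."""
    table = Path(table)
    stem = table.with_suffix("")  # e.g. plate1.tsv -> plate1
    return [
        "crocodeel", "easy_wf",
        "-s", str(table),
        "-c", f"{stem}.crocodeel.tsv",
        "-r", f"{stem}.crocodeel.pdf",
    ]


def run_all(tables, dry_run=True):
    """Print (dry run) or execute the command for each table."""
    commands = [easy_wf_command(t) for t in tables]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if crocodeel exits with an error.
            subprocess.run(cmd, check=True)
    return commands


# Hypothetical input files; nothing is executed in dry-run mode.
run_all(["plate1.tsv", "plate2.tsv"])
```

Passing the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.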

Results interpretation

CroCoDeEL is a decision-support tool and should not be considered a definitive contamination classification system. It may report false-positive contamination events, particularly for samples with similar species abundance profiles (e.g. longitudinal samples).

For this reason, we strongly recommend manually reviewing the scatterplots associated with each predicted contamination event to identify and discard potential false positives.
To learn how to interpret these scatterplots, please refer to this tutorial.

For a more efficient review workflow, we also recommend using the CroCoDeEL Interpretation Interface, which provides an interactive environment for exploring and validating CroCoDeEL results.

Reproduce results of the paper

Species abundance tables of the training, validation and test datasets are available in this repository.
You can use CroCoDeEL to analyze these tables and reproduce the results presented in the paper.
For example, to process Plate 3 from the Lou et al. dataset, first download the species abundance table:

wget --content-disposition 'https://entrepot.recherche.data.gouv.fr/api/access/datafile/:persistentId?persistentId=doi:10.57745/BH1RKY'

and then run CroCoDeEL:

crocodeel easy_wf -s PRJNA698986_P3.meteor.tab -c PRJNA698986_P3.meteor.crocodeel.tsv -r PRJNA698986_P3.meteor.crocodeel.pdf

Train a new Random Forest model

Advanced users can train a custom Random Forest model, which classifies sample pairs as contaminated or not.
You will need a species abundance table with labeled contaminated and non-contaminated sample pairs, to be used for training and testing.
To get started, you can download and decompress the dataset we used to train CroCoDeEL's default model:

wget --content-disposition 'https://entrepot.recherche.data.gouv.fr/api/access/datafile/:persistentId?persistentId=doi:10.57745/IBIPVG'
xz -d training_dataset.meteor.tsv.xz

Then, use the following command to train a new model:

crocodeel train_model -s training_dataset.meteor.tsv -m crocodeel_model.tsv -r crocodeel_model_perf.tsv

Finally, to use your trained model instead of the default one, pass it with the -m option:

crocodeel search_conta -s species_ab.tsv -m crocodeel_model.tsv -c conta_events.tsv

Citation

If you find CroCoDeEL useful, please cite:
Goulet, L. et al. "CroCoDeEL: accurate control-free detection of cross-sample contamination in metagenomic data" Nature Communications (2026). https://doi.org/10.1038/s41467-026-72637-9.
