Redundancy removal tool for cblaster hit sets

These details have not been verified by PyPI

Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Scientific/Engineering :: Bio-Informatics

Project description

cagecleaner

Outline

cagecleaner removes genomic redundancy from gene cluster hit sets identified by cblaster. The redundancy in target databases used by cblaster often propagates into the result set, requiring extensive manual curation before downstream analyses and visualisation can be carried out.

Given a session file from a cblaster run (or from a CAGECAT run), cagecleaner retrieves all hit-associated genome assemblies, groups them into assembly clusters by ANI, and identifies a representative assembly for each assembly cluster using skDER. In addition, cagecleaner can re-include hits that differ at the gene cluster level despite the genomic redundancy, because of a different gene cluster content, an outlier cblaster score, or both. Finally, cagecleaner returns a filtered cblaster session file as well as a list of retained gene cluster IDs for easier downstream analysis.
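To make the idea of ANI-based assembly clusters and their representatives concrete, here is a deliberately simplified greedy sketch. It is not skDER's algorithm, which has its own selection strategy; the assembly names and ANI values are made up, and the 99.0 threshold simply mirrors the default of the --ani option.

```python
def greedy_dereplicate(genomes, ani, threshold=99.0):
    """Toy greedy dereplication (illustrative only, not skDER's method).

    Genomes are visited in the given order. A genome joins the first existing
    cluster whose representative it matches at or above the ANI threshold;
    otherwise it becomes the representative of a new cluster.
    """
    clusters = {}  # representative -> list of member genomes
    for genome in genomes:
        for representative in clusters:
            pair = frozenset((genome, representative))
            if ani.get(pair, 0.0) >= threshold:
                clusters[representative].append(genome)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters[genome] = [genome]
    return clusters


# Hypothetical pairwise ANI values (percent) between three assemblies.
toy_ani = {
    frozenset(("asm1", "asm2")): 99.6,
    frozenset(("asm1", "asm3")): 95.2,
    frozenset(("asm2", "asm3")): 95.0,
}

clusters = greedy_dereplicate(["asm1", "asm2", "asm3"], toy_ani)
print(clusters)
```

In this toy run asm2 is redundant with asm1, while asm3 forms its own cluster. Hits on a redundant assembly are the ones that may still be re-included by content or by score.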

Output

This tool will produce seven final output files:
- filtered_session.json: a filtered cblaster session file.
- filtered_binary.txt: a cblaster binary presence/absence table, containing only the retained hits.
- filtered_summary.txt: a cblaster summary file, containing only the retained hits.
- clusters.txt: the corresponding cluster IDs from the cblaster summary file for each retained hit.
- genome_cluster_sizes.txt: the number of genomes in a dereplication genome cluster, referred to by the dereplication representative genome.
- genome_cluster_status.txt: a table with scaffold IDs, their representative genome assembly and their dereplication status.
- scaffold_assembly_pairs.txt: a table with scaffold IDs and the IDs of the genome assemblies of which they are part.

There are four possible dereplication statuses:
- 'dereplication_representative': this scaffold is part of the genome assembly that has been selected as the representative of a genome cluster.
- 'readded_by_content': this scaffold has been kept as it contains a hit that is different in content from the one of the dereplication representative.
- 'readded_by_score': this scaffold has been kept as it contains a hit that has an outlier cblaster score.
- 'redundant': this scaffold has not been retained and is therefore removed from the final output.
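As a rough illustration of how the status table could be summarised, the sketch below tallies the four statuses and lists the retained scaffolds. The tab-separated layout, the column names (scaffold, representative, status) and all IDs are assumptions made for this example; check the header of your own genome_cluster_status.txt before adapting it.

```python
import csv
import io
from collections import Counter

# Toy stand-in for genome_cluster_status.txt. Layout, column names and IDs
# are hypothetical and only serve to illustrate the four statuses.
toy_status_table = io.StringIO(
    "scaffold\trepresentative\tstatus\n"
    "NZ_A.1\tGCF_1\tdereplication_representative\n"
    "NZ_B.1\tGCF_1\tredundant\n"
    "NZ_C.1\tGCF_1\treadded_by_content\n"
    "NZ_D.1\tGCF_2\tdereplication_representative\n"
    "NZ_E.1\tGCF_2\treadded_by_score\n"
)

rows = list(csv.DictReader(toy_status_table, delimiter="\t"))

# Count how often each dereplication status occurs.
status_counts = Counter(row["status"] for row in rows)

# Every status except 'redundant' means the scaffold is kept.
retained = [row["scaffold"] for row in rows if row["status"] != "redundant"]

print(dict(status_counts))
print(retained)
```

To use this on real output, replace the in-memory table with an open file handle and adjust the column names to match the actual header.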

Installation

First set up a conda environment using the env.yml file in this repo, and activate the environment.

conda env create -y -f env.yml
conda activate cagecleaner

Then install cagecleaner inside this environment using pip. First check that you have the right pip by running which pip; it should point to the pip instance inside the cagecleaner environment.

pip install cagecleaner

Dependencies

cagecleaner has been developed on Python 3.10. All external dependencies listed below are managed by the conda environment, except for the NCBI EDirect utilities, which can be installed as outlined here.

NCBI EDirect utilities (>= v21.6)
NCBI Datasets CLI (v16.39.0)
skDER (v1.2.8)
pandas (v2.2.3)
scipy (v1.14.1)
BioPython (v1.84)
more-itertools (v10.5)
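Because the external command-line tools are installed separately from the Python package, a quick check that they are reachable can save a failed run. The sketch below uses only the standard library. The executable names are assumptions based on how these tools are usually installed (esearch and efetch for NCBI EDirect, datasets for the NCBI Datasets CLI, skder for skDER) and may differ on your system.

```python
import shutil

# Assumed executable names; adjust them if your installation uses other names.
REQUIRED_TOOLS = ["esearch", "efetch", "datasets", "skder"]


def find_missing_tools(tools):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools(REQUIRED_TOOLS)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All external tools were found on PATH.")
```

Run this inside the activated cagecleaner environment, since PATH differs between conda environments.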

Usage

As input, cagecleaner expects at least the cblaster binary and summary files containing NCBI Nucleotide accession IDs. A dereplication run with the default settings can be started with:

cagecleaner -b binary.txt -s summary.txt
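When cagecleaner is called from a Python script or pipeline, the command line can be assembled programmatically. The helper below is a hypothetical convenience function, not part of cagecleaner; it only uses the -b, -s, -o, -c and -a flags listed in the help message.

```python
import shlex


def build_cagecleaner_command(binary, summary, output_dir=None, cores=None, ani=None):
    """Assemble a cagecleaner command line as a list of arguments.

    Hypothetical helper for illustration. Options left as None are omitted,
    so cagecleaner falls back to its own defaults.
    """
    cmd = ["cagecleaner", "-b", str(binary), "-s", str(summary)]
    if output_dir is not None:
        cmd += ["-o", str(output_dir)]
    if cores is not None:
        cmd += ["-c", str(cores)]
    if ani is not None:
        cmd += ["-a", str(ani)]
    return cmd


cmd = build_cagecleaner_command("binary.txt", "summary.txt", output_dir="output", cores=4)
print(shlex.join(cmd))

# To actually start the run (requires cagecleaner to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.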

Help message:

usage: cagecleaner [-c CORES] [-h] [-v] [-o OUTPUT_DIR] [-b BINARY_FILE] [-s SUMMARY_FILE] [--validate-files]
                  [--keep-downloads] [--keep-dereplication] [--keep-intermediate]
                  [--download-batch DOWNLOAD_BATCH] [-a ANI] [--no-content-revisit] [--no-score-revisit]
                  [--min-z-score ZSCORE_OUTLIER_THRESHOLD] [--min-score-diff MINIMAL_SCORE_DIFFERENCE]

   cagecleaner: A tool to remove redundancy from cblaster hits.
   
   cagecleaner reduces redundancy in cblaster hit sets by dereplicating the genomes containing the hits. 
   It can also recover hits that would have been omitted by this dereplication if they have a different gene cluster content
   or an outlier cblaster score.
   
   cagecleaner first retrieves the assembly accession IDs of each cblaster hit via NCBI Entrez-Direct utilities, 
   then downloads these assemblies using NCBI Datasets CLI, and then dereplicates these assemblies using skDER.
   If requested, cblaster hits that have an alternative gene cluster content or an outlier cblaster score 
   (calculated via z-scores) are recovered.
                                    

General:
 -c CORES, --cores CORES
                       Number of cores to use (default: 1)
 -h, --help            Show this help message and exit
 -v, --version         show program's version number and exit

Input / Output:
 -o OUTPUT_DIR, --output OUTPUT_DIR
                       Output directory (default: current working directory)
 -b BINARY_FILE, --binary BINARY_FILE
                       Path to cblaster binary file
 -s SUMMARY_FILE, --summary SUMMARY_FILE
                       Path to cblaster summary file
 --validate-files      Validate cblaster input files
 --keep-downloads      Keep downloaded genomes
 --keep-dereplication  Keep skDER output
 --keep-intermediate   Keep all intermediate data. This overrules other keep flags.

Download:
 --download-batch DOWNLOAD_BATCH
                       Number of genomes to download in one batch (default: 300)

Dereplication:
 -a ANI, --ani ANI     ANI dereplication threshold (default: 99.0)

Hit recovery:
 --no-content-revisit  Do not recover hits by cluster content
 --no-score-revisit    Do not recover hits by outlier scores
 --min-z-score ZSCORE_OUTLIER_THRESHOLD
                       z-score threshold to consider hits outliers (default: 2.0)
 --min-score-diff MINIMAL_SCORE_DIFFERENCE
                       minimum cblaster score difference between hits to be considered different. Discards outlier
                       hits with a score difference below this threshold. (default: 0.1)

   Lucas De Vrieze, 2025
   (c) Masschelein lab, VIB
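The interplay of --min-z-score and --min-score-diff can be pictured with a small sketch. This is one possible reading of the help text, not cagecleaner's actual implementation: within one genome cluster, a hit is flagged when its absolute z-score reaches the threshold and its score differs from the representative's score by at least the minimal difference. The function name, the use of the population standard deviation and the example scores are all assumptions.

```python
from statistics import mean, pstdev


def score_outliers(scores, representative_score, min_z_score=2.0, min_score_diff=0.1):
    """Return the indices of scores flagged as outliers within one genome cluster.

    Illustrative only. A score is flagged when its absolute z-score is at
    least min_z_score and it differs from representative_score by at least
    min_score_diff.
    """
    spread = pstdev(scores)
    if spread == 0:
        return []  # all scores are identical, so nothing can be an outlier
    centre = mean(scores)
    flagged = []
    for index, score in enumerate(scores):
        z_score = abs(score - centre) / spread
        if z_score >= min_z_score and abs(score - representative_score) >= min_score_diff:
            flagged.append(index)
    return flagged


# Hypothetical cblaster scores of the hits in one genome cluster.
cluster_scores = [10.0, 10.1, 9.9, 10.0, 10.05, 9.95, 10.0, 14.0]
print(score_outliers(cluster_scores, representative_score=10.0))
```

Only the last hit stands out here. In very small clusters a z-score of 2.0 cannot be reached at all, which is one reason the threshold is exposed as an option.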

Example case

We provide two example cases in the folder examples in this repo.

In the first case, 1155 gene cluster hits from Staphylococcus spp. should be reduced to 37 non-redundant hits. Running the command for the inputs in subfolder input

cd N398V589S066P61
cagecleaner -b binary.txt -s summary.txt -o output

should give the five output files in a new subfolder output. This should take about 10 minutes using 20 cores, depending on the download speed of your internet connection.

$ dir -1 output
cleaned_binary.txt
clusters.txt
genome_cluster_sizes.txt
genome_cluster_status.txt
mappings.txt

The second example case is substantially bigger. Here we queried MIBiG entry BGC0001171 (listeriolysin S), which yielded 18,610 gene cluster hits. cagecleaner should reduce this to 775 hits in about 12 hours using 20 cores.

WARNING: This example requires over 100GB of disk space.

cd BGC0001171
cagecleaner -b binary.txt -s summary.txt -o output
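Given the disk space warning for this example, it can be worth checking the free space of the target filesystem before starting. The sketch below uses the standard library only; the helper name is hypothetical and the 100 GB figure is taken from the warning as a lower bound, not as an exact requirement.

```python
import shutil

# "Over 100GB" per the warning above; treated here as a minimum.
REQUIRED_BYTES = 100 * 10**9


def has_enough_space(path=".", required_bytes=REQUIRED_BYTES):
    """Return True when the filesystem holding `path` has at least required_bytes free."""
    return shutil.disk_usage(path).free >= required_bytes


if has_enough_space("."):
    print("At least 100 GB free in the current directory.")
else:
    print("Less than 100 GB free here; consider another output directory.")
```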

Citations

cagecleaner relies heavily on the skDER genome dereplication tool and its main dependency skani, so we give these proper credit.

Salamzade, R., & Kalan, L. R. (2023). skDER: microbial genome dereplication approaches for comparative and metagenomic applications. https://doi.org/10.1101/2023.09.27.559801
Shaw, J., & Yu, Y. W. (2023). Fast and robust metagenomic sequence comparison through sparse chaining with skani. Nature Methods, 20(11), 1661–1665. https://doi.org/10.1038/s41592-023-02018-3

Please cite the cagecleaner manuscript:

In preparation

License

cagecleaner is freely available under an MIT license.

Use of the third-party software, libraries or code referred to in the Citations section above may be governed by separate terms and conditions or license provisions. Your use of the third-party software, libraries or code is subject to any such terms and you should check that you can comply with any applicable restrictions or terms and conditions before use.
