Redundancy removal tool for gene cluster mining hit sets

CAGEcleaner

[!NOTE] CAGEcleaner supports all functional cblaster modes (remote, local, hmm). We do not recommend using sessions produced by any of the combi modes.

Description

CAGEcleaner removes genomic redundancy from gene cluster mining hit sets. The redundancy in typical genome mining target databases (e.g. NCBI nr) often propagates into the result set, requiring extensive manual curation before downstream analyses and visualisation can be carried out efficiently.

Starting from a session file or hit table from a cblaster or CAGECAT run, CAGEcleaner dereplicates the hits based on a representative sample of the sequence regions that encode these hits (either full genomes or direct genomic neighbourhoods). In addition, CAGEcleaner can automatically retain additional hits associated with non-representative sequences if they exhibit significant diversity in gene cluster contents or sequence similarity. Finally, CAGEcleaner returns a filtered cblaster session file or hit table.

CAGEcleaner offers two dereplication approaches.

  • Full genome dereplication (default option): Dereplicates the full genome assemblies of the host organisms using an average nucleotide identity (ANI)-based approach via skDER, and retains the hits that are encoded by a representative assembly. This is the more conservative option, as it also takes the diversity of the host organisms into account. Choose it if you want to preserve host diversity during compression, for example to identify horizontal gene transfer (HGT) events.
  • Neighbourhood dereplication: Extracts a genomic region of a predefined length around each hit, clusters all extracted regions by sequence similarity using MMseqs2, and retains the hits associated with the representative genomic regions. This is the more aggressive option, as it ignores host diversity. Choose it if losing host diversity is not an issue.
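
Both approaches end with the same step: keep only the hits tied to a representative sequence. The toy sketch below illustrates that step and how a neighbourhood around a hit could be delimited. It is not CAGEcleaner's actual code. The `Hit` record, the function names, and the idea of a symmetric `margin` clamped to the contig bounds are assumptions made for illustration, and the set of representatives is assumed to come from an external dereplication step (skDER or MMseqs2 in CAGEcleaner's case).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical minimal hit record; not CAGEcleaner's internal data model."""
    assembly: str  # identifier of the genome assembly encoding the hit
    contig: str    # contig (scaffold) identifier
    start: int     # 0-based start of the hit cluster on the contig
    end: int       # exclusive end of the hit cluster on the contig


def neighbourhood_window(hit: Hit, contig_length: int, margin: int) -> tuple[int, int]:
    """Return the hit region extended by `margin` bases per side, clamped to the contig."""
    return max(0, hit.start - margin), min(contig_length, hit.end + margin)


def retain_representative_hits(hits: list[Hit], representatives: set[str]) -> list[Hit]:
    """Keep only the hits whose assembly was selected as a representative."""
    return [hit for hit in hits if hit.assembly in representatives]


hits = [
    Hit("asm_A", "contig_1", 1_000, 9_000),
    Hit("asm_B", "contig_7", 500, 8_200),
    Hit("asm_C", "contig_2", 40_000, 47_500),
]
# Assumed outcome of the external dereplication step: asm_A and asm_B are
# near-identical and asm_A represents both; asm_C represents itself.
representatives = {"asm_A", "asm_C"}

kept = retain_representative_hits(hits, representatives)
print([hit.assembly for hit in kept])  # ['asm_A', 'asm_C']
print(neighbourhood_window(hits[1], contig_length=10_000, margin=2_000))  # (0, 10000)
```

In neighbourhood mode the same filter would apply to identifiers of extracted regions rather than assemblies, which is why two near-identical clusters in distant hosts collapse into one hit there but survive full genome dereplication.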

[!NOTE] Although CAGEcleaner has been designed for use in conjunction with cblaster, it can also process output from other mining tools if you convert your hit table to the cblaster hit table format. See the example output for the specifics.

[Figure: CAGEcleaner workflow]

Installation and more

For installation instructions, usage, explanations and more, head over to the CAGEcleaner wiki!

[!IMPORTANT] CAGEcleaner no longer supports Windows directly. If you have a seemingly successful installation directly on your Windows system, you have likely installed v1.1.0, an old version with known bugs! There are alternative options to run CAGEcleaner on Windows.

Citations

If you found CAGEcleaner useful, please cite our manuscript:

De Vrieze, L., Biltjes, M., Lukashevich, S., Tsurumi, K., Masschelein, J. (2025) CAGEcleaner: reducing genomic redundancy in gene cluster mining. Bioinformatics https://doi.org/10.1093/bioinformatics/btaf373

CAGEcleaner relies heavily on the following tools, so please give these proper credit as well.

Salamzade, R., & Kalan, L. R. (2025). skDER and CiDDER: two scalable approaches for microbial genome dereplication. Microbial Genomics, 11(7). https://doi.org/10.1099/mgen.0.001438
Shaw, J., & Yu, Y. W. (2023). Fast and robust metagenomic sequence comparison through sparse chaining with skani. Nature Methods, 20(11), 1661–1665. https://doi.org/10.1038/s41592-023-02018-3
Steinegger, M., & Söding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35. https://doi.org/10.1038/nbt.3988

License

CAGEcleaner is freely available under an MIT license.

Use of the third-party software, libraries or code referred to in the Citations section above may be governed by separate terms and conditions or license provisions. Your use of the third-party software, libraries or code is subject to any such terms, and you should check that you can comply with any applicable restrictions or terms and conditions before use.

Download files

Source Distribution

cagecleaner-1.5.0.tar.gz (40.5 kB)

Uploaded Source

Built Distribution

cagecleaner-1.5.0-py3-none-any.whl (52.0 kB)

Uploaded Python 3

File details

Details for the file cagecleaner-1.5.0.tar.gz.

File metadata

  • Download URL: cagecleaner-1.5.0.tar.gz
  • Size: 40.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.10.0

File hashes

Hashes for cagecleaner-1.5.0.tar.gz
Algorithm Hash digest
SHA256 0474489d1b2bdfbd98a38d9f305d3adbfb69b453cc10521625c43e2dcc28f72f
MD5 e30479a0593d54226b386cdd844087b4
BLAKE2b-256 c00b101d4f9695193aab3a47f5ea23ef1f988ab4802d271accb0e65f5e968b37
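
A downloaded file can be checked against the digests listed above. The following is a minimal sketch using only the Python standard library. The helper names and the assumption that the archive sits in the current working directory are illustrative; the expected SHA256 value is the one listed for cagecleaner-1.5.0.tar.gz.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for cagecleaner-1.5.0.tar.gz
EXPECTED_SHA256 = "0474489d1b2bdfbd98a38d9f305d3adbfb69b453cc10521625c43e2dcc28f72f"


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Return True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected_hex.lower()


if __name__ == "__main__":
    # Assumed download location; adjust to wherever the archive was saved.
    archive = Path("cagecleaner-1.5.0.tar.gz")
    if archive.exists():
        print("OK" if matches_expected(archive, EXPECTED_SHA256) else "MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

Reading in chunks keeps memory use constant regardless of file size, which matters little for a 40.5 kB archive but makes the helper reusable for larger downloads.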

File details

Details for the file cagecleaner-1.5.0-py3-none-any.whl.

File metadata

  • Download URL: cagecleaner-1.5.0-py3-none-any.whl
  • Size: 52.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.10.0

File hashes

Hashes for cagecleaner-1.5.0-py3-none-any.whl
Algorithm Hash digest
SHA256 851f523af815558417026372606f6d826a987a276abb22550ccd673d221ac908
MD5 e6889e4dcf588cfd572d0ea695a86bf4
BLAKE2b-256 2e050c7b616aee8adc12bfe41cabd3e264d340aca44f2408a884df0dc2b9fc7d
