Co-occurrence Locus and Orthologous Cluster Identifier

NOTE

Extensive alpha testing has been conducted, though this software is in a beta state. Errors are expected; rerunning without changing parameters is often sufficient to resume appropriately. Kindly raise git issues for errors, and if you can find the bug, even better! Documentation is currently in the works.


PURPOSE

The most common gene cluster detection algorithms focus on canonical “core” biosynthetic functions that many gene clusters encode, while overlooking uncommon or unknown cluster classes. These overlooked clusters are a potential source of novel natural products and comprise an untold portion of overall gene cluster repertoires. Unbiased, function-agnostic detection algorithms therefore provide an opportunity to reveal novel classes of gene clusters and more broadly define genome organization. CLOCI (Co-occurrence Locus and Orthologous Cluster Identifier) is an algorithm that identifies gene clusters using multiple proxies of selection for coordinated gene evolution. In the process, CLOCI circumscribes loci into homologous locus groups, an extension of orthogroups to the locus level. Our approach generalizes gene cluster detection and gene cluster family circumscription, improves detection of multiple known functional classes, and unveils noncanonical gene clusters. CLOCI is suitable for genome-enabled specialized metabolite mining, and presents an easily tunable approach for delineating gene cluster families and homologous loci.


INSTALL

Please create a conda environment and manually install some dependencies:

conda create -n cloci mycotools graph-tool python pip

Then install cloci into the environment:

conda activate cloci
python3 -m pip install cloci

A conda package will be available in the future.


USE

Input dataset

CLOCI takes as input a tab-delimited file of genome metadata with the following columns:

#genus	species	strain	assembly_path	gffpath

or a preassembled MycotoolsDB. It is important to adequately sample a cluster's distribution to detect it. I thus generally recommend running CLOCI at least at the subphylum level. This varies depending on the lineage's rate of microsynteny decay and the phylogenetic distance over which horizontal transfer occurs.
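For the tab-delimited option, a minimal sketch of assembling and sanity-checking the metadata file in Python is given here. The column names come from the header above; the function name, the example genomes, and the file paths are hypothetical placeholders, and the checks only cover what the header implies (five tab-free fields per row), not everything CLOCI may validate.

```python
# Column names as listed in the header line above.
COLUMNS = ["#genus", "species", "strain", "assembly_path", "gffpath"]


def format_metadata(rows):
    """Return the metadata table as tab-delimited text, one genome per line.

    Hypothetical helper: it only checks the field count and that no field
    contains a tab or newline, which would corrupt the table.
    """
    lines = ["\t".join(COLUMNS)]
    for number, row in enumerate(rows, start=1):
        if len(row) != len(COLUMNS):
            raise ValueError(f"row {number}: expected {len(COLUMNS)} fields, got {len(row)}")
        for field in row:
            if "\t" in field or "\n" in field:
                raise ValueError(f"row {number}: field {field!r} contains a tab or newline")
        lines.append("\t".join(row))
    return "\n".join(lines) + "\n"


# Placeholder genomes; the assembly and GFF paths are made up for illustration.
example_rows = [
    ["Agaricus", "bisporus", "H97", "genomes/agabis.fna", "genomes/agabis.gff3"],
    ["Coprinopsis", "cinerea", "Okayama7", "genomes/copcin.fna", "genomes/copcin.gff3"],
]

metadata_text = format_metadata(example_rows)
print(metadata_text, end="")
```

Writing `metadata_text` to a file would give a table in the layout described above; whether CLOCI accepts relative paths in the path columns is not stated here, so absolute paths may be the safer choice.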

CLOCI fundamentally relies on reconstructing a microsynteny phylogeny that accurately depicts divergence in gene order between genomes. While CLOCI attempts to automatically detect near single-copy gene families for reconstructing this tree, it is recommended to explicitly input these near single-copy genes using the -f argument, which references a file of reference genes, one per line. Ideally, these same genes would also be used to reconstruct a phylogenomic tree, and the microsynteny topology would be constrained to this reference phylogenomic tree via the -c argument in conjunction with -r for selecting the genomes from which to derive the outgroup branch.
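To give an intuition for "divergence in gene order", here is a toy sketch that compares two genomes by the gene-family adjacencies they share. This is not how CLOCI builds its microsynteny tree; the distance measure (one minus the Jaccard index of adjacent family pairs), the function names, and the family identifiers are all assumptions made for illustration.

```python
def adjacent_pairs(gene_order):
    """Unordered pairs of neighbouring gene families along one contig."""
    return {
        frozenset(pair)
        for pair in zip(gene_order, gene_order[1:])
        if pair[0] != pair[1]  # ignore tandem duplicates of the same family
    }


def order_distance(order_a, order_b):
    """Toy gene-order distance: 1 - Jaccard index of shared adjacencies."""
    pairs_a, pairs_b = adjacent_pairs(order_a), adjacent_pairs(order_b)
    union = pairs_a | pairs_b
    if not union:
        return 0.0
    return 1.0 - len(pairs_a & pairs_b) / len(union)


# Hypothetical homology-group orders on one contig of two genomes;
# genome_b carries a swap of hg3 and hg4 relative to genome_a.
genome_a = ["hg1", "hg2", "hg3", "hg4", "hg5"]
genome_b = ["hg1", "hg2", "hg4", "hg3", "hg5"]

print(order_distance(genome_a, genome_a))
print(round(order_distance(genome_a, genome_b), 3))
```

The swap leaves only two of six distinct adjacencies shared, so even a small rearrangement moves the toy distance well away from zero. That sensitivity is why gene order can resolve divergence between closely related genomes.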

Hyperparameters

CLOCI default parameters have been tuned for our initial dataset of ~2,250 fungi across the kingdom. These should suffice for circumscribing homologous loci in most analyses, though the gene cluster family filtering parameters are ideally determined by referencing known clusters from your particular dataset. By default, thresholds for all proxies of coordinated gene evolution are set to 0. The appropriate thresholds vary with the type of clusters of interest and the lineage. I recommend compiling a dataset of known cluster reference genes, running CLOCI, identifying those genes in the output, determining the values for the reference cluster proxies, and then implementing the thresholds.
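The last step of that recommendation can be sketched as follows, assuming the proxy values of each recovered reference cluster have already been collected into a dictionary by hand. The proxy names, the values, and the assumption that a cluster passes when its value is greater than or equal to the threshold are all placeholders; parsing CLOCI output is deliberately left out because its format is not described here.

```python
import math


def derive_thresholds(reference_proxies, keep_fraction=1.0):
    """Pick one minimum threshold per proxy from reference cluster values.

    reference_proxies: {cluster_name: {proxy_name: value}} (hypothetical layout).
    keep_fraction: share of reference clusters that must still pass each
    threshold; 1.0 uses the lowest observed reference value.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    values_by_proxy = {}
    for proxies in reference_proxies.values():
        for name, value in proxies.items():
            values_by_proxy.setdefault(name, []).append(value)
    thresholds = {}
    for name, values in values_by_proxy.items():
        ranked = sorted(values, reverse=True)
        # Index of the lowest-ranked cluster that must still pass.
        cutoff_index = math.ceil(keep_fraction * len(ranked)) - 1
        thresholds[name] = ranked[cutoff_index]
    return thresholds


# Made-up proxy values for four known clusters found in the output.
reference_proxies = {
    "ref1": {"proxy_a": 0.40, "proxy_b": 0.10},
    "ref2": {"proxy_a": 0.25, "proxy_b": 0.30},
    "ref3": {"proxy_a": 0.60, "proxy_b": 0.05},
    "ref4": {"proxy_a": 0.35, "proxy_b": 0.20},
}

strict = derive_thresholds(reference_proxies)
relaxed = derive_thresholds(reference_proxies, keep_fraction=0.75)
print(strict)
print(relaxed)
```

Keeping every reference cluster gives the most permissive thresholds. Lowering `keep_fraction` trades recovery of a few reference clusters for stricter filtering, which may be preferable when one reference is an outlier.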

There are numerous hyperparameters that can drastically affect output quality. I suspect our pilot study reached a local maximum in output quality, and further hyperparameter tuning may reach a global maximum.

Example

Extract a MycotoolsDB of Agaricomycotina

mtdb e -l Agaricomycotina > agaricomycotina.mtdb

Run CLOCI, rooting on the most recent common ancestor (MRCA) of two input genomes

cloci -d agaricomycotina.mtdb --root "<OME1>,<OME2>"

Resume a CLOCI run, e.g. to add proxy thresholds or to resume following an error

cloci -d agaricomycotina.mtdb -r <ROOT_OME> -o <PREVIOUS_DIR>



ON THE ALGORITHM

Pipeline

[Figure: CLOCI pipeline]

Recovery of 68 reference clusters

[Figure: recovery of 68 reference clusters]

Boundary assessment of 33 reference clusters

[Figure: boundary assessment of 33 reference clusters]

Common Errors

Memory error

OSError: [Errno 12] Cannot allocate memory

Simply resume by specifying the run output directory in your command via -o <PREVIOUS_OUTPUT_DIR>.
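Because resuming is just a rerun of the same command, it can be wrapped in a small retry loop. The sketch below assembles the resume command from the flags shown in the example section (-d, -r, -o) and reruns it a bounded number of times. It assumes that cloci exits with a non-zero status on failure, which is not confirmed here. The function names, the genome code `ome1`, and the directory `cloci_out` are hypothetical, and the demonstration uses a stand-in runner so that nothing is actually executed.

```python
import subprocess
from types import SimpleNamespace


def build_resume_command(db_path, root_ome, previous_dir):
    """Mirror the resume example: cloci -d <DB> -r <ROOT_OME> -o <PREVIOUS_DIR>."""
    return ["cloci", "-d", db_path, "-r", root_ome, "-o", previous_dir]


def run_until_success(command, max_attempts=3, runner=subprocess.run):
    """Rerun the command until it exits with status 0; return the attempt count.

    Assumption: a failed run (e.g. a memory error) yields a non-zero exit status.
    """
    for attempt in range(1, max_attempts + 1):
        result = runner(command)
        if result.returncode == 0:
            return attempt
    raise RuntimeError(f"command still failing after {max_attempts} attempts")


# Stand-in runner for a dry run: fails twice, then succeeds.
_outcomes = iter([1, 1, 0])


def fake_runner(command):
    return SimpleNamespace(returncode=next(_outcomes))


command = build_resume_command("agaricomycotina.mtdb", "ome1", "cloci_out")
attempts_used = run_until_success(command, max_attempts=5, runner=fake_runner)
print(attempts_used)  # 3
```

Dropping the `runner` argument would call the real command through `subprocess.run`. The attempt cap matters because an error that is not transient will simply recur on every rerun.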

Single-copy gene detection

ERROR: could not detect 10 genes present in all genomes with median 2 copy number and less than 2 copy number standard deviation. Manually input focal homology groups.

Near single-copy genes could not be automatically determined from the dataset. It is recommended to manually input a list of focal homology groups/genes via -f.
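If you need to pick focal homology groups yourself, one possible approach is to filter a copy-number table by the criteria named in the error message. The sketch below is an assumption-laden illustration, not CLOCI's internal routine: the input layout, the function name, the default cutoffs (read loosely from the message), and the use of a population standard deviation are all my own choices.

```python
from statistics import median, pstdev


def near_single_copy(copy_numbers, max_median=2, max_std=2.0):
    """Select families present in every genome with low, stable copy number.

    copy_numbers: {family_id: {genome_id: copies}} (hypothetical layout);
    a genome missing from a family's entry counts as zero copies.
    """
    genomes = sorted({g for counts in copy_numbers.values() for g in counts})
    selected = []
    for family, counts in sorted(copy_numbers.items()):
        per_genome = [counts.get(genome, 0) for genome in genomes]
        if min(per_genome) < 1:
            continue  # absent from at least one genome
        if median(per_genome) <= max_median and pstdev(per_genome) < max_std:
            selected.append(family)
    return selected


# Made-up copy numbers for four homology groups across three genomes.
copy_numbers = {
    "hg1": {"ome1": 1, "ome2": 1, "ome3": 1},
    "hg2": {"ome1": 1, "ome2": 0, "ome3": 1},  # missing in ome2
    "hg3": {"ome1": 1, "ome2": 9, "ome3": 1},  # expanded in ome2
    "hg4": {"ome1": 2, "ome2": 1, "ome3": 1},
}

focal = near_single_copy(copy_numbers)
focal_file_text = "\n".join(focal) + "\n"  # one identifier per line, as -f expects
print(focal_file_text, end="")
```

Here `hg2` is dropped for being absent from one genome and `hg3` for its copy-number variance, leaving `hg1` and `hg4`. Saving `focal_file_text` to a file gives a line-separated list of the kind described for -f.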
