A comprehensive CLI tool for genomic similarity network fusion and analysis.

GeneCast: A Scalable and Lightweight Framework for High-Throughput Gene Family Analysis and Function Prediction

Authors: GeneCast Team from iZJU


Abstract

We introduce GeneCast, a scalable framework that overcomes computational bottlenecks in large-scale gene analysis. By fusing nucleotide and protein features via Similarity Network Fusion (SNF) and employing efficient hierarchical clustering, GeneCast achieves rapid, high-accuracy annotation transfer with low algorithmic complexity.

Introduction

Gene duplication and speciation have generated large numbers of homologous genes. Orthologs usually retain similar functions across species, while paralogs may diverge after duplication. For gene families with incomplete or inconsistent annotations, using sequence and evolutionary information to make systematic functional predictions remains challenging.

GeneCast proposes a gene family analysis framework based on vectorized sequence representations. By extracting features from DNA and protein sequences and computing their similarities, the framework identifies potential family structures, clusters related genes, and builds interpretable trees for functional inference. This lightweight and reproducible pipeline supports gene family classification and function prediction, offering a practical tool for comparative genomics and paralog studies.

Figure (workflow.png): Overview of the GeneCast workflow. The pipeline extracts features along two pathways (nucleotide and protein), fuses them via SNF, and then applies hierarchical clustering and dynamic tree cutting for functional annotation.


Project Structure

.
├── benchmark/               # Benchmarking scripts and datasets
│   ├── benchmark_report.ipynb
│   ├── dist.py              # Benchmarking distance calculation
│   ├── plot_benchmark_metrics.py
│   ├── snf.py               # Benchmarking SNF
│   ├── visualization.py     # Benchmarking visualization
│   └── ward.py              # Benchmarking clustering
├── data/                    # Input sequences (FASTA)
│   ├── Actin_paralogs/
│   ├── Customer/
│   └── ifn_seqs/
├── src/                     # Source code package
│   └── genecast/
│       ├── demodata/        # Demo data included in package
│       ├── __init__.py
│       ├── cli.py           # CLI entry point
│       ├── dist.py          # Distance matrix computation module
│       ├── report_generator.py # HTML report generation module
│       ├── snf.py           # Similarity Network Fusion implementation
│       ├── visualization.py # Plotting functions
│       └── ward.py          # Hierarchical clustering module
├── demo.sh                  # Shell script for running the demo
├── LICENSE
├── MANIFEST.in
├── pyproject.toml           # Build configuration
├── README.md
├── setup.py                 # Installation script
└── workflow.png             # Workflow diagram

Installation

Dependencies

GeneCast requires Python 3 (>=3.11) and the following libraries, which pip will automatically install:

  • numpy
  • pandas
  • scipy
  • scikit-learn
  • matplotlib
  • umap-learn
  • pycirclize

Setup

It is highly recommended to install GeneCast within a virtual environment.

  1. Enter the Repository:
    cd GeneCast # Change into the cloned project directory
    
  2. Create and Activate a Virtual Environment:
    conda create -n genecast python=3.11
    conda activate genecast
    # Or using venv (standard Python virtual environments):
    # python -m venv .venv
    # source .venv/bin/activate # On Windows: .venv\Scripts\activate
    
  3. Install the Package: GeneCast uses pyproject.toml for modern Python packaging. Install it in editable mode for development, or as a regular package:
    # Standard installation from PyPI (recommended).
    pip install genecast
    
    # For development (editable install; changes to the source code take effect immediately):
    # pip install -e .
    
    # If you cannot connect to PyPI, install the pre-built .whl package instead.
    pip install ./dist/genecast-0.1.2-py3-none-any.whl
    
    # To run from source, replace "genecast" with "python src/genecast/cli.py" in any command.
    # python src/genecast/cli.py --arg1 --arg2
    

Quick Start

GeneCast comes with a built-in demo command to run the full pipeline on internal test data (Actin gene family).

genecast demo

This will:

  1. Load the included actin_nuc.fa dataset.
  2. Run the complete pipeline (Dist -> SNF -> Ward -> Viz -> Report).
  3. Save results to output/demo_actin.

To run the full pipeline on your own data in one go:

genecast all --fasta "data/*.fa" --outdir results/my_analysis --prefix my_genes

Usage Workflow

The genecast CLI offers a modular approach. You can run the entire pipeline using genecast all or execute individual steps for finer control.

1. Data Preparation

GeneCast accepts multi-FASTA files (.fasta, .fa, .fna).

  • Input Requirement: Nucleotide Coding Sequences (CDS).
  • Integrity: Sequences should ideally begin with a start codon (e.g., ATG), end with a stop codon, and have a length divisible by three. The pipeline performs internal translation for protein feature extraction.
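The integrity criteria above can be checked before running the pipeline. The following is a minimal sketch, not part of GeneCast itself: `check_cds` is a hypothetical helper, and it assumes ATG as the only start codon and the standard stop codons TAA, TAG, and TGA.

```python
# Hypothetical pre-flight check; GeneCast's own validation may differ.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def check_cds(seq):
    """Return a list of integrity problems found in one coding sequence."""
    seq = seq.upper()
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length not divisible by three")
    if not seq.startswith("ATG"):  # assumption: ATG is the only accepted start
        problems.append("missing ATG start codon")
    if seq[-3:] not in STOP_CODONS:
        problems.append("missing stop codon")
    return problems


print(check_cds("ATGAAATAA"))  # a well-formed toy CDS
print(check_cds("ATGAAATA"))   # truncated: wrong length and no stop codon
```

Sequences that fail one of these checks are not necessarily unusable, since the requirements are stated as ideal rather than mandatory.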

2. Preprocessing (Feature Extraction)

Compute distance matrices for Nucleotide and Protein features separately.

Command: dist

# Step 2a: Nucleotide Processing (k-mer features)
genecast dist --fasta "data/*.fa*" --mode nuc --kmer 5 --outdir results --prefix dataset_nuc

# Step 2b: Protein Processing (Physicochemical properties)
genecast dist --fasta "data/*.fa*" --mode prot --win 3 --outdir results --prefix dataset_prot

Parameters:

  • --mode: nuc for nucleotide k-mers, prot for amino acid properties.
  • --kmer: Length of nucleotide k-mers (default: 7). Note: The paper suggests k=7 for larger datasets.
  • --win: Sliding window size for protein properties (default: 4). Note: The paper suggests w=4.

3. Similarity Network Fusion (SNF)

Fuse the nucleotide and protein distance matrices into a single similarity network.

Command: snf

genecast snf \
  --dist-matrices results/dataset_nuc_dist.csv results/dataset_prot_dist.csv \
  --output-file results/fused_similarity.csv \
  --K-values 10 20 40 \
  --t-iter 20

Parameters:

  • --K-values: List of K neighbors for multi-scale SNF (default: 10 20 40).
  • --t-iter: Number of diffusion iterations (default: 20).

4. Clustering & Tree Construction

Perform Ward hierarchical clustering and estimate the optimal number of clusters ($k^*$) using the Eigengap Heuristic.

Command: ward

genecast ward \
  --input results/fused_similarity.csv \
  --labels results/dataset_nuc_labels.csv \
  --is-similarity \
  --outdir results \
  --prefix analysis

Parameters:

  • --is-similarity: Flag to indicate input is an SNF similarity matrix (not distance).
  • --max-k: Maximum $k$ to search for auto-estimation (default: 15).
  • --no-outlier: Disable outlier detection if desired.

5. Visualization & Reporting

Generate comprehensive plots (Heatmaps, Dendrograms, t-SNE) and an HTML report.

Command: viz

genecast viz \
  --nuc-dist results/dataset_nuc_dist.csv \
  --prot-dist results/dataset_prot_dist.csv \
  --fused-similarity results/fused_similarity.csv \
  --labels-path results/dataset_nuc_labels.csv \
  --outdir results/plots \
  --tree results/analysis_ward_clean.nwk

Command: report

genecast report --outdir results --prefix analysis
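When the individual steps are scripted rather than typed, the commands above can be assembled in Python. This is a sketch under assumptions: `build_commands` and `run_pipeline` are hypothetical helpers, only the flags shown in this section are used, and the intermediate file names (`<prefix>_dist.csv`, `<prefix>_labels.csv`, `analysis_ward_clean.nwk`) are inferred from the examples above rather than from a documented naming rule.

```python
import subprocess


def build_commands(fasta_glob, outdir="results", kmer=7, win=4):
    """Return the documented steps as argv lists (nothing is executed here)."""
    # Assumption: output file names follow the pattern used in the examples.
    nuc_dist = f"{outdir}/dataset_nuc_dist.csv"
    prot_dist = f"{outdir}/dataset_prot_dist.csv"
    labels = f"{outdir}/dataset_nuc_labels.csv"
    fused = f"{outdir}/fused_similarity.csv"
    return [
        ["genecast", "dist", "--fasta", fasta_glob, "--mode", "nuc",
         "--kmer", str(kmer), "--outdir", outdir, "--prefix", "dataset_nuc"],
        ["genecast", "dist", "--fasta", fasta_glob, "--mode", "prot",
         "--win", str(win), "--outdir", outdir, "--prefix", "dataset_prot"],
        ["genecast", "snf", "--dist-matrices", nuc_dist, prot_dist,
         "--output-file", fused, "--K-values", "10", "20", "40",
         "--t-iter", "20"],
        ["genecast", "ward", "--input", fused, "--labels", labels,
         "--is-similarity", "--outdir", outdir, "--prefix", "analysis"],
        ["genecast", "viz", "--nuc-dist", nuc_dist, "--prot-dist", prot_dist,
         "--fused-similarity", fused, "--labels-path", labels,
         "--outdir", f"{outdir}/plots",
         "--tree", f"{outdir}/analysis_ward_clean.nwk"],
        ["genecast", "report", "--outdir", outdir, "--prefix", "analysis"],
    ]


def run_pipeline(commands):
    """Run each step in order and stop at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    for argv in build_commands("data/*.fa*"):
        print(" ".join(argv))
```

The glob pattern is passed through unexpanded, which mirrors the quoted `"data/*.fa*"` in the shell examples. For a single uninterrupted run, `genecast all` remains the simpler choice.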

Key Algorithms

1. Feature Extraction

  • Nucleotide: Decomposes DNA into overlapping $k$-mers. Features are normalized frequency vectors.
  • Protein: Translates CDS to protein. Maps amino acids to physicochemical properties (Hydrophobicity, Volume, Charge). Averages these over a sliding window $w$ to form a property-based feature space.
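To make the two feature types concrete, here is a small sketch of each. It is an illustration, not the code in `dist.py`: the function names are hypothetical, the Kyte-Doolittle hydropathy scale stands in for whatever hydrophobicity values GeneCast uses, and only one of the three properties is shown.

```python
from itertools import product

import numpy as np


def kmer_frequencies(seq, k=5):
    """Normalized frequency vector over all 4**k nucleotide k-mers."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = np.zeros(len(index))
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip k-mers containing ambiguous bases such as N
            counts[j] += 1
    total = counts.sum()
    return counts / total if total else counts


# Assumption: Kyte-Doolittle hydropathy as an example property scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}


def windowed_property(protein, scale, w=4):
    """Average one per-residue property over a sliding window of width w."""
    values = np.array([scale[aa] for aa in protein.upper() if aa in scale])
    if len(values) < w:
        return np.array([])  # sequence shorter than the window
    return np.convolve(values, np.ones(w) / w, mode="valid")


print(kmer_frequencies("ACGTACGT", k=2).sum())
print(windowed_property("AILV", KYTE_DOOLITTLE, w=2))
```

The windowed profile still varies in length with the protein; how GeneCast turns it into a fixed-length feature vector is not specified here, so the sketch stops at the profile.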

2. Distance Metric

We compute pairwise Cosine Distance.

  • For nucleotides, we apply a squared transformation ($D = 1 - S^2$) to amplify divergence among closely related paralogs.
  • Mathematical definition: $$ D_{\text{nuc}}(\mathbf{A}, \mathbf{B}) = 1 - S(\mathbf{A}, \mathbf{B})^2 = 1 - \left( \frac{\mathbf{A} \cdot \mathbf{B}}{|\mathbf{A}| |\mathbf{B}|} \right)^2 $$
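The squared transformation can be computed for all sequence pairs at once. The sketch below assumes the feature vectors are stacked as rows of a matrix; `squared_cosine_distance` is a hypothetical name, not a GeneCast function.

```python
import numpy as np


def squared_cosine_distance(X):
    """Pairwise D = 1 - S**2, where S is the cosine similarity of the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    U = X / np.maximum(norms, 1e-12)  # guard against all-zero rows
    S = np.clip(U @ U.T, -1.0, 1.0)   # cosine similarity matrix
    D = 1.0 - S ** 2
    np.fill_diagonal(D, 0.0)          # remove rounding noise on the diagonal
    return D


X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(squared_cosine_distance(X))
```

For the toy input, orthogonal rows are at distance 1, and rows at 45 degrees are at distance 0.5 because their cosine similarity is $1/\sqrt{2}$.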

3. Multi-scale Similarity Network Fusion (SNF)

Adapts the SNF framework (Wang et al., 2014) for multi-omics sequences.

  • Metric: Uses Cosine distance instead of Euclidean.
  • Multi-scale: Executes cross-diffusion across a spectrum of $K$ values (e.g., 10, 20, 40) and averages the result to minimize parameter bias.
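A simplified sketch of multi-scale fusion follows. It keeps the cross-diffusion structure of Wang et al. (2014) but is not the implementation in `snf.py`: the function names are hypothetical, the affinity kernel `exp(-D / mean(D))` is an assumption chosen for brevity, neighbors exclude the point itself, and at least two distance matrices are required.

```python
import numpy as np


def _full_kernel(W):
    """Row-normalize off-diagonal affinities and put 1/2 on the diagonal."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)
    P = W / (2.0 * np.maximum(W.sum(axis=1, keepdims=True), 1e-12))
    np.fill_diagonal(P, 0.5)
    return P


def _local_kernel(W, K):
    """Keep each row's K strongest off-diagonal affinities, then row-normalize."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)
    idx = np.argsort(-W, axis=1)[:, :K]
    rows = np.arange(W.shape[0])[:, None]
    S = np.zeros_like(W)
    S[rows, idx] = W[rows, idx]
    return S / np.maximum(S.sum(axis=1, keepdims=True), 1e-12)


def snf_single(affinities, K, t_iter):
    """Cross-diffuse two or more views for one neighborhood size K."""
    P = [_full_kernel(W) for W in affinities]
    S = [_local_kernel(W, K) for W in affinities]
    for _ in range(t_iter):
        updated = []
        for v in range(len(P)):
            others = np.mean([P[u] for u in range(len(P)) if u != v], axis=0)
            updated.append(_full_kernel(S[v] @ others @ S[v].T))
        P = updated
    fused = np.mean(P, axis=0)
    return (fused + fused.T) / 2.0  # enforce symmetry


def multiscale_snf(dist_matrices, K_values=(10, 20, 40), t_iter=20):
    """Average the fused networks obtained for several values of K."""
    # Assumption: a simple exponential kernel turns distances into affinities.
    affinities = [np.exp(-D / max(D.mean(), 1e-12)) for D in dist_matrices]
    n = affinities[0].shape[0]
    fused = [snf_single(affinities, min(K, n - 1), t_iter) for K in K_values]
    return np.mean(fused, axis=0)
```

Averaging over `K_values` means no single neighborhood size determines the result, which is the motivation given above for the multi-scale variant.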

4. Outlier Detection

Calculates an anomaly score based on the average distance to $k$ nearest neighbors. Uses Tukey's fence ($Q_3 + \alpha \cdot \mathrm{IQR}$) to flag and exclude noise.
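The scoring and fencing steps can be sketched as follows. This is an illustration under assumptions: `knn_outlier_mask` is a hypothetical name, the defaults `k=5` and `alpha=1.5` are conventional choices rather than GeneCast's documented values, and the distance matrix is assumed to have a zero diagonal.

```python
import numpy as np


def knn_outlier_mask(D, k=5, alpha=1.5):
    """Score each item by its mean distance to its k nearest neighbors,
    then flag scores above the upper Tukey fence Q3 + alpha * IQR."""
    D = np.asarray(D, dtype=float)
    k = min(k, D.shape[0] - 1)
    sorted_d = np.sort(D, axis=1)  # column 0 holds the zero self-distance
    scores = sorted_d[:, 1:k + 1].mean(axis=1)
    q1, q3 = np.percentile(scores, [25, 75])
    threshold = q3 + alpha * (q3 - q1)
    return scores, scores > threshold


points = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 10.0])
D = np.abs(points[:, None] - points[None, :])
scores, mask = knn_outlier_mask(D, k=2)
print(mask)
```

In the toy example only the last point, which sits far from the rest, exceeds the fence.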

5. Optimal Cluster Number Estimation

Uses the Eigengap Heuristic on the Laplacian eigenvalues of the fused network. We extract the leading eigenvalues $0 \approx \lambda_1 \le \dots \le \lambda_{k_{max}}$ of the normalized Laplacian. The optimal number of clusters is the $k$ that maximizes the gap between consecutive eigenvalues: $$ k^* = \operatorname*{argmax}_{2 \le k < k_{max}} (\lambda_{k+1} - \lambda_k) $$
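A minimal sketch of the eigengap selection, assuming a symmetric, non-negative similarity matrix as input. `estimate_k` is a hypothetical name, and the symmetric normalized Laplacian $L = I - D^{-1/2} W D^{-1/2}$ is one common choice of normalization; GeneCast's exact variant may differ.

```python
import numpy as np


def estimate_k(W, max_k=15):
    """Pick k maximizing lambda_{k+1} - lambda_k for 2 <= k < max_k."""
    W = np.asarray(W, dtype=float)
    W = (W + W.T) / 2.0
    n = W.shape[0]
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    L = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals = np.linalg.eigvalsh(L)  # returned in ascending order
    max_k = min(max_k, n)
    if max_k < 3:
        raise ValueError("need at least three eigenvalues to search k >= 2")
    gaps = np.diff(eigvals[:max_k])  # gaps[j] = lambda_{j+2} - lambda_{j+1}
    return 2 + int(np.argmax(gaps[1:]))  # gaps[1] corresponds to k = 2


block = np.repeat([0, 1, 2], 4)
W = np.where(block[:, None] == block[None, :], 1.0, 0.01)
print(estimate_k(W, max_k=6))
```

The toy matrix has three tightly connected blocks, so three eigenvalues lie near zero and the largest gap follows the third.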

6. Hierarchical Clustering

Constructs a dendrogram using Ward's minimum variance method on a distance matrix derived from the fused similarity network. The tree is then cut into $k^*$ clusters.
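With SciPy, this step can be sketched as below. The conversion from similarity to distance (largest off-diagonal similarity minus each entry) is an assumption made for illustration, and `ward_clusters` is a hypothetical name rather than a function in `ward.py`.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def ward_clusters(similarity, n_clusters):
    """Ward linkage on a similarity-derived distance, cut into n_clusters."""
    S = np.asarray(similarity, dtype=float)
    S = (S + S.T) / 2.0
    off_diag = S[~np.eye(S.shape[0], dtype=bool)]
    # Assumption: distance = largest off-diagonal similarity minus similarity.
    D = np.clip(off_diag.max() - S, 0.0, None)
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="ward")
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    return Z, labels


block = np.array([0, 0, 0, 1, 1, 1])
S = np.where(block[:, None] == block[None, :], 0.9, 0.1)
np.fill_diagonal(S, 1.0)
Z, labels = ward_clusters(S, n_clusters=2)
print(labels)
```

Ward's criterion is defined for Euclidean distances, so applying it to a similarity-derived distance is a heuristic; the linkage matrix `Z` can be passed to `scipy.cluster.hierarchy.dendrogram` for plotting.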


Technical Details

Performance

The framework is designed to be lightweight.

  • Complexity: Feature extraction is linear in sequence length. SNF and clustering are implemented as vectorized matrix operations (using numpy/scipy).
  • Running Time: Depends on dataset size ($N$). SNF scales roughly as $O(N^2)$ but remains efficient for typical gene family sizes ($N < 5000$).

Accuracy Measures

  • External Validation: Adjusted Rand Index (ARI), Normalized Mutual Information (NMI).
  • Internal Validation: Eigengap size, Silhouette scores (in reports).
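These measures are available in scikit-learn, which is already a dependency. The sketch below shows one way to compute them for a set of predicted cluster labels; `evaluate_clustering` is a hypothetical helper, and the silhouette score assumes a precomputed distance matrix with a zero diagonal.

```python
import numpy as np
from sklearn.metrics import (
    adjusted_rand_score,
    normalized_mutual_info_score,
    silhouette_score,
)


def evaluate_clustering(true_labels, pred_labels, dist_matrix=None):
    """External scores (ARI, NMI) plus an optional internal silhouette score."""
    scores = {
        "ARI": adjusted_rand_score(true_labels, pred_labels),
        "NMI": normalized_mutual_info_score(true_labels, pred_labels),
    }
    if dist_matrix is not None:
        scores["silhouette"] = silhouette_score(
            dist_matrix, pred_labels, metric="precomputed"
        )
    return scores


points = np.array([0.0, 0.1, 5.0, 5.1])
D = np.abs(points[:, None] - points[None, :])
print(evaluate_clustering([0, 0, 1, 1], [1, 1, 0, 0], D))
```

ARI and NMI ignore how clusters are numbered, so a prediction that swaps the two label names still scores as a perfect match.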

References

  1. SNF: Wang, B., et al. (2014). Similarity network fusion for aggregating data types on a genomic scale. Nature methods.
  2. Ward's Method: Ward Jr, J. H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American statistical association.
  3. Genecast Team: 3016, 3087, 3092, 3143, 3025.
