A comprehensive CLI tool for genomic similarity network fusion and analysis.
GeneCast: A Scalable and Lightweight Framework for High-Throughput Gene Family Analysis and Function Prediction
Authors: GeneCast Team from iZJU
Abstract
We introduce GeneCast, a scalable framework that overcomes computational bottlenecks in large-scale gene analysis. By fusing nucleotide and protein features via Similarity Network Fusion (SNF) and employing efficient hierarchical clustering, GeneCast achieves rapid, high-accuracy annotation transfer with low algorithmic complexity.
Table of Contents
- Introduction
- Project Structure
- Installation
- Quick Start
- Usage Workflow
- Input/Output
- Key Algorithms
- Technical Details
Introduction
Gene duplication and speciation have generated large numbers of homologous genes. Orthologs usually keep similar functions across species, while paralogs may diverge after duplication. For gene families with incomplete or inconsistent annotations, using sequence and evolutionary information to make systematic functional predictions remains challenging.
GeneCast proposes a gene family analysis framework based on vectorized sequence representations. By extracting features from DNA and protein sequences and computing their similarities, the framework identifies potential family structures, clusters related genes, and builds interpretable trees for functional inference. This lightweight and reproducible pipeline supports gene family classification and function prediction, offering a practical tool for comparative genomics and paralog studies.
> Overview of the GeneCast Workflow: The pipeline integrates dual-pathway feature extraction (nucleotide and protein) via SNF, followed by hierarchical clustering and dynamic tree cutting for functional annotation.
Project Structure
.
├── benchmark/                    # Benchmarking scripts and datasets
│   ├── benchmark_report.ipynb
│   ├── dist.py                   # Benchmarking distance calculation
│   ├── plot_benchmark_metrics.py
│   ├── snf.py                    # Benchmarking SNF
│   ├── visualization.py          # Benchmarking visualization
│   └── ward.py                   # Benchmarking clustering
├── data/                         # Input sequences (FASTA)
│   ├── Actin_paralogs/
│   ├── Customer/
│   └── ifn_seqs/
├── src/                          # Source code package
│   └── genecast/
│       ├── demodata/             # Demo data included in package
│       ├── __init__.py
│       ├── cli.py                # CLI entry point
│       ├── dist.py               # Distance matrix computation module
│       ├── report_generator.py   # HTML report generation module
│       ├── snf.py                # Similarity Network Fusion implementation
│       ├── visualization.py      # Plotting functions
│       └── ward.py               # Hierarchical clustering module
├── demo.sh                       # Shell script for running the demo
├── LICENSE
├── MANIFEST.in
├── pyproject.toml                # Build configuration
├── README.md
├── setup.py                      # Installation script
└── workflow.png                  # Workflow diagram
Installation
Dependencies
GeneCast requires Python 3 (>=3.11) and the following libraries, which pip will automatically install:
- numpy
- pandas
- scipy
- scikit-learn
- matplotlib
- umap-learn
- pycirclize
Setup
It is highly recommended to install GeneCast within a virtual environment.
- Get into the Repository:
cd GeneCast # Change into the cloned project directory
- Create and Activate a Virtual Environment:
conda create -n genecast python=3.11
conda activate genecast
# Or using venv (standard Python virtual environments):
# python -m venv .venv
# source .venv/bin/activate # On Windows: .venv\Scripts\activate
- Install the Package:
GeneCast uses pyproject.toml for modern Python packaging. Install it as a regular package, or in editable mode for development:
# Standard installation from PyPI (recommended)
pip install genecast
# For development (editable install; changes to the source code take effect immediately)
# pip install -e .
# If you have trouble connecting to PyPI, install the pre-built .whl package instead
pip install ./dist/genecast-0.1.2-py3-none-any.whl
# To run from source, replace "genecast" with "python src/genecast/cli.py" in any command
# python src/genecast/cli.py --arg1 --arg2
Quick Start
GeneCast comes with a built-in demo command to run the full pipeline on internal test data (Actin gene family).
genecast demo
This will:
- Load the included actin_nuc.fa dataset.
- Run the complete pipeline (Dist -> SNF -> Ward -> Viz -> Report).
- Save results to output/demo_actin.
To run the full pipeline on your own data in one go:
genecast all --fasta "data/*.fa" --outdir results/my_analysis --prefix my_genes
Usage Workflow
The genecast CLI offers a modular approach. You can run the entire pipeline using genecast all or execute individual steps for finer control.
1. Data Preparation
GeneCast accepts multi-FASTA files (.fasta, .fa, .fna).
- Input Requirement: Nucleotide Coding Sequences (CDS).
- Integrity: Sequences should ideally begin with a start codon (e.g., ATG), end with a stop codon, and have a length divisible by three. The pipeline performs internal translation for protein feature extraction.
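The integrity criteria above can be checked before running the pipeline. The following is a minimal sketch, not part of the genecast package: the function name check_cds is hypothetical, and it assumes the standard genetic code (ATG start; TAA, TAG, TGA stops), which may be too strict for organisms that use alternative start codons.

```python
# Hypothetical pre-flight check; not a genecast API.
# Assumption: standard genetic code stop codons and an ATG start codon.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_cds(seq: str) -> list[str]:
    """Return a list of integrity problems for one coding sequence (empty if none)."""
    seq = seq.upper()
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length not divisible by three")
    if not seq.startswith("ATG"):
        problems.append("missing ATG start codon")
    if seq[-3:] not in STOP_CODONS:
        problems.append("missing stop codon")
    return problems
```

Because the README says sequences should only "ideally" meet these criteria, a check like this is better used to warn about records than to reject them.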
2. Preprocessing (Feature Extraction)
Compute distance matrices for Nucleotide and Protein features separately.
Command: dist
# Step 2a: Nucleotide Processing (k-mer features)
genecast dist --fasta "data/*.fa*" --mode nuc --kmer 5 --outdir results --prefix dataset_nuc
# Step 2b: Protein Processing (Physicochemical properties)
genecast dist --fasta "data/*.fa*" --mode prot --win 3 --outdir results --prefix dataset_prot
Parameters:
- --mode: nuc for nucleotide k-mers, prot for amino acid properties.
- --kmer: Length of nucleotide k-mers (default: 7). Note: The paper suggests k=7 for larger datasets.
- --win: Sliding window size for protein properties (default: 4). Note: The paper suggests w=4.
3. Similarity Network Fusion (SNF)
Fuse the nucleotide and protein distance matrices into a single similarity network.
Command: snf
genecast snf \
--dist-matrices results/dataset_nuc_dist.csv results/dataset_prot_dist.csv \
--output-file results/fused_similarity.csv \
--K-values 10 20 40 \
--t-iter 20
Parameters:
- --K-values: List of K neighbors for multi-scale SNF (default: 10 20 40).
- --t-iter: Number of diffusion iterations (default: 20).
4. Clustering & Tree Construction
Perform Ward hierarchical clustering and estimate the optimal number of clusters ($k^*$) using the Eigengap Heuristic.
Command: ward
genecast ward \
--input results/fused_similarity.csv \
--labels results/dataset_nuc_labels.csv \
--is-similarity \
--outdir results \
--prefix analysis
Parameters:
- --is-similarity: Flag to indicate input is an SNF similarity matrix (not distance).
- --max-k: Maximum $k$ to search for auto-estimation (default: 15).
- --no-outlier: Disable outlier detection if desired.
5. Visualization & Reporting
Generate comprehensive plots (Heatmaps, Dendrograms, t-SNE) and an HTML report.
Command: viz
genecast viz \
--nuc-dist results/dataset_nuc_dist.csv \
--prot-dist results/dataset_prot_dist.csv \
--fused-similarity results/fused_similarity.csv \
--labels-path results/dataset_nuc_labels.csv \
--outdir results/plots \
--tree results/analysis_ward_clean.nwk
Command: report
genecast report --outdir results --prefix analysis
Key Algorithms
1. Feature Extraction
- Nucleotide: Decomposes DNA into overlapping $k$-mers. Features are normalized frequency vectors.
- Protein: Translates CDS to protein. Maps amino acids to physicochemical properties (Hydrophobicity, Volume, Charge). Averages these over a sliding window $w$ to form a property-based feature space.
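The two feature types described above can be sketched as follows. This is an illustration under stated assumptions rather than the code in genecast's dist.py: the function names kmer_frequencies and windowed_mean are hypothetical, k-mers containing ambiguous bases are simply skipped, and how GeneCast turns windowed property profiles of different lengths into a common feature space is not specified here.

```python
from itertools import product

import numpy as np


def kmer_frequencies(seq: str, k: int = 3) -> np.ndarray:
    """Normalized frequency vector over all 4**k overlapping DNA k-mers."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = np.zeros(len(index))
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # None for k-mers with ambiguous bases
        if j is not None:
            counts[j] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def windowed_mean(values, w: int) -> np.ndarray:
    """Average a per-residue property profile over a sliding window of size w."""
    values = np.asarray(values, dtype=float)
    return np.convolve(values, np.ones(w) / w, mode="valid")
```

For the protein pathway, the per-residue input to windowed_mean would be one physicochemical property (for example hydrophobicity) looked up for each amino acid; the property scales GeneCast uses are not listed in this document.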
2. Distance Metric
We compute pairwise Cosine Distance.
- For nucleotides, we apply a squared transformation ($D = 1 - S^2$) to amplify divergence among closely related paralogs.
- Mathematical definition: $$ D_{\text{nuc}}(\mathbf{A}, \mathbf{B}) = 1 - S(\mathbf{A}, \mathbf{B})^2 = 1 - \left( \frac{\mathbf{A} \cdot \mathbf{B}}{|\mathbf{A}| |\mathbf{B}|} \right)^2 $$
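To see why squaring amplifies small differences, note that $1 - S^2 = (1 - S)(1 + S)$: for two paralogs with $S = 0.95$, the plain cosine distance is $0.05$ while the squared form gives $1 - 0.9025 = 0.0975$, nearly double. A minimal NumPy sketch of the formula above follows; the function name squared_cosine_distance is hypothetical and the handling of zero vectors is an assumption.

```python
import numpy as np


def squared_cosine_distance(X: np.ndarray) -> np.ndarray:
    """Pairwise D = 1 - S**2, where S is the cosine similarity between rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.clip(norms, 1e-12, None)  # assumption: guard against zero vectors
    S = unit @ unit.T
    D = 1.0 - S ** 2
    np.fill_diagonal(D, 0.0)  # remove round-off on the self-distances
    return np.clip(D, 0.0, 1.0)
```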
3. Multi-scale Similarity Network Fusion (SNF)
Adapts the SNF framework (Wang et al., 2014) for multi-omics sequences.
- Metric: Uses Cosine distance instead of Euclidean.
- Multi-scale: Executes cross-diffusion across a spectrum of $K$ values (e.g., 10, 20, 40) and averages the result to minimize parameter bias.
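The cross-diffusion and multi-scale averaging can be sketched as below. This follows the general scheme of Wang et al. (2014) but is not the code in genecast's snf.py: all function names, the kernel parameter mu=0.5, and the per-iteration row normalization are assumptions made for illustration, and it expects at least two distance matrices of the same shape.

```python
import numpy as np


def affinity(D, K, mu=0.5):
    """Scaled exponential kernel built from a distance matrix."""
    knn_mean = np.sort(D, axis=1)[:, 1:K + 1].mean(axis=1)  # column 0 is self
    eps = np.maximum((knn_mean[:, None] + knn_mean[None, :] + D) / 3.0, 1e-12)
    return np.exp(-(D ** 2) / (mu * eps))


def full_kernel(W):
    """Row-normalized global kernel with 1/2 on the diagonal."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)
    P = W / (2.0 * W.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P


def local_kernel(W, K):
    """Row-normalized kernel restricted to each row's K strongest neighbors."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)
    idx = np.argsort(-W, axis=1)[:, :K]
    rows = np.arange(W.shape[0])[:, None]
    S = np.zeros_like(W)
    S[rows, idx] = W[rows, idx]
    return S / S.sum(axis=1, keepdims=True)


def snf(dists, K=20, t=20):
    """Fuse two or more distance matrices at a single neighborhood size K."""
    K = min(K, dists[0].shape[0] - 1)
    Ws = [affinity(D, K) for D in dists]
    Ps = [full_kernel(W) for W in Ws]
    Ss = [local_kernel(W, K) for W in Ws]
    for _ in range(t):
        updated = []
        for v, S in enumerate(Ss):
            # Diffuse the average of the other views through this view's local graph.
            others = np.mean([P for u, P in enumerate(Ps) if u != v], axis=0)
            P = S @ others @ S.T
            updated.append(P / P.sum(axis=1, keepdims=True))
        Ps = updated
    fused = np.mean(Ps, axis=0)
    return (fused + fused.T) / 2.0


def multiscale_snf(dists, Ks=(10, 20, 40), t=20):
    """Average the fused networks obtained at several K values."""
    return np.mean([snf(dists, K, t) for K in Ks], axis=0)
```

Averaging over several K values means no single neighborhood size dominates the result, which is the parameter-bias reduction described above.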
4. Outlier Detection
Calculates an anomaly score based on the average distance to $k$ nearest neighbors. Uses Tukey's fence ($Q_3 + \alpha \cdot \mathrm{IQR}$) to flag and exclude noise.
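A minimal sketch of this scoring rule is shown below. The function name knn_outliers and the defaults k=5 and alpha=1.5 (the conventional Tukey multiplier) are assumptions; the values GeneCast actually uses are not stated in this document.

```python
import numpy as np


def knn_outliers(D: np.ndarray, k: int = 5, alpha: float = 1.5) -> np.ndarray:
    """Flag rows whose mean distance to their k nearest neighbors exceeds Q3 + alpha * IQR."""
    k = min(k, D.shape[0] - 1)
    # Column 0 of each sorted row is the zero self-distance, so skip it.
    scores = np.sort(D, axis=1)[:, 1:k + 1].mean(axis=1)
    q1, q3 = np.percentile(scores, [25, 75])
    fence = q3 + alpha * (q3 - q1)
    return scores > fence
```

The returned boolean mask can be used to drop flagged sequences from the matrix before clustering.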
5. Optimal Cluster Number Estimation
Uses the Eigengap Heuristic on the Laplacian eigenvalues of the fused network. We extract the leading eigenvalues $0 \approx \lambda_1 \le \dots \le \lambda_{k_{max}}$ of the normalized Laplacian. The optimal number of clusters is the $k$ that maximizes the gap between consecutive eigenvalues: $$ k^* = \operatorname*{argmax}_{2 \le k < k_{max}} (\lambda_{k+1} - \lambda_k) $$
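The heuristic can be sketched in a few lines of NumPy. This is an illustration, not the genecast implementation: the function name eigengap_k is hypothetical, and the symmetric normalized Laplacian $L = I - D^{-1/2} W D^{-1/2}$ is assumed as the specific normalization.

```python
import numpy as np


def eigengap_k(W: np.ndarray, max_k: int = 15) -> int:
    """Return the k in [2, max_k) with the largest gap lambda_{k+1} - lambda_k."""
    W = (W + W.T) / 2.0
    d = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(W.shape[0]) - d[:, None] * W * d[None, :]
    max_k = min(max_k, W.shape[0])
    lam = np.linalg.eigvalsh(L)[:max_k]  # ascending: lam[0] is lambda_1
    ks = np.arange(2, max_k)
    gaps = lam[ks] - lam[ks - 1]  # lam[k] - lam[k-1] is lambda_{k+1} - lambda_k
    return int(ks[np.argmax(gaps)])
```

Since the search starts at $k = 2$, this rule never reports a single cluster, so a dataset with no real family structure still receives a split.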
6. Hierarchical Clustering
Constructs a dendrogram using Ward's minimum variance method on the fused distance matrix. The tree is partitioned into $k^*$ clusters.
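With SciPy, this step might look like the sketch below. The function name ward_clusters is hypothetical, and the similarity-to-distance conversion (maximum similarity minus similarity) is an assumption, since the conversion GeneCast applies is not described here. SciPy documents Ward linkage as correctly defined only for Euclidean distances, so applying it to a converted similarity matrix is a pragmatic approximation.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def ward_clusters(similarity: np.ndarray, k: int) -> np.ndarray:
    """Cut a Ward dendrogram built from a similarity matrix into at most k clusters."""
    S = (similarity + similarity.T) / 2.0
    D = S.max() - S  # assumption: simple similarity-to-distance conversion
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="ward")
    return fcluster(Z, t=k, criterion="maxclust")  # labels start at 1
```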
Technical Details
Performance
The framework is designed to be lightweight.
- Complexity: Feature extraction is linear with sequence length. SNF and clustering operations are optimized for matrix operations (using numpy/scipy).
- Running Time: Dependent on dataset size ($N$). SNF is roughly $O(N^2)$, but efficient for typical gene family sizes ($N < 5000$).
Accuracy Measures
- External Validation: Adjusted Rand Index (ARI), Normalized Mutual Information (NMI).
- Internal Validation: Eigengap size, Silhouette scores (in reports).
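When reference family labels are available, the external measures can be computed with scikit-learn, which is already among the listed dependencies. The helper name external_scores is hypothetical and only wraps the two library functions.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score


def external_scores(true_labels, predicted_labels) -> dict:
    """ARI and NMI between reference labels and predicted cluster labels."""
    return {
        "ari": adjusted_rand_score(true_labels, predicted_labels),
        "nmi": normalized_mutual_info_score(true_labels, predicted_labels),
    }
```

Both scores ignore how clusters are numbered, so a prediction that recovers the reference grouping under different cluster IDs still scores 1.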
References
- SNF: Wang, B., et al. (2014). Similarity network fusion for aggregating data types on a genomic scale. Nature Methods.
- Ward's Method: Ward Jr, J. H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association.
- Genecast Team: 3016, 3087, 3092, 3143, 3025.