A comprehensive CLI tool for genomic similarity network fusion and analysis.

GeneCast: A Scalable and Lightweight Framework for High-Throughput Gene Family Analysis and Function Prediction

Authors: GeneCast Team from iZJU

Abstract

We introduce GeneCast, a scalable framework that overcomes computational bottlenecks in large-scale gene analysis. By fusing nucleotide and protein features via Similarity Network Fusion (SNF) and employing efficient hierarchical clustering, GeneCast achieves rapid, high-accuracy annotation transfer with low algorithmic complexity.

Table of Contents

Introduction
Project Structure
Installation
Quick Start
Usage Workflow
Input/Output
Key Algorithms
Technical Details

Introduction

Gene duplication and speciation have generated large numbers of homologous genes. Orthologs usually retain similar functions across species, while paralogs may diverge after duplication. For gene families with incomplete or inconsistent annotations, making systematic functional predictions from sequence and evolutionary information remains challenging.

GeneCast proposes a gene family analysis framework based on vectorized sequence representations. By extracting features from DNA and protein sequences and computing their similarities, the framework identifies potential family structures, clusters related genes, and builds interpretable trees for functional inference. This lightweight and reproducible pipeline supports gene family classification and function prediction, offering a practical tool for comparative genomics and paralog studies.

> Overview of the GeneCast Workflow: The pipeline integrates dual-pathway feature extraction (nucleotide and protein) via SNF, followed by hierarchical clustering and dynamic tree cutting for functional annotation.

Project Structure

.
├── benchmark/               # Benchmarking scripts and datasets
│   ├── benchmark_report.ipynb
│   ├── dist.py              # Benchmarking distance calculation
│   ├── plot_benchmark_metrics.py
│   ├── snf.py               # Benchmarking SNF
│   ├── visualization.py     # Benchmarking visualization
│   └── ward.py              # Benchmarking clustering
├── data/                    # Input sequences (FASTA)
│   ├── Actin_paralogs/
│   ├── Customer/
│   └── ifn_seqs/
├── src/                     # Source code package
│   └── genecast/
│       ├── demodata/        # Demo data included in package
│       ├── __init__.py
│       ├── cli.py           # CLI entry point
│       ├── dist.py          # Distance matrix computation module
│       ├── report_generator.py # HTML report generation module
│       ├── snf.py           # Similarity Network Fusion implementation
│       ├── visualization.py # Plotting functions
│       └── ward.py          # Hierarchical clustering module
├── demo.sh                  # Shell script for running the demo
├── LICENSE
├── MANIFEST.in
├── pyproject.toml           # Build configuration
├── README.md
├── setup.py                 # Installation script
└── workflow.png             # Workflow diagram

Installation

Dependencies

GeneCast requires Python 3 (>=3.11) and the following libraries, which pip will automatically install:

numpy
pandas
scipy
scikit-learn
matplotlib
umap-learn
pycirclize

Setup

It is highly recommended to install GeneCast within a virtual environment.

Get into the Repository:

cd GeneCast # Change into the cloned project directory

Create and Activate a Virtual Environment:

conda create -n genecast python=3.11
conda activate genecast
# Or using venv (standard Python virtual environments):
# python -m venv .venv
# source .venv/bin/activate # On Windows: .venv\Scripts\activate

Install the Package: GeneCast uses pyproject.toml for modern Python packaging. Install it as a regular package, or in editable mode for development:

# Standard installation from PyPI (recommended).
pip install genecast

# For development (editable install; changes to the source code take effect immediately)
# pip install -e .

# If you cannot connect to PyPI, install the pre-built .whl package instead.
pip install ./dist/genecast-0.1.2-py3-none-any.whl

# To run from source, replace "genecast" with "python src/genecast/cli.py" in any command.
# python src/genecast/cli.py --arg1 --arg2

Quick Start

GeneCast comes with a built-in demo command to run the full pipeline on internal test data (Actin gene family).

genecast demo

This will:

Load the included actin_nuc.fa dataset.
Run the complete pipeline (Dist -> SNF -> Ward -> Viz -> Report).
Save results to output/demo_actin.

To run the full pipeline on your own data in one go:

genecast all --fasta "data/*.fa" --outdir results/my_analysis --prefix my_genes

Usage Workflow

The genecast CLI offers a modular approach. You can run the entire pipeline using genecast all or execute individual steps for finer control.

1. Data Preparation

GeneCast accepts multi-FASTA files (.fasta, .fa, .fna).

Input Requirement: Nucleotide Coding Sequences (CDS).
Integrity: Sequences should ideally begin with a start codon (e.g., ATG), end with a stop codon, and have a length divisible by three. The pipeline performs internal translation for protein feature extraction.
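
The integrity criteria above can be checked before running the pipeline. The following is a minimal sketch, not part of GeneCast: the function name `cds_problems` is hypothetical, and it assumes the standard genetic code (ATG start; TAA, TAG, TGA stops).

```python
# Assumption: standard genetic code; other translation tables use other codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_problems(seq):
    """Return a list of integrity problems found in one coding sequence."""
    seq = seq.upper()
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length not divisible by three")
    if not seq.startswith("ATG"):
        problems.append("missing ATG start codon")
    if seq[-3:] not in STOP_CODONS:
        problems.append("missing stop codon")
    return problems


if __name__ == "__main__":
    for name, seq in {"ok": "ATGAAATAA", "truncated": "ATGAAATA"}.items():
        print(name, cds_problems(seq) or "passes all checks")
```

Sequences that fail a check are not necessarily unusable, since the requirement is stated as "ideally", but flagging them early makes unexpected translations easier to trace.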

2. Preprocessing (Feature Extraction)

Compute distance matrices for Nucleotide and Protein features separately.

Command: dist

# Step 2a: Nucleotide Processing (k-mer features)
genecast dist --fasta "data/*.fa*" --mode nuc --kmer 5 --outdir results --prefix dataset_nuc

# Step 2b: Protein Processing (Physicochemical properties)
genecast dist --fasta "data/*.fa*" --mode prot --win 3 --outdir results --prefix dataset_prot

Parameters:

--mode: nuc for nucleotide k-mers, prot for amino acid properties.
--kmer: Length of nucleotide k-mers (default: 7). Note: The paper suggests k=7 for larger datasets.
--win: Sliding window size for protein properties (default: 4). Note: The paper suggests w=4.

3. Similarity Network Fusion (SNF)

Fuse the nucleotide and protein distance matrices into a single similarity network.

Command: snf

genecast snf \
  --dist-matrices results/dataset_nuc_dist.csv results/dataset_prot_dist.csv \
  --output-file results/fused_similarity.csv \
  --K-values 10 20 40 \
  --t-iter 20

Parameters:

--K-values: List of K neighbors for multi-scale SNF (default: 10 20 40).
--t-iter: Number of diffusion iterations (default: 20).

4. Clustering & Tree Construction

Perform Ward hierarchical clustering and estimate the optimal number of clusters ($k^*$) using the Eigengap Heuristic.

Command: ward

genecast ward \
  --input results/fused_similarity.csv \
  --labels results/dataset_nuc_labels.csv \
  --is-similarity \
  --outdir results \
  --prefix analysis

Parameters:

--is-similarity: Flag to indicate input is an SNF similarity matrix (not distance).
--max-k: Maximum $k$ to search for auto-estimation (default: 15).
--no-outlier: Disable outlier detection if desired.

5. Visualization & Reporting

Generate comprehensive plots (Heatmaps, Dendrograms, t-SNE) and an HTML report.

Command: viz

genecast viz \
  --nuc-dist results/dataset_nuc_dist.csv \
  --prot-dist results/dataset_prot_dist.csv \
  --fused-similarity results/fused_similarity.csv \
  --labels-path results/dataset_nuc_labels.csv \
  --outdir results/plots \
  --tree results/analysis_ward_clean.nwk

Command: report

genecast report --outdir results --prefix analysis
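
For scripted runs, the modular steps above can be chained from Python. This is a rough sketch using only the subcommands and flags shown in this section; the helper names are hypothetical, and the output file names (`<prefix>_dist.csv`, `<prefix>_labels.csv`, `analysis_ward_clean.nwk`) are assumptions inferred from the example commands rather than a documented contract.

```python
import subprocess


def genecast_commands(fasta_glob, outdir, prefix):
    """Build the argument lists for the modular steps, in execution order."""
    nuc, prot = f"{prefix}_nuc", f"{prefix}_prot"
    # Assumed naming, inferred from the examples in this README.
    nuc_dist = f"{outdir}/{nuc}_dist.csv"
    prot_dist = f"{outdir}/{prot}_dist.csv"
    labels = f"{outdir}/{nuc}_labels.csv"
    fused = f"{outdir}/fused_similarity.csv"
    return [
        ["genecast", "dist", "--fasta", fasta_glob, "--mode", "nuc",
         "--outdir", outdir, "--prefix", nuc],
        ["genecast", "dist", "--fasta", fasta_glob, "--mode", "prot",
         "--outdir", outdir, "--prefix", prot],
        ["genecast", "snf", "--dist-matrices", nuc_dist, prot_dist,
         "--output-file", fused],
        ["genecast", "ward", "--input", fused, "--labels", labels,
         "--is-similarity", "--outdir", outdir, "--prefix", "analysis"],
        ["genecast", "viz", "--nuc-dist", nuc_dist, "--prot-dist", prot_dist,
         "--fused-similarity", fused, "--labels-path", labels,
         "--outdir", f"{outdir}/plots",
         "--tree", f"{outdir}/analysis_ward_clean.nwk"],
        ["genecast", "report", "--outdir", outdir, "--prefix", "analysis"],
    ]


def run_pipeline(fasta_glob, outdir="results", prefix="dataset"):
    """Run each step in order; check=True stops at the first failing step."""
    for cmd in genecast_commands(fasta_glob, outdir, prefix):
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline("data/*.fa*")` passes the glob pattern to genecast unexpanded, which matches the quoted patterns in the shell examples. For most uses, `genecast all` remains the simpler option.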

Key Algorithms

1. Feature Extraction

Nucleotide: Decomposes DNA into overlapping $k$-mers. Features are normalized frequency vectors.
Protein: Translates CDS to protein. Maps amino acids to physicochemical properties (Hydrophobicity, Volume, Charge). Averages these over a sliding window $w$ to form a property-based feature space.
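
To make the two feature types concrete, here is a small illustrative sketch. It is not GeneCast's implementation: the function names are hypothetical, only one property (hydrophobicity) is shown, and the Kyte-Doolittle scale is an assumed choice, since the scale actually used by the package is not stated here.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=3):
    """Normalized frequencies of overlapping k-mers, in a fixed ACGT order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in vocab)  # k-mers with N etc. are ignored
    return [counts[kmer] / total if total else 0.0 for kmer in vocab]


# Kyte-Doolittle hydropathy values (an assumed scale, for illustration only).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def windowed_property(protein, scale=KD, w=4):
    """Mean property value over each sliding window of width w."""
    values = [scale[aa] for aa in protein.upper() if aa in scale]
    return [sum(values[i:i + w]) / w for i in range(len(values) - w + 1)]


if __name__ == "__main__":
    print(len(kmer_frequencies("ATGGCCATTGTA", k=3)))  # 4**3 = 64 features
    print(windowed_property("MAIVL", w=4))
```

The k-mer vector has a fixed length of 4^k regardless of sequence length, which is what makes sequences directly comparable. The window averages still depend on protein length, so a further summarization step is needed to obtain fixed-length protein features; that step is not shown here.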

2. Distance Metric

We compute pairwise Cosine Distance.

For nucleotides, we apply a squared transformation ($D = 1 - S^2$) to amplify divergence among closely related paralogs.
Mathematical definition: $$ D_{\text{nuc}}(\mathbf{A}, \mathbf{B}) = 1 - S(\mathbf{A}, \mathbf{B})^2 = 1 - \left( \frac{\mathbf{A} \cdot \mathbf{B}}{\lVert \mathbf{A} \rVert \, \lVert \mathbf{B} \rVert} \right)^2 $$
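
The squared transformation is straightforward to express with numpy. The sketch below follows the formula above; the function name is hypothetical and the handling of all-zero feature rows is an assumption.

```python
import numpy as np


def squared_cosine_distance(X):
    """Pairwise D = 1 - S^2, where S is the cosine similarity of the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.where(norms == 0, 1.0, norms)  # assumption: zero rows stay zero
    S = np.clip(unit @ unit.T, -1.0, 1.0)        # guard against rounding error
    D = 1.0 - S ** 2
    np.fill_diagonal(D, 0.0)
    return D


if __name__ == "__main__":
    X = [[1, 0], [0, 1], [1, 1]]
    print(squared_cosine_distance(X))
```

As a worked example of the amplification: two sequences with cosine similarity $S = 0.95$ have a plain cosine distance of $0.05$, while $1 - S^2 = 0.0975$, nearly double. For distant pairs ($S$ near 0) both distances approach 1, so the transformation mainly stretches the range occupied by close paralogs.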

3. Multi-scale Similarity Network Fusion (SNF)

Adapts the SNF framework (Wang et al., 2014) for multi-omics sequences.

Metric: Uses Cosine distance instead of Euclidean.
Multi-scale: Executes cross-diffusion across a spectrum of $K$ values (e.g., 10, 20, 40) and averages the result to minimize parameter bias.
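
The multi-scale idea can be sketched as follows. This is a simplified illustration based on the general scheme of Wang et al. (2014), not GeneCast's own code: the function names, the kernel parameter `mu`, the exclusion of self-neighbors in the local kernel, and the per-iteration symmetrization are all assumptions. The inputs are precomputed distance matrices, for example the cosine-based distances described above.

```python
import numpy as np


def affinity(D, K, mu=0.5):
    """Scaled exponential kernel from a distance matrix."""
    D = np.asarray(D, dtype=float)
    # Mean distance to the K nearest neighbours; column 0 is the self-distance.
    knn_mean = np.sort(D, axis=1)[:, 1:K + 1].mean(axis=1)
    eps = (knn_mean[:, None] + knn_mean[None, :] + D) / 3.0
    return np.exp(-(D ** 2) / (mu * eps + 1e-12))


def full_kernel(W):
    """Row-normalize off-diagonal mass to 1/2 and fix the diagonal at 1/2."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)
    P = W / (2.0 * W.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P


def local_kernel(W, K):
    """Keep each row's K strongest non-self affinities, then row-normalize."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)
    rows = np.arange(W.shape[0])[:, None]
    top = np.argsort(-W, axis=1)[:, :K]
    S = np.zeros_like(W)
    S[rows, top] = W[rows, top]
    return S / S.sum(axis=1, keepdims=True)


def snf(dists, K=20, t=20):
    """Fuse two or more distance matrices by cross-diffusion for a single K."""
    if len(dists) < 2:
        raise ValueError("SNF needs at least two views")
    Ws = [affinity(D, K) for D in dists]
    Ps = [full_kernel(W) for W in Ws]
    Ss = [local_kernel(W, K) for W in Ws]
    for _ in range(t):
        updated = []
        for v, S in enumerate(Ss):
            others = [P for u, P in enumerate(Ps) if u != v]
            updated.append(S @ (sum(others) / len(others)) @ S.T)
        # Symmetrize and renormalize after every iteration for stability.
        Ps = [full_kernel((P + P.T) / 2.0) for P in updated]
    fused = sum(Ps) / len(Ps)
    return (fused + fused.T) / 2.0


def multiscale_snf(dists, K_values=(10, 20, 40), t=20):
    """Average the fused networks obtained for each neighbourhood size K."""
    return sum(snf(dists, K, t) for K in K_values) / len(K_values)
```

Each K must be smaller than the number of sequences. Averaging over several K values means no single neighborhood size determines the result, which is the motivation given above for the multi-scale variant.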

4. Outlier Detection

Calculates an anomaly score based on the average distance to $k$ nearest neighbors. Uses Tukey's fence ($Q_3 + \alpha \cdot \mathrm{IQR}$) to flag and exclude noise.
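
A minimal sketch of this scoring rule, assuming a precomputed distance matrix with a zero diagonal. The function name and the defaults `k=5` and `alpha=1.5` (the conventional Tukey multiplier) are assumptions, not GeneCast's documented settings.

```python
import numpy as np


def knn_outliers(D, k=5, alpha=1.5):
    """Flag samples whose mean k-NN distance exceeds Q3 + alpha * IQR."""
    D = np.asarray(D, dtype=float)
    # Column 0 of each sorted row is the zero self-distance, so skip it.
    scores = np.sort(D, axis=1)[:, 1:k + 1].mean(axis=1)
    q1, q3 = np.percentile(scores, [25, 75])
    fence = q3 + alpha * (q3 - q1)
    return scores, scores > fence


if __name__ == "__main__":
    points = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 100.0])
    D = np.abs(points[:, None] - points[None, :])
    scores, is_outlier = knn_outliers(D, k=2)
    print(np.flatnonzero(is_outlier))  # indices of flagged samples
```

Only the upper fence is used because a small k-NN distance indicates a sequence sitting in a dense region, which is not anomalous.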

5. Optimal Cluster Number Estimation

Uses the Eigengap Heuristic on the Laplacian eigenvalues of the fused network. We extract the smallest eigenvalues $0 \approx \lambda_1 \le \dots \le \lambda_{k_{max}}$ of the normalized Laplacian. The optimal number of clusters is the $k$ that maximizes the gap between consecutive eigenvalues: $$ k^* = \operatorname*{argmax}_{2 \le k < k_{max}} (\lambda_{k+1} - \lambda_k) $$
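
The heuristic can be sketched in a few lines of numpy. This is an illustration under assumptions: the function name is hypothetical, the symmetric normalized Laplacian $L = I - D^{-1/2} W D^{-1/2}$ is assumed, and every node is assumed to have positive degree.

```python
import numpy as np


def eigengap_k(W, k_max=15):
    """Estimate k* as the k in [2, k_max) with the largest lambda_{k+1} - lambda_k."""
    W = np.asarray(W, dtype=float)
    W = (W + W.T) / 2.0
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    lam = np.linalg.eigvalsh(L)[:k_max]  # ascending, so lam[0] is lambda_1
    ks = np.arange(2, len(lam))          # candidate k = 2 .. k_max - 1
    gaps = lam[ks] - lam[ks - 1]         # lambda_{k+1} - lambda_k (0-based shift)
    return int(ks[np.argmax(gaps)]), lam


if __name__ == "__main__":
    # Three blocks of five strongly connected nodes, weakly linked to each other.
    W = np.full((15, 15), 0.01)
    for b in range(3):
        W[5 * b:5 * b + 5, 5 * b:5 * b + 5] = 1.0
    k_star, lam = eigengap_k(W, k_max=10)
    print(k_star)
```

In the toy network the first three eigenvalues are close to 0 and the rest are close to 1, so the largest gap falls between $\lambda_3$ and $\lambda_4$ and the estimate is three clusters.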

6. Hierarchical Clustering

Constructs a dendrogram using Ward's minimum variance method on the distance matrix derived from the fused similarity network. The tree is partitioned into $k^*$ clusters.
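
With scipy, this step might look like the sketch below. The function name and the similarity-to-distance conversion ($D = 1 - S / \max S$) are assumptions made for illustration; the conversion used by `genecast ward` may differ.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def ward_clusters(S, k):
    """Ward linkage on a similarity matrix, cut into k flat clusters."""
    S = np.asarray(S, dtype=float)
    S = (S + S.T) / 2.0
    D = 1.0 - S / S.max()  # assumed similarity-to-distance conversion
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="ward")
    labels = fcluster(Z, t=k, criterion="maxclust")
    return labels, Z


if __name__ == "__main__":
    S = np.full((6, 6), 0.1)
    S[:3, :3] = 0.9
    S[3:, 3:] = 0.9
    np.fill_diagonal(S, 1.0)
    labels, Z = ward_clusters(S, k=2)
    print(labels)
```

Ward's criterion is formally defined for Euclidean distances, so applying it to a distance derived from a fused similarity is a pragmatic choice. The linkage matrix `Z` can be passed to `scipy.cluster.hierarchy.dendrogram` for plotting.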

Technical Details

Performance

The framework is designed to be lightweight.

Complexity: Feature extraction is linear in sequence length. SNF and clustering are implemented as matrix operations (using numpy/scipy).
Running Time: Depends on the number of sequences ($N$). SNF is roughly $O(N^2)$ and remains efficient for typical gene family sizes ($N < 5000$).

Accuracy Measures

External Validation: Adjusted Rand Index (ARI), Normalized Mutual Information (NMI).
Internal Validation: Eigengap size, Silhouette scores (in reports).
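
These measures are available in scikit-learn, which is already a dependency. The helper below is a hypothetical sketch of how a clustering could be scored against reference annotations; it is not the code used to produce GeneCast's reports.

```python
import numpy as np
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score,
                             silhouette_score)


def score_partition(true_labels, pred_labels, D=None):
    """External scores (ARI, NMI) plus an optional silhouette on distances D."""
    scores = {
        "ARI": adjusted_rand_score(true_labels, pred_labels),
        "NMI": normalized_mutual_info_score(true_labels, pred_labels),
    }
    if D is not None:
        # D must be a square distance matrix with a zero diagonal.
        scores["silhouette"] = silhouette_score(
            np.asarray(D, dtype=float), pred_labels, metric="precomputed")
    return scores


if __name__ == "__main__":
    true = [0, 0, 0, 1, 1, 1]
    pred = [1, 1, 1, 0, 0, 0]  # same partition, different label names
    print(score_partition(true, pred))
```

ARI and NMI compare partitions rather than label names, so a clustering that recovers the reference families under different cluster IDs still scores 1.0.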

References

SNF: Wang, B., et al. (2014). Similarity network fusion for aggregating data types on a genomic scale. Nature Methods.
Ward's Method: Ward, J. H., Jr. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association.
Genecast Team: 3016, 3087, 3092, 3143, 3025.

Release history

0.1.3 (this version): Dec 12, 2025
0.1.2: Dec 12, 2025
0.1.1: Dec 12, 2025
0.1.0: Dec 12, 2025


File details

Details for the file genecast-0.1.3.tar.gz.

File metadata

Download URL: genecast-0.1.3.tar.gz
Upload date: Dec 12, 2025
Size: 83.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for genecast-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`737aca135103c9bff8add99f67c200c6271930010a1044cef3f35a518a515439`
MD5	`3e02f3a89c5bf8fbcfd6241acc409da9`
BLAKE2b-256	`20b8767b8b1e180d575da5875081f337d1f19b571702f98dd00aec86aed54f4a`


File details

Details for the file genecast-0.1.3-py3-none-any.whl.

File metadata

Download URL: genecast-0.1.3-py3-none-any.whl
Upload date: Dec 12, 2025
Size: 93.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for genecast-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d857010691ea505b8db2314d6492b083adcda082adae9654df951d13c6230fe5`
MD5	`8bfd3136e3cc788041993e6c14dfbedf`
BLAKE2b-256	`ad21e00e2fb9619d27d6cca54a36bc00ae01d126775067493508fa7585144a30`

