Semi-automated protein discovery pipeline using BLAST, quality control, and language model clustering

These details have not been verified by PyPI

Project links

Project description

UHT-Discovery: a workspace for pLM-clust

This codebase is designed for semi-automated protein discovery, using BLAST tools for sequence collection, quality-control filtering, clustering using language models and in silico candidate selection. It is built around the plm-clust pipeline for enzyme discovery. See the preprint for more details.

Installation

From Source

# Clone the repository
git clone <repository-url>
cd uht-discovery-package

# Install the package
pip install -e .

From PyPI

pip install uht-discovery

Quick Start

Interactive GUI (Recommended)

Launch the interactive web interface:

uht-gui

Or as a Python module:

python -m uht_discovery.gui

The GUI will open in your browser at http://127.0.0.1:7860 and provides:

BLASTER: Interactive BLAST sequence search with progress tracking
TRIM: Visual length distribution and quality control
PLMCLUSTV2: Interactive clustering with t-SNE/UMAP visualizations

Web Deployment (Modal)

Modal deployment lives in a separate private wrapper repo (e.g., uht-discovery-web). That wrapper installs this package and exposes the NiceGUI app via create_fastapi_app(). Current production route is https://discovery.plmclust.co.uk via Cloudflare Worker -> Modal.

Command-Line Interface (CLI)

All tools are available as command-line utilities:

BLAST Sequence Search

uht-blast --project my_project --email your@email.com --hits 500

Sequence Trimming

uht-trim --project my_project --min-length 100 --max-length 500

Protein Clustering (PLMCLUSTV2 - Recommended)

uht-clust --project my_project --clusters 6

Legacy Clustering (PLMCLUST v1)

uht-clust-v1 --project my_project --clusters auto

Other Tools

uht-optimizer --project my_project
uht-mutation --project my_project
uht-comprehensive --project my_project
uht-biophysical --project my_project
uht-sequence --project my_project
uht-phylo --project my_project
uht-tsne --project my_project
uht-kmeans --project my_project

For detailed help on any command:

uht-blast --help
uht-trim --help
uht-clust --help
# etc.

Unified CLI Entry Point

You can also use the main entry point:

uht-discovery [command] [options]

Available commands: blast, trim, clust, clust-v1, optimizer, mutation, comprehensive, biophysical, sequence, phylo, tsne, kmeans, gui

General Structure

This code is set up so that configs and data are found in inputs/.../, where ... depends on the script that you care to run. Likewise, results are saved to results/.../

The currently available commands are:

uht-blast / uht-discovery blast - BLAST sequence search
uht-trim / uht-discovery trim - Sequence quality control
uht-clust / uht-discovery clust - PLMCLUSTV2 clustering (recommended)
uht-clust-v1 / uht-discovery clust-v1 - Legacy PLMCLUST clustering
uht-optimizer / uht-discovery optimizer - Mutation optimization
uht-mutation / uht-discovery mutation - Mutation testing
uht-comprehensive / uht-discovery comprehensive - Comprehensive benchmarking
uht-biophysical / uht-discovery biophysical - Biophysical signals analysis
uht-sequence / uht-discovery sequence - Sequence similarity analysis
uht-phylo / uht-discovery phylo - Phylogenetic analysis
uht-tsne / uht-discovery tsne - t-SNE visualization
uht-kmeans / uht-discovery kmeans - K-means benchmarking
uht-gui / uht-discovery gui - Launch interactive GUI

blaster

Retrieve sequences using NCBI's API

Very simply, place one or more .fasta sequences in inputs/blaster/PROJECT/

Where PROJECT is the name of your project. These sequences can be one, or multiple entries and/or fasta files.

You have several parameters to play with in config.yaml:

blaster_project_directory: PROJECT <- This is the name of the PROJECT to focus on. This allows you to hold multiple projects in inputs/ without BLASTing all of them.

blaster_num_hits: 500 <- Number of sequences to retrieve from NCBI

blaster_blast_db: nr <- database to search

blaster_evalue: 1e-5 <- e value cutoff

blaster_email: example@example.ac.uk <- your email (required to access NCBI's API)

You may then run using the CLI:

uht-blast --project PROJECT --email your@email.com

Or using the legacy run.py script:

./run.py blaster

The results will be saved in /results/blaster/PROJECT/... and they will be both the fasta file of hits, with the queries at the top of the list, and a BLAST report .txt file for traceability.

Hits are automatically de-duplicated.

trim_blaster

Perform basic (but important) quality control on the outputs of blaster.

After placing the BLASTer output (or indeed, any .fasta file), in inputs/trim/PROJECT/ where PROJECT is defined with trim_project_directory: in the config, simply run:

uht-trim --project PROJECT --min-length 100 --max-length 500

Or using the legacy run.py script:

./run.py trim

You will be presented with a histogram of gene lengths with the length of the top member of the .fasta (your first input sequence) indicated with a red line. Use this to select a sensible length cutoff range, which you then enter into the terminal after closing the plot. A trimmed output will be saved in results/trim_blaster/PROJECT/, with all non-amino acid entries (e.g. X) removed as well as any sequence failing the length criteria. A detailed report of the process and your selected thresholds will also be saved.

plm-clust

Cluster and rank the metagenome using ESM2

Place your .fasta files in inputs/PROJECT/...

Where you can name PROJECT anything you like. Specify the name PROJECT in plmclust_target_directory: PROJECT_NAME in config.yaml.

You can also set the number of clusters (plmclust_cluster_number) to any integer, or 'auto', which will automatically find an optimal number of clusters using silhouette scoring.

The perplexity scores contain a degree of randomness due to the random masking procedure - to account for this, you can set a number of replicates for this process, defined by plm_clust_replicates. You can set this to 0 if you are only interested in clustering and not scoring. If you elect to plm_clust_replicates >=3, you will also receive a standard deviation for the perplexity scores for each sequence.

Then, you can simply run using the CLI:

uht-clust-v1 --project PROJECT_NAME --clusters auto

Or using the legacy run.py script:

./run.py plmclust

Results will be found in /results/PROJECT/

These include a sub-directory containing a fasta file of each of the clusters, a .csv file pLM-clust_sequences_with_metrics_PROJECT.csv containing sequences, cluster identities, perplexities, and standard deviations.

The silhouette search will also be saved, along with coordinates to replicate the tSNE as a .csv file.

The top representative from each cluster is saved as top_cluster_representatives_PROJECT.fasta

plmclustv2

An improved version of plm-clust that uses single-pass scoring instead of masking-based scoring for better efficiency and consistency.

Key Improvements

Single-pass scoring: Uses direct NLL computation instead of random masking
Efficient execution: Computes embeddings and scores in one model pass
Consistent results: No randomness from masking procedures
Same interface: Maintains compatibility with plmclust v1

Setup

Place your .fasta files in inputs/plmclustv2/PROJECT/... where PROJECT is your project name.

Configure in config.yaml:

plmclustv2_project_directory: PROJECT_NAME
plmclustv2_cluster_number: 6  # or 'auto' for automatic selection

Usage

Using CLI (recommended):

uht-clust --project PROJECT_NAME --clusters 6

Using legacy run.py:

./run.py plmclustv2

Outputs

Results are saved in results/plmclustv2/PROJECT/:

pLM-clustv2_sequences_with_metrics_PROJECT.csv: Sequences with cluster assignments and NLL scores
top_cluster_representatives_PROJECT.fasta: Best representative from each cluster
tsne_coordinates_PROJECT.csv: t-SNE coordinates for visualization
tsne_clusters_PROJECT.png: t-SNE visualization plot
6_clusters_PROJECT/: Individual cluster FASTA files
run_log_PROJECT.txt: Detailed run log

Scoring Method

Unlike plmclust v1 which uses random masking, plmclustv2:

Computes embeddings and scores in a single model pass
Uses direct negative log-likelihood (NLL) scoring
Provides consistent, non-random results
Is significantly faster than the masking approach

When to Use

Use plmclustv2 for: Faster execution, consistent results, single-pass efficiency
Use plmclust v1 for: Legacy compatibility, masking-based scoring if specifically needed

optimizer

Leverage the outputs of plm-clust to probe a local fitness landscape, predicting mutants that will be improving when tested against many backgrounds in the cluster.

Optimise works by taking a .fasta file, then assessing each member for its average NLL according to ESM2. Taking the best sequence, we then build an MSA using ClustalO, and identify all single mutants present in the cluster relative to the reference. We then evaluate each single mutant in the context of a large number of homologues, taking the average effect of the mutation in the backgrounds as the fitness of the mutation. This is compared against the predicted effect of the mutation in the reference background. Improving mutations are then combined, and tested in the same way, up to a max_order.

Simply place a .fasta file in /inputs/optimizer/PROJECT/ and state the PROJECT in optimizer_project_directory.

Also set the max_sequences_to_de_denoise as the number of randomly-selected homologues to test each mutation in - a higher number will lead to better performance but slower speed. If you care about a specific reference sequence and don't want it auto-detected, set selected_sequence: to the fasta header of your sequence of interest.

Then run

./run.py optimizer

comprehensive_clustering

Perform comprehensive analysis of clustering results using PCA visualization and statistical comparison.

This module analyzes clustering results by comparing ESM2+kmeans clustering with MMseqs2 clustering, providing detailed visualizations and statistical insights into protein properties and clustering quality.

Setup

Place your cluster FASTA files in inputs/comprehensive_clustering/PROJECT/clusters/ where each file is named cluster_X_of_Y_PROJECT.fasta (e.g., cluster_0_of_6_bsgh11.fasta).

Configure the project in config.yaml:

comprehensive_clustering_project_directory: PROJECT_NAME

Usage

./run.py comprehensive_clustering

Outputs

The analysis generates comprehensive outputs in results/comprehensive_clustering/PROJECT/:

Figures:

figures/pca_plots/: PCA visualizations comparing clustering methods
figures/violin_plots/: Distribution analysis for 16 protein properties by cluster
figures/summary_plots/: Summary comparison plots

Data:

data/pca_results.csv: PCA coordinates and cluster assignments

Reports:

reports/comprehensive_analysis_report.md: Detailed analysis report

Analysis Components

PCA Analysis: Dimensionality reduction using 16 protein properties
Cluster-Specific Violin Plots: Distribution analysis for each metric by cluster
Feature Importance: Understanding which properties drive clustering
Statistical Comparison: Quantitative comparison of clustering methods
PCA-Clustering Agreement: Analysis of how well each method aligns with PCA structure

Key Metrics Analyzed

Protein length, molecular weight, GRAVY score, isoelectric point
Secondary structure content (helix, sheet, turn fractions)
Amino acid composition patterns (hydrophobic, hydrophilic, charged, etc.)
Charge at pH 7

Example Results

The analysis provides insights such as:

Which clustering method better aligns with protein property structure
Cluster-specific property distributions
Feature importance in driving clustering decisions
Statistical evidence of clustering quality differences

mask-free-nllvsperplexity

This is a member of the *_vsperplexity suite, designed to test the method chosen for estimating perplexity in the plm-cluster method.

The other metric calculated here, mask-free-nll, runs a single-pass of the sequence through the model, and extracts the negative log liklihoods of each token, reporting the average.

This is compared against the 'perplexity' metric used by plm-clust, which masks 15% of the sequence at random and calculates the average NLL of the masked sequences. This process is repeated N times and the outputs averaged per sequence.

The output is a grid, with increasing replicates of masking from top to bottom and left to right, with the diagonal carrying the comparison to the mask-free-nll of each number of replicates. The numbers in the boxes represent the pearson R.

After setting up plm-clust inputs as normal, run

make mask-free-nllvsperplexity

saturationvsperplexity

This is a member of the *_vsperplexity suite, designed to test the method chosen for estimating perplexity in the plm-cluster method.

The other metric calculated here, 'saturation', is the canonical way of calculating pseudo-NLL, carried out by masking each token sequentially and calculating the average NLL over each mask.

After setting up plm-clust inputs as normal, run

make saturationvsperplexity

singleinferencevsperplexity

This is a member of the *_vsperplexity suite, designed to test the method chosen for estimating perplexity in the plm-cluster method.

The other metric calculated here, 'singleinference', is a sinlge-pass method developed by Gordon et al. that is designed to account explicitly for biases in the training process when extracting log likehoods.

After setting up plm-clust inputs as normal, run

make singleinferencevsperplexity

Biophysical Signals Analysis

Analyze biophysical signals that drive protein clustering to understand what factors influence clustering decisions.

Features

25 biophysical signals tested across 6 categories
Statistical rigor with ANOVA, effect sizes, and multiple validation metrics
Publication-quality visualizations (30+ figures generated)
Modular design for easy addition of new signals
Integration with uht-discovery project conventions

Key Findings

Primary discovery: The clustering primarily captures evolutionary relationships rather than functional characteristics
Top signals: N-glycosylation sites (η² = 0.659), signal peptide presence (η² = 0.635), isoelectric point (η² = 0.620)
Charge properties are the strongest drivers (η² = 0.620 average)

Usage

1. Prepare input data

Place your cluster FASTA files in inputs/biophysical_signals/clusters/:

inputs/biophysical_signals/
└── clusters/
    ├── cluster_0_of_6_bsgh11.fasta
    ├── cluster_1_of_6_bsgh11.fasta
    ├── cluster_2_of_6_bsgh11.fasta
    ├── cluster_3_of_6_bsgh11.fasta
    ├── cluster_4_of_6_bsgh11.fasta
    └── cluster_5_of_6_bsgh11.fasta

2. Run analysis

python run.py biophysical_signals

3. View results

Results are saved to results/biophysical_signals/:

Reports: reports/biophysical_signals_report.md
Figures: figures/biophysical_plots/, figures/statistical_plots/, etc.
Data: data/ (processed data files)

Configuration

Configure in config.yaml:

### biophysical_signals ###
biophysical_signals_project_directory: inputs/biophysical_signals/clusters

Or use environment variables:

export BIOPHYSICAL_SIGNALS_PROJECT_ID=/path/to/your/clusters
python run.py biophysical_signals

Adding New Signals

Create a new signal class inheriting from BiophysicalSignal
Implement the calculate method
Add to the framework in core/biophysical_signals_analysis.py
Run analysis: python run.py biophysical_signals

For detailed documentation, see BIOPHYSICAL_SIGNALS_README.md.

Sequence Classifier Module (iteration0classifier)

This module provides a fast, CNN+attention-based classifier for protein sequences, allowing rapid assignment of new sequences to clusters based on primary sequence alone. It is designed for high-throughput screening and is much faster than ESM2+clustering.

Features

Input: Primary sequence (padded/truncated to 1024, with start/end tokens)
Output: Cluster assignment + confidence score (softmax probability)
Special "unrelated" class populated by random sequences
QC: Removes non-amino-acid characters, truncates to 1024
Professional-grade: Publication-ready metrics, plots, and robust code
Dummy data/test script included

Directory Structure

training/iteration0classifier/
    dummy_training_data.csv      # Example training data
    dummy_headers.txt           # Example headers file
    dummy_test_sequences.fasta  # Example FASTA for prediction
    sequence_classifier.pth     # Trained model
    sequence_classifier_metadata.pkl # Model metadata
    evaluation_results.json     # Metrics/results
    training_history.png        # Training/validation loss/accuracy
    confusion_matrix.png        # Confusion matrix
    predictions.csv             # Predictions on new FASTA

Input Formats

Training CSV: Must have columns sequence and cluster, plus any experimental columns (optional)
Headers file: List of experimental column names (one per line)
FASTA: For prediction, standard FASTA format

Usage

1. Train the classifier

python -c "from core.classifier import run_classifier_training; run_classifier_training('training/iteration0classifier/dummy_training_data.csv', 'training/iteration0classifier/dummy_headers.txt')"

2. Predict on new sequences

python -c "from core.classifier import predict_from_fasta; predict_from_fasta('training/iteration0classifier/dummy_test_sequences.fasta')"

3. Run the test suite (dummy data)

python test_classifier.py

Notes

The model is a 1D CNN with attention, trained with early stopping and cross-validation.
The "unrelated" class is generated automatically from random sequences.
All results and plots are saved in training/iteration0classifier/.
For real data, place your CSV and headers file in the same directory and follow the same commands.

Phylogenetic Analysis

Perform comprehensive phylogenetic analysis comparing evolutionary relationships with protein language model embeddings.

Features

Multiple sequence alignment using MAFFT
Phylogenetic tree construction with distance matrix computation
ESM2 embedding analysis with cosine and dot product similarities
Correlation analysis between phylogenetic and embedding distances
Independent processing of multiple FASTA files with keepseparate mode

Usage

Setup

Place your FASTA files in inputs/phylo/PROJECT/:

inputs/phylo/cazy/
├── GH16_18_sequences_trimmed.fasta
├── GH43_19_sequences_trimmed.fasta
└── GH47_sequences_trimmed.fasta

Configuration

Configure in config.yaml:

### phylo ###
phylo_project_directory: cazy
phylo_keepseparate: true  # Process each FASTA independently

Run Analysis

./run.py phylo

Outputs

Results saved in results/phylo/PROJECT/:

Separate mode (when phylo_keepseparate: true):

Individual directories for each FASTA file
summary_log_PROJECT.txt with correlation metrics for all files
Per-file phylogenetic trees, distance matrices, and visualizations

Merged mode (when phylo_keepseparate: false):

Single analysis combining all FASTA files
Unified phylogenetic tree and analysis

Key Metrics

Pearson correlation between phylogenetic and embedding distances
R-squared values for correlation strength
Spearman correlation for non-linear relationships
P-values for statistical significance
Sample sizes for each comparison

Sequence Similarity Analysis

Comprehensive analysis of sequence similarity patterns using multiple metrics and visualization approaches.

Features

Multiple similarity metrics: Identity, BLOSUM62, PAM250, and more
Statistical analysis: Distribution analysis and clustering evaluation
Heatmap visualizations for similarity matrices
Unit-based processing for organized analysis of multiple sequence groups

Usage

Setup

Place your FASTA files in unit directories under inputs/sequence_similarity/PROJECT/:

inputs/sequence_similarity/cazy/
├── 10_clusters_GH16_18_sequences_trimmed/
│   ├── cluster_0.fasta
│   └── cluster_1.fasta
└── 8_clusters_GH90_sequences_trimmed/
    ├── cluster_0.fasta
    └── cluster_1.fasta

Configuration

### sequence_similarity ###
sequence_similarity_project_directory: cazy

Run Analysis

./run.py sequence_similarity

Outputs

Results in results/sequence_similarity/PROJECT/:

Individual unit directories with similarity matrices
Comprehensive heatmaps and distribution plots
Statistical summaries and CSV data files
Summary log with per-unit results

K-means Benchmarking

Rigorous assessment of k-means clustering reproducibility with statistical controls and algorithm comparisons.

Features

Reproducibility assessment across multiple k values and runs
Statistical controls: Random permutation and random data baselines
Algorithm comparison: K-means vs Spectral clustering
Publication-ready visualizations with proper 0-1 scaling
Independent processing with keepseparate mode

Usage

Setup

Place FASTA files in inputs/kmeansbenchmark/PROJECT/:

inputs/kmeansbenchmark/cazy/
├── GH16_18_sequences_trimmed.fasta
├── GH43_19_sequences_trimmed.fasta
└── GH47_sequences_trimmed.fasta

Configuration

### kmeansbenchmark ###
kmeansbenchmark_project_directory: cazy
kmeansbenchmark_k_range: [2, 10]
kmeansbenchmark_n_runs: 10
kmeansbenchmark_keepseparate: true  # Process each FASTA independently
kmeansbenchmark_run_controls: true
kmeansbenchmark_compare_algorithms: true

Run Analysis

./run.py kmeansbenchmark

Outputs

Results in results/kmeansbenchmark/PROJECT/:

Per FASTA file (when keepseparate: true):

Stability metrics and reproducibility heatmaps
Algorithm comparison plots (K-means vs Spectral)
Control experiment results
Comprehensive analysis reports

Key Visualizations:

Stability plots: ARI/NMI stability across k values (0-1 scaled)
Algorithm comparison: Performance comparison excluding deterministic hierarchical
Control baselines: Random permutation and random data controls
Reproducibility heatmaps: Pairwise clustering agreement matrices

Scientific Rigor

Fair controls: Random permutation control uses different permutations per replicate
Statistical validation: Multiple metrics (ARI, NMI, Jaccard) with significance testing
Reproducible results: Seeded random number generation for consistent results
Publication quality: Professional visualizations with proper scaling and clean layouts

Enhanced Module Features

Universal `keepseparate` Functionality

Multiple modules now support independent processing of FASTA files:

plmclustv2: Process each FASTA file separately with individual clustering results
phylogenetic_analysis: Independent phylogenetic analysis per FASTA file
kmeansbenchmark: Separate reproducibility assessment for each FASTA file
biophysical_signals: Unit-based analysis with aggregated feature importance
sequence_similarity: Unit-based similarity analysis

Auto-trimming in Trim Module

The trim module now supports automatic length threshold selection:

trim_project_directory: cazy
auto_mode: true  # Automatically determine length thresholds

Auto-trimming logic:

Finds most common sequence length (or average if tied)
Calculates standard deviation of lengths
Sets thresholds as reference_length ± 1 standard deviation
Provides detailed reporting of the automatic selection process

Enhanced Biophysical Signals

Aggregated feature importance plots across all processed units
Skip violin plots option for faster execution
Error bar visualization showing feature importance variability
CSV export of aggregated results for further analysis

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.2.11

Mar 18, 2026

0.2.10

Mar 16, 2026

0.2.9

Mar 16, 2026

0.2.7

Feb 13, 2026

This version

0.2.4

Feb 12, 2026

0.2.3

Feb 12, 2026

0.2.2

Feb 11, 2026

0.2.1

Feb 11, 2026

0.2.0

Feb 11, 2026

0.1.2

Jan 16, 2026

0.1.1

Jan 16, 2026

0.1.0

Dec 16, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

uht_discovery-0.2.4.tar.gz (151.6 kB view details)

Uploaded Feb 12, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

uht_discovery-0.2.4-py3-none-any.whl (155.5 kB view details)

Uploaded Feb 12, 2026 Python 3

File details

Details for the file uht_discovery-0.2.4.tar.gz.

File metadata

Download URL: uht_discovery-0.2.4.tar.gz
Upload date: Feb 12, 2026
Size: 151.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.15

File hashes

Hashes for uht_discovery-0.2.4.tar.gz
Algorithm	Hash digest
SHA256	`4471f3f7f976507e8544c8791c78fd26995eb4a2132787dc7ebeabc25caf1961`
MD5	`6106344b974738854b7396e30e80fcdf`
BLAKE2b-256	`2ae18fe13ad509a4c5fae817d8f2ef8bad115db1abfda0948bf55bfb92522c74`

See more details on using hashes here.

File details

Details for the file uht_discovery-0.2.4-py3-none-any.whl.

File metadata

Download URL: uht_discovery-0.2.4-py3-none-any.whl
Upload date: Feb 12, 2026
Size: 155.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.15

File hashes

Hashes for uht_discovery-0.2.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e8e2ffced072c3d0426a265cee6111de5f5e8a4abe49e00c806692a8922eabe4`
MD5	`ef964822a936146d139f6d497980699b`
BLAKE2b-256	`e71b82655afeaa4d2c7178f60eba3e64770eb674906f8200e363965316a8715e`

See more details on using hashes here.

uht-discovery 0.2.4

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

UHT-Discovery: a workspace for pLM-clust

Installation

From Source

From PyPI

Quick Start

Interactive GUI (Recommended)

Web Deployment (Modal)

Command-Line Interface (CLI)

BLAST Sequence Search

Sequence Trimming

Protein Clustering (PLMCLUSTV2 - Recommended)

Legacy Clustering (PLMCLUST v1)

Other Tools

Unified CLI Entry Point

General Structure

blaster

trim_blaster

plm-clust

plmclustv2

Key Improvements

Setup

Usage

Outputs

Scoring Method

When to Use

optimizer

comprehensive_clustering

Setup

Usage

Outputs

Analysis Components

Key Metrics Analyzed

Example Results

mask-free-nllvsperplexity

saturationvsperplexity

singleinferencevsperplexity

Biophysical Signals Analysis

Features

Key Findings

Usage

1. Prepare input data

2. Run analysis

3. View results

Configuration

Adding New Signals

Sequence Classifier Module (iteration0classifier)

Features

Directory Structure

Input Formats

Usage

1. Train the classifier

2. Predict on new sequences

3. Run the test suite (dummy data)

Notes

Phylogenetic Analysis

Features

Usage

Setup

Configuration

Run Analysis

Outputs

Key Metrics

Sequence Similarity Analysis

Features

Usage

Setup

Configuration

Run Analysis

Outputs

K-means Benchmarking

Features

Universal `keepseparate` Functionality