Semi-automated protein discovery pipeline using BLAST, quality control, and language model clustering
Project description
UHT-Discovery: a workspace for pLM-clust
This codebase is designed for semi-automated protein discovery, using BLAST tools for sequence collection, quality-control filtering, clustering using language models and in silico candidate selection. It is built around the plm-clust pipeline for enzyme discovery. See the preprint for more details.
Installation
From Source
# Clone the repository
git clone <repository-url>
cd uht-discovery-package
# Install the package
pip install -e .
From PyPI
pip install uht-discovery
Quick Start
Interactive GUI (Recommended)
Launch the interactive web interface:
uht-gui
Or as a Python module:
python -m uht_discovery.gui
The GUI will open in your browser at http://127.0.0.1:7860 and provides:
- BLASTER: Interactive BLAST sequence search with progress tracking
- TRIM: Visual length distribution and quality control
- PLMCLUSTV2: Interactive clustering with t-SNE/UMAP visualizations
Web Deployment (Modal)
Modal deployment lives in a separate private wrapper repo (e.g., uht-discovery-web).
That wrapper installs this package and exposes the NiceGUI app via create_fastapi_app().
Current production route is https://discovery.plmclust.co.uk via Cloudflare Worker -> Modal.
Command-Line Interface (CLI)
All tools are available as command-line utilities:
BLAST Sequence Search
uht-blast --project my_project --email your@email.com --hits 500
Sequence Trimming
uht-trim --project my_project --min-length 100 --max-length 500
Protein Clustering (PLMCLUSTV2 - Recommended)
uht-clust --project my_project --clusters 6
Legacy Clustering (PLMCLUST v1)
uht-clust-v1 --project my_project --clusters auto
Other Tools
uht-optimizer --project my_project
uht-mutation --project my_project
uht-comprehensive --project my_project
uht-biophysical --project my_project
uht-sequence --project my_project
uht-phylo --project my_project
uht-tsne --project my_project
uht-kmeans --project my_project
For detailed help on any command:
uht-blast --help
uht-trim --help
uht-clust --help
# etc.
Unified CLI Entry Point
You can also use the main entry point:
uht-discovery [command] [options]
Available commands: blast, trim, clust, clust-v1, optimizer, mutation, comprehensive, biophysical, sequence, phylo, tsne, kmeans, gui
General Structure
This code is set up so that configs and data are found in inputs/.../, where ... depends on the script that you care to run. Likewise, results are saved to results/.../
The currently available commands are:
uht-blast/uht-discovery blast- BLAST sequence searchuht-trim/uht-discovery trim- Sequence quality controluht-clust/uht-discovery clust- PLMCLUSTV2 clustering (recommended)uht-clust-v1/uht-discovery clust-v1- Legacy PLMCLUST clusteringuht-optimizer/uht-discovery optimizer- Mutation optimizationuht-mutation/uht-discovery mutation- Mutation testinguht-comprehensive/uht-discovery comprehensive- Comprehensive benchmarkinguht-biophysical/uht-discovery biophysical- Biophysical signals analysisuht-sequence/uht-discovery sequence- Sequence similarity analysisuht-phylo/uht-discovery phylo- Phylogenetic analysisuht-tsne/uht-discovery tsne- t-SNE visualizationuht-kmeans/uht-discovery kmeans- K-means benchmarkinguht-gui/uht-discovery gui- Launch interactive GUI
blaster
Retrieve sequences using NCBI's API
Very simply, place one or more .fasta sequences in inputs/blaster/PROJECT/
Where PROJECT is the name of your project. These sequences can be one, or multiple entries and/or fasta files.
You have several parameters to play with in config.yaml:
blaster_project_directory: PROJECT <- This is the name of the PROJECT to focus on. This allows you to hold multiple projects in inputs/ without BLASTing all of them.
blaster_num_hits: 500 <- Number of sequences to retrieve from NCBI
blaster_blast_db: nr <- database to search
blaster_evalue: 1e-5 <- e value cutoff
blaster_email: example@example.ac.uk <- your email (required to access NCBI's API)
You may then run using the CLI:
uht-blast --project PROJECT --email your@email.com
Or using the legacy run.py script:
./run.py blaster
The results will be saved in /results/blaster/PROJECT/... and they will be both the fasta file of hits, with the queries at the top of the list, and a BLAST report .txt file for traceability.
Hits are automatically de-duplicated.
trim_blaster
Perform basic (but important) quality control on the outputs of blaster.
After placing the BLASTer output (or indeed, any .fasta file), in inputs/trim/PROJECT/ where PROJECT is defined with trim_project_directory: in the config, simply run:
uht-trim --project PROJECT --min-length 100 --max-length 500
Or using the legacy run.py script:
./run.py trim
You will be presented with a histogram of gene lengths with the length of the top member of the .fasta (your first input sequence) indicated with a red line. Use this to select a sensible length cutoff range, which you then enter into the terminal after closing the plot. A trimmed output will be saved in results/trim_blaster/PROJECT/, with all non-amino acid entries (e.g. X) removed as well as any sequence failing the length criteria. A detailed report of the process and your selected thresholds will also be saved.
plm-clust
Cluster and rank the metagenome using ESM2
Place your .fasta files in inputs/PROJECT/...
Where you can name PROJECT anything you like. Specify the name PROJECT in plmclust_target_directory: PROJECT_NAME in config.yaml.
You can also set the number of clusters (plmclust_cluster_number) to any integer, or 'auto', which will automatically find an optimal number of clusters using silhouette scoring.
The perplexity scores contain a degree of randomness due to the random masking procedure - to account for this, you can set a number of replicates for this process, defined by plm_clust_replicates. You can set this to 0 if you are only interested in clustering and not scoring. If you elect to plm_clust_replicates >=3, you will also receive a standard deviation for the perplexity scores for each sequence.
Then, you can simply run using the CLI:
uht-clust-v1 --project PROJECT_NAME --clusters auto
Or using the legacy run.py script:
./run.py plmclust
Results will be found in /results/PROJECT/
These include a sub-directory containing a fasta file of each of the clusters, a .csv file pLM-clust_sequences_with_metrics_PROJECT.csv containing sequences, cluster identities, perplexities, and standard deviations.
The silhouette search will also be saved, along with coordinates to replicate the tSNE as a .csv file.
The top representative from each cluster is saved as top_cluster_representatives_PROJECT.fasta
plmclustv2
An improved version of plm-clust that uses single-pass scoring instead of masking-based scoring for better efficiency and consistency.
Key Improvements
- Single-pass scoring: Uses direct NLL computation instead of random masking
- Efficient execution: Computes embeddings and scores in one model pass
- Consistent results: No randomness from masking procedures
- Same interface: Maintains compatibility with plmclust v1
Setup
Place your .fasta files in inputs/plmclustv2/PROJECT/... where PROJECT is your project name.
Configure in config.yaml:
plmclustv2_project_directory: PROJECT_NAME
plmclustv2_cluster_number: 6 # or 'auto' for automatic selection
Usage
Using CLI (recommended):
uht-clust --project PROJECT_NAME --clusters 6
Using legacy run.py:
./run.py plmclustv2
Outputs
Results are saved in results/plmclustv2/PROJECT/:
pLM-clustv2_sequences_with_metrics_PROJECT.csv: Sequences with cluster assignments and NLL scorestop_cluster_representatives_PROJECT.fasta: Best representative from each clustertsne_coordinates_PROJECT.csv: t-SNE coordinates for visualizationtsne_clusters_PROJECT.png: t-SNE visualization plot6_clusters_PROJECT/: Individual cluster FASTA filesrun_log_PROJECT.txt: Detailed run log
Scoring Method
Unlike plmclust v1 which uses random masking, plmclustv2:
- Computes embeddings and scores in a single model pass
- Uses direct negative log-likelihood (NLL) scoring
- Provides consistent, non-random results
- Is significantly faster than the masking approach
When to Use
- Use plmclustv2 for: Faster execution, consistent results, single-pass efficiency
- Use plmclust v1 for: Legacy compatibility, masking-based scoring if specifically needed
optimizer
Leverage the outputs of plm-clust to probe a local fitness landscape, predicting mutants that will be improving when tested against many backgrounds in the cluster.
Optimise works by taking a .fasta file, then assessing each member for its average NLL according to ESM2. Taking the best sequence, we then build an MSA using ClustalO, and identify all single mutants present in the cluster relative to the reference. We then evaluate each single mutant in the context of a large number of homologues, taking the average effect of the mutation in the backgrounds as the fitness of the mutation. This is compared against the predicted effect of the mutation in the reference background. Improving mutations are then combined, and tested in the same way, up to a max_order.
Simply place a .fasta file in /inputs/optimizer/PROJECT/ and state the PROJECT in optimizer_project_directory.
Also set the max_sequences_to_de_denoise as the number of randomly-selected homologues to test each mutation in - a higher number will lead to better performance but slower speed. If you care about a specific reference sequence and don't want it auto-detected, set selected_sequence: to the fasta header of your sequence of interest.
Then run
./run.py optimizer
comprehensive_clustering
Perform comprehensive analysis of clustering results using PCA visualization and statistical comparison.
This module analyzes clustering results by comparing ESM2+kmeans clustering with MMseqs2 clustering, providing detailed visualizations and statistical insights into protein properties and clustering quality.
Setup
Place your cluster FASTA files in inputs/comprehensive_clustering/PROJECT/clusters/ where each file is named cluster_X_of_Y_PROJECT.fasta (e.g., cluster_0_of_6_bsgh11.fasta).
Configure the project in config.yaml:
comprehensive_clustering_project_directory: PROJECT_NAME
Usage
./run.py comprehensive_clustering
Outputs
The analysis generates comprehensive outputs in results/comprehensive_clustering/PROJECT/:
Figures:
figures/pca_plots/: PCA visualizations comparing clustering methodsfigures/violin_plots/: Distribution analysis for 16 protein properties by clusterfigures/summary_plots/: Summary comparison plots
Data:
data/pca_results.csv: PCA coordinates and cluster assignments
Reports:
reports/comprehensive_analysis_report.md: Detailed analysis report
Analysis Components
- PCA Analysis: Dimensionality reduction using 16 protein properties
- Cluster-Specific Violin Plots: Distribution analysis for each metric by cluster
- Feature Importance: Understanding which properties drive clustering
- Statistical Comparison: Quantitative comparison of clustering methods
- PCA-Clustering Agreement: Analysis of how well each method aligns with PCA structure
Key Metrics Analyzed
- Protein length, molecular weight, GRAVY score, isoelectric point
- Secondary structure content (helix, sheet, turn fractions)
- Amino acid composition patterns (hydrophobic, hydrophilic, charged, etc.)
- Charge at pH 7
Example Results
The analysis provides insights such as:
- Which clustering method better aligns with protein property structure
- Cluster-specific property distributions
- Feature importance in driving clustering decisions
- Statistical evidence of clustering quality differences
mask-free-nllvsperplexity
This is a member of the *_vsperplexity suite, designed to test the method chosen for estimating perplexity in the plm-cluster method.
The other metric calculated here, mask-free-nll, runs a single-pass of the sequence through the model, and extracts the negative log liklihoods of each token, reporting the average.
This is compared against the 'perplexity' metric used by plm-clust, which masks 15% of the sequence at random and calculates the average NLL of the masked sequences. This process is repeated N times and the outputs averaged per sequence.
The output is a grid, with increasing replicates of masking from top to bottom and left to right, with the diagonal carrying the comparison to the mask-free-nll of each number of replicates. The numbers in the boxes represent the pearson R.
After setting up plm-clust inputs as normal, run
make mask-free-nllvsperplexity
saturationvsperplexity
This is a member of the *_vsperplexity suite, designed to test the method chosen for estimating perplexity in the plm-cluster method.
The other metric calculated here, 'saturation', is the canonical way of calculating pseudo-NLL, carried out by masking each token sequentially and calculating the average NLL over each mask.
After setting up plm-clust inputs as normal, run
make saturationvsperplexity
singleinferencevsperplexity
This is a member of the *_vsperplexity suite, designed to test the method chosen for estimating perplexity in the plm-cluster method.
The other metric calculated here, 'singleinference', is a sinlge-pass method developed by Gordon et al. that is designed to account explicitly for biases in the training process when extracting log likehoods.
After setting up plm-clust inputs as normal, run
make singleinferencevsperplexity
Biophysical Signals Analysis
Analyze biophysical signals that drive protein clustering to understand what factors influence clustering decisions.
Features
- 25 biophysical signals tested across 6 categories
- Statistical rigor with ANOVA, effect sizes, and multiple validation metrics
- Publication-quality visualizations (30+ figures generated)
- Modular design for easy addition of new signals
- Integration with uht-discovery project conventions
Key Findings
- Primary discovery: The clustering primarily captures evolutionary relationships rather than functional characteristics
- Top signals: N-glycosylation sites (η² = 0.659), signal peptide presence (η² = 0.635), isoelectric point (η² = 0.620)
- Charge properties are the strongest drivers (η² = 0.620 average)
Usage
1. Prepare input data
Place your cluster FASTA files in inputs/biophysical_signals/clusters/:
inputs/biophysical_signals/
└── clusters/
├── cluster_0_of_6_bsgh11.fasta
├── cluster_1_of_6_bsgh11.fasta
├── cluster_2_of_6_bsgh11.fasta
├── cluster_3_of_6_bsgh11.fasta
├── cluster_4_of_6_bsgh11.fasta
└── cluster_5_of_6_bsgh11.fasta
2. Run analysis
python run.py biophysical_signals
3. View results
Results are saved to results/biophysical_signals/:
- Reports:
reports/biophysical_signals_report.md - Figures:
figures/biophysical_plots/,figures/statistical_plots/, etc. - Data:
data/(processed data files)
Configuration
Configure in config.yaml:
### biophysical_signals ###
biophysical_signals_project_directory: inputs/biophysical_signals/clusters
Or use environment variables:
export BIOPHYSICAL_SIGNALS_PROJECT_ID=/path/to/your/clusters
python run.py biophysical_signals
Adding New Signals
- Create a new signal class inheriting from
BiophysicalSignal - Implement the
calculatemethod - Add to the framework in
core/biophysical_signals_analysis.py - Run analysis:
python run.py biophysical_signals
For detailed documentation, see BIOPHYSICAL_SIGNALS_README.md.
Sequence Classifier Module (iteration0classifier)
This module provides a fast, CNN+attention-based classifier for protein sequences, allowing rapid assignment of new sequences to clusters based on primary sequence alone. It is designed for high-throughput screening and is much faster than ESM2+clustering.
Features
- Input: Primary sequence (padded/truncated to 1024, with start/end tokens)
- Output: Cluster assignment + confidence score (softmax probability)
- Special "unrelated" class populated by random sequences
- QC: Removes non-amino-acid characters, truncates to 1024
- Professional-grade: Publication-ready metrics, plots, and robust code
- Dummy data/test script included
Directory Structure
training/iteration0classifier/
dummy_training_data.csv # Example training data
dummy_headers.txt # Example headers file
dummy_test_sequences.fasta # Example FASTA for prediction
sequence_classifier.pth # Trained model
sequence_classifier_metadata.pkl # Model metadata
evaluation_results.json # Metrics/results
training_history.png # Training/validation loss/accuracy
confusion_matrix.png # Confusion matrix
predictions.csv # Predictions on new FASTA
Input Formats
- Training CSV: Must have columns
sequenceandcluster, plus any experimental columns (optional) - Headers file: List of experimental column names (one per line)
- FASTA: For prediction, standard FASTA format
Usage
1. Train the classifier
python -c "from core.classifier import run_classifier_training; run_classifier_training('training/iteration0classifier/dummy_training_data.csv', 'training/iteration0classifier/dummy_headers.txt')"
2. Predict on new sequences
python -c "from core.classifier import predict_from_fasta; predict_from_fasta('training/iteration0classifier/dummy_test_sequences.fasta')"
3. Run the test suite (dummy data)
python test_classifier.py
Notes
- The model is a 1D CNN with attention, trained with early stopping and cross-validation.
- The "unrelated" class is generated automatically from random sequences.
- All results and plots are saved in
training/iteration0classifier/. - For real data, place your CSV and headers file in the same directory and follow the same commands.
Phylogenetic Analysis
Perform comprehensive phylogenetic analysis comparing evolutionary relationships with protein language model embeddings.
Features
- Multiple sequence alignment using MAFFT
- Phylogenetic tree construction with distance matrix computation
- ESM2 embedding analysis with cosine and dot product similarities
- Correlation analysis between phylogenetic and embedding distances
- Independent processing of multiple FASTA files with
keepseparatemode
Usage
Setup
Place your FASTA files in inputs/phylo/PROJECT/:
inputs/phylo/cazy/
├── GH16_18_sequences_trimmed.fasta
├── GH43_19_sequences_trimmed.fasta
└── GH47_sequences_trimmed.fasta
Configuration
Configure in config.yaml:
### phylo ###
phylo_project_directory: cazy
phylo_keepseparate: true # Process each FASTA independently
Run Analysis
./run.py phylo
Outputs
Results saved in results/phylo/PROJECT/:
Separate mode (when phylo_keepseparate: true):
- Individual directories for each FASTA file
summary_log_PROJECT.txtwith correlation metrics for all files- Per-file phylogenetic trees, distance matrices, and visualizations
Merged mode (when phylo_keepseparate: false):
- Single analysis combining all FASTA files
- Unified phylogenetic tree and analysis
Key Metrics
- Pearson correlation between phylogenetic and embedding distances
- R-squared values for correlation strength
- Spearman correlation for non-linear relationships
- P-values for statistical significance
- Sample sizes for each comparison
Sequence Similarity Analysis
Comprehensive analysis of sequence similarity patterns using multiple metrics and visualization approaches.
Features
- Multiple similarity metrics: Identity, BLOSUM62, PAM250, and more
- Statistical analysis: Distribution analysis and clustering evaluation
- Heatmap visualizations for similarity matrices
- Unit-based processing for organized analysis of multiple sequence groups
Usage
Setup
Place your FASTA files in unit directories under inputs/sequence_similarity/PROJECT/:
inputs/sequence_similarity/cazy/
├── 10_clusters_GH16_18_sequences_trimmed/
│ ├── cluster_0.fasta
│ └── cluster_1.fasta
└── 8_clusters_GH90_sequences_trimmed/
├── cluster_0.fasta
└── cluster_1.fasta
Configuration
### sequence_similarity ###
sequence_similarity_project_directory: cazy
Run Analysis
./run.py sequence_similarity
Outputs
Results in results/sequence_similarity/PROJECT/:
- Individual unit directories with similarity matrices
- Comprehensive heatmaps and distribution plots
- Statistical summaries and CSV data files
- Summary log with per-unit results
K-means Benchmarking
Rigorous assessment of k-means clustering reproducibility with statistical controls and algorithm comparisons.
Features
- Reproducibility assessment across multiple k values and runs
- Statistical controls: Random permutation and random data baselines
- Algorithm comparison: K-means vs Spectral clustering
- Publication-ready visualizations with proper 0-1 scaling
- Independent processing with
keepseparatemode
Usage
Setup
Place FASTA files in inputs/kmeansbenchmark/PROJECT/:
inputs/kmeansbenchmark/cazy/
├── GH16_18_sequences_trimmed.fasta
├── GH43_19_sequences_trimmed.fasta
└── GH47_sequences_trimmed.fasta
Configuration
### kmeansbenchmark ###
kmeansbenchmark_project_directory: cazy
kmeansbenchmark_k_range: [2, 10]
kmeansbenchmark_n_runs: 10
kmeansbenchmark_keepseparate: true # Process each FASTA independently
kmeansbenchmark_run_controls: true
kmeansbenchmark_compare_algorithms: true
Run Analysis
./run.py kmeansbenchmark
Outputs
Results in results/kmeansbenchmark/PROJECT/:
Per FASTA file (when keepseparate: true):
- Stability metrics and reproducibility heatmaps
- Algorithm comparison plots (K-means vs Spectral)
- Control experiment results
- Comprehensive analysis reports
Key Visualizations:
- Stability plots: ARI/NMI stability across k values (0-1 scaled)
- Algorithm comparison: Performance comparison excluding deterministic hierarchical
- Control baselines: Random permutation and random data controls
- Reproducibility heatmaps: Pairwise clustering agreement matrices
Scientific Rigor
- Fair controls: Random permutation control uses different permutations per replicate
- Statistical validation: Multiple metrics (ARI, NMI, Jaccard) with significance testing
- Reproducible results: Seeded random number generation for consistent results
- Publication quality: Professional visualizations with proper scaling and clean layouts
Enhanced Module Features
Universal keepseparate Functionality
Multiple modules now support independent processing of FASTA files:
- plmclustv2: Process each FASTA file separately with individual clustering results
- phylogenetic_analysis: Independent phylogenetic analysis per FASTA file
- kmeansbenchmark: Separate reproducibility assessment for each FASTA file
- biophysical_signals: Unit-based analysis with aggregated feature importance
- sequence_similarity: Unit-based similarity analysis
Auto-trimming in Trim Module
The trim module now supports automatic length threshold selection:
trim_project_directory: cazy
auto_mode: true # Automatically determine length thresholds
Auto-trimming logic:
- Finds most common sequence length (or average if tied)
- Calculates standard deviation of lengths
- Sets thresholds as reference_length ± 1 standard deviation
- Provides detailed reporting of the automatic selection process
Enhanced Biophysical Signals
- Aggregated feature importance plots across all processed units
- Skip violin plots option for faster execution
- Error bar visualization showing feature importance variability
- CSV export of aggregated results for further analysis
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file uht_discovery-0.2.3.tar.gz.
File metadata
- Download URL: uht_discovery-0.2.3.tar.gz
- Upload date:
- Size: 151.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.15
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f61fbb7e4361e817844bed13611e693bdae8660f7c8e03d8d4af768307b2c70c
|
|
| MD5 |
b61711ce8f22183b27451ee13594de47
|
|
| BLAKE2b-256 |
5c8a8aeb7f81a551d6d1e3cbee0e71827c47c95493a2920a51fa9cb573094536
|
File details
Details for the file uht_discovery-0.2.3-py3-none-any.whl.
File metadata
- Download URL: uht_discovery-0.2.3-py3-none-any.whl
- Upload date:
- Size: 155.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.15
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d39862e8582802f1addbda9cf04b6d3152980ddf35695d1437531ec7442eb06e
|
|
| MD5 |
6f5fc8398696f275325d4ef9670f34f8
|
|
| BLAKE2b-256 |
5db6939f9b6a1640d9fd00d17dc1b4a072b739009872bd530645761e6ccabf11
|