Semi-automated protein discovery pipeline using BLAST, quality control, and language model clustering
UHT Discovery
UHT Discovery is a protein discovery pipeline centered on three core steps:
- blaster: retrieve homologous sequences from NCBI
- trim: quality-control and length-filter FASTA sequences
- clust (PLMCLUSTV2): cluster sequences using ESM2 embeddings and NLL scoring
Installation
From Source
git clone <repository-url>
cd uht-discovery-package
pip install -e .
From PyPI
pip install uht-discovery
Quick Start
1. BLASTER
Put one or more query FASTA files in:
inputs/blaster/PROJECT/
Run:
uht-blast --project PROJECT --email your@email.com --hits 500
Key options:
- --project: project directory name (required)
- --email: email for NCBI API usage (required)
- --hits: number of hits to retrieve (default: 100)
- --db: nr, swissprot, or refseq_protein (default: nr)
- --evalue: BLAST E-value cutoff (default: 1e-5)
Outputs:
results/blaster/PROJECT/ with combined FASTA and BLAST report
2. TRIM
Put FASTA files in:
inputs/trim/PROJECT/
Run (automatic thresholds):
uht-trim --project PROJECT --auto
Run (manual thresholds):
uht-trim --project PROJECT --low 100 --high 500
Key options:
- --project: project directory name (required)
- --auto: infer thresholds automatically
- --low / --high: manual inclusive length thresholds
Outputs:
results/trim/PROJECT/ with filtered FASTA files, removed-sequence logs, and QC plots
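To make the manual thresholds concrete, here is a minimal sketch of an inclusive length filter over FASTA records. It is not the package's implementation: `parse_fasta` and `length_filter` are hypothetical helper names, and the only behaviour taken from the options above is that `--low` and `--high` are inclusive bounds.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header, sequence)


def parse_fasta(text: str) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def length_filter(
    records: Iterable[Record], low: int, high: int
) -> Tuple[List[Record], List[Record]]:
    """Split records into (kept, removed) using inclusive length bounds."""
    kept, removed = [], []
    for header, seq in records:
        target = kept if low <= len(seq) <= high else removed
        target.append((header, seq))
    return kept, removed


# Toy input: sequence lengths are 3, 10, and 22 residues.
example = """>short
MKT
>ok
MKTAYIAKQR
>long
MKTAYIAKQRQISFVKSHFSRQ
"""
kept, removed = length_filter(parse_fasta(example), low=5, high=15)
print([h for h, _ in kept])     # headers whose length falls within [5, 15]
print([h for h, _ in removed])  # headers outside the bounds
```

Returning the removed records alongside the kept ones mirrors the removed-sequence logs listed under Outputs; how `--auto` infers its thresholds is not described here, so the sketch covers only the manual case.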
3. PLMCLUSTV2
Put FASTA files in:
inputs/plmclustv2/PROJECT/
Run:
uht-clust --project PROJECT --clusters auto
Or fixed cluster count:
uht-clust --project PROJECT --clusters 6
Key options:
- --project: project directory name (required)
- --clusters: integer cluster count or auto
- --sil-min / --sil-max: silhouette search bounds for auto mode
- --keep-separate: process each FASTA file independently
Outputs:
results/plmclustv2/PROJECT/ including cluster FASTAs, metrics CSVs, representative sequences, and visualization files
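The following sketch illustrates one way a silhouette-based search for the cluster count could work. It rests on several assumptions that are not confirmed by this README: that `--sil-min` and `--sil-max` bound the candidate cluster counts, that the highest mean silhouette score wins, and that a plain k-means is an acceptable stand-in for the tool's clustering. Toy 2-D points replace the ESM2 embeddings, NLL scoring is not modelled, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Sequence[float]


def kmeans(points: List[Point], k: int, iters: int = 50) -> List[int]:
    """Tiny deterministic k-means: farthest-point init, then Lloyd updates."""
    centers = [tuple(points[0])]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        centers.append(tuple(max(
            points, key=lambda p: min(math.dist(p, c) for c in centers))))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def silhouette(points: List[Point], labels: List[int]) -> float:
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    clusters: Dict[int, List[int]] = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        return -1.0  # silhouette is undefined for a single cluster
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        a = sum(math.dist(points[i], points[j])
                for j in own if j != i) / (len(own) - 1)
        b = min(sum(math.dist(points[i], points[j]) for j in other) / len(other)
                for m, other in clusters.items() if m != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def choose_k(points: List[Point], k_min: int,
             k_max: int) -> Tuple[int, Dict[int, float]]:
    """Score every k in [k_min, k_max] and return the best k with all scores."""
    scores = {k: silhouette(points, kmeans(points, k))
              for k in range(k_min, k_max + 1)}
    return max(scores, key=scores.get), scores


# Toy "embeddings": three well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (5, 10), (5, 11), (6, 10), (6, 11)]
best_k, scores = choose_k(points, k_min=2, k_max=5)
print(best_k)  # 3 for this toy data
```

Real ESM2 embeddings are high-dimensional, so a production version would more likely rely on a numerical library than on pure-Python loops; the pure-stdlib form is used here only to keep the example self-contained.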
Recommended Workflow
Run the pipeline in this order:
1. uht-blast
2. uht-trim
3. uht-clust
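The order above can be scripted. The sketch below chains the three commands and copies FASTA files from one step's results directory into the next step's inputs directory. Several things here are assumptions rather than documented behaviour: that the tools resolve inputs/ and results/ relative to the working directory, that files must be moved between steps by hand, and that FASTA files carry one of the extensions in `FASTA_SUFFIXES`. The helper names are hypothetical; only flags shown in the sections above are used.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List

# Assumption: file extensions treated as FASTA when staging between steps.
FASTA_SUFFIXES = {".fasta", ".fa", ".faa"}


def stage_fasta(src_dir, dst_dir) -> List[str]:
    """Copy FASTA files from src_dir into dst_dir; return the copied names."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.iterdir()):
        if path.is_file() and path.suffix.lower() in FASTA_SUFFIXES:
            shutil.copy2(path, dst / path.name)
            copied.append(path.name)
    return copied


def pipeline_commands(project: str, email: str) -> List[List[str]]:
    """Return the three CLI invocations in the recommended order."""
    return [
        ["uht-blast", "--project", project, "--email", email],
        ["uht-trim", "--project", project, "--auto"],
        ["uht-clust", "--project", project, "--clusters", "auto"],
    ]


def run_pipeline(project: str, email: str, root=".") -> None:
    """Run blast, trim, and clust in order, staging files between steps."""
    root = Path(root)
    blast, trim, clust = pipeline_commands(project, email)
    subprocess.run(blast, check=True, cwd=root)
    stage_fasta(root / "results" / "blaster" / project,
                root / "inputs" / "trim" / project)
    subprocess.run(trim, check=True, cwd=root)
    stage_fasta(root / "results" / "trim" / project,
                root / "inputs" / "plmclustv2" / project)
    subprocess.run(clust, check=True, cwd=root)
```

Calling `run_pipeline("PROJECT", "your@email.com")` from the directory that holds inputs/ would run all three steps; `check=True` stops the chain as soon as one command fails. Inspect results/trim/PROJECT/ before staging if removed-sequence logs there are also written as FASTA, since this helper copies every file with a matching extension.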
Help
uht-blast --help
uht-trim --help
uht-clust --help