Semi-automated protein discovery pipeline using BLAST, quality control, and language model clustering

UHT Discovery

UHT Discovery is a protein discovery pipeline centered on three core steps:

  1. blaster: retrieve homologous sequences from NCBI
  2. trim: quality-control and length-filter FASTA sequences
  3. clust (PLMCLUSTV2): cluster sequences using ESM2 embeddings and negative log-likelihood (NLL) scoring

Installation

From Source

git clone <repository-url>
cd uht-discovery-package
pip install -e .

From PyPI

pip install uht-discovery
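
After installing, a quick check can confirm that the package and its three commands are visible in the current environment. The following is a minimal sketch using only the Python standard library; the helper name check_install is hypothetical and not part of uht-discovery.

```python
import shutil
from importlib import metadata

# Console commands named in this README.
COMMANDS = ("uht-blast", "uht-trim", "uht-clust")


def check_install(package="uht-discovery"):
    """Return (installed version or None, {command: path on PATH or None})."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None
    paths = {cmd: shutil.which(cmd) for cmd in COMMANDS}
    return version, paths


version, paths = check_install()
print("installed version:", version or "not found")
for cmd, path in paths.items():
    print(f"{cmd}: {path or 'not on PATH'}")
```

If a command is reported as not on PATH, the install most likely went into a different environment than the active one.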

Quick Start

1. BLASTER

Put one or more query FASTA files in:

inputs/blaster/PROJECT/

Run:

uht-blast --project PROJECT --email your@email.com --hits 500

Key options:

  • --project: project directory name (required)
  • --email: contact email address sent with NCBI API requests (required)
  • --hits: number of hits to retrieve (default: 100)
  • --db: nr, swissprot, or refseq_protein (default: nr)
  • --evalue: BLAST E-value cutoff (default: 1e-5)

Outputs:

  • results/blaster/PROJECT/ with combined FASTA and BLAST report
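
The input layout and options above can also be prepared from a script. This is a sketch under stated assumptions: the helper names stage_query and blast_command, the project name demo, the .fasta extension, and the example sequence are all illustrative, and the command is only built here, not executed.

```python
import tempfile
from pathlib import Path


def stage_query(root, project, name, sequence):
    """Write one query sequence to inputs/blaster/<project>/<name>.fasta."""
    query_dir = Path(root) / "inputs" / "blaster" / project
    query_dir.mkdir(parents=True, exist_ok=True)
    path = query_dir / f"{name}.fasta"  # .fasta extension is an assumption
    path.write_text(f">{name}\n{sequence}\n")
    return path


def blast_command(project, email, hits=100, db="nr", evalue="1e-5"):
    """Build the uht-blast argument list from the options listed above."""
    return [
        "uht-blast",
        "--project", project,
        "--email", email,
        "--hits", str(hits),
        "--db", db,
        "--evalue", str(evalue),
    ]


# Demonstrate the directory layout in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    query = stage_query(root, "demo", "query1", "MKTAYIAKQRQISFVKSHFSRQ")
    print(query.relative_to(root))

cmd = blast_command("demo", "your@email.com", hits=500)
print(" ".join(cmd))
```

To run the search, pass cmd to subprocess.run(cmd, check=True) from the directory that contains inputs/. That working-directory requirement is inferred from the relative paths in this README.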

2. TRIM

Put FASTA files in:

inputs/trim/PROJECT/

Run (automatic thresholds):

uht-trim --project PROJECT --auto

Run (manual thresholds):

uht-trim --project PROJECT --low 100 --high 500

Key options:

  • --project: project directory name (required)
  • --auto: infer length thresholds automatically
  • --low / --high: minimum and maximum sequence length, both inclusive (manual mode)

Outputs:

  • results/trim/PROJECT/ with filtered FASTA files, removed-sequence logs, and QC plots
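
To make the inclusive --low / --high semantics concrete, here is a small standalone sketch of a length filter over FASTA records. It is not the code uht-trim runs; the function names are hypothetical, and the automatic threshold inference behind --auto is not covered because this README does not describe it.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may wrap over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def length_filter(records, low, high):
    """Split records into (kept, removed); both bounds are inclusive."""
    kept, removed = [], []
    for header, seq in records:
        target = kept if low <= len(seq) <= high else removed
        target.append((header, seq))
    return kept, removed


example = ">a\nMKT\n>b\nMKTAYIAK\nQRQ\n>c\nM\n"
kept, removed = length_filter(read_fasta(example), low=3, high=10)
print("kept:", [h for h, _ in kept])
print("removed:", [h for h, _ in removed])
```

Sequence a has length 3 and is kept because the lower bound is inclusive. Sequence b has length 11 and c has length 1, so both fall outside the range.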

3. PLMCLUSTV2

Put FASTA files in:

inputs/plmclustv2/PROJECT/

Run (automatic cluster count):

uht-clust --project PROJECT --clusters auto

Run (fixed cluster count):

uht-clust --project PROJECT --clusters 6

Key options:

  • --project: project directory name (required)
  • --clusters: integer cluster count or auto
  • --sil-min / --sil-max: silhouette search bounds for auto mode
  • --keep-separate: process each FASTA file independently

Outputs:

  • results/plmclustv2/PROJECT/ including cluster FASTAs, metrics CSVs, representative sequences, and visualization files
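
The --sil-min / --sil-max options suggest that auto mode compares candidate clusterings by silhouette score. This README does not document the exact procedure, so the following is only a generic illustration of the metric: toy 2-D points stand in for ESM2 embeddings, and the candidate labelings are written by hand rather than produced by a clustering algorithm.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient of points under the given cluster labels."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        # Mean distance from point i to the members, excluding i itself.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, label in enumerate(labels):
        if len(clusters[label]) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = mean_dist(i, clusters[label])  # cohesion within own cluster
        b = min(mean_dist(i, members)      # separation from nearest other cluster
                for other, members in clusters.items() if other != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Three well-separated pairs of points.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
candidates = {
    2: [0, 0, 1, 1, 1, 1],  # merges two of the groups
    3: [0, 0, 1, 1, 2, 2],  # matches the visible structure
}
best_k = max(candidates, key=lambda k: silhouette(points, candidates[k]))
print("best cluster count:", best_k)
```

A higher mean silhouette means points sit closer to their own cluster than to the nearest other cluster, which is why the three-cluster labeling wins on this toy data.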

Recommended Workflow

Run the pipeline in this order:

  1. uht-blast
  2. uht-trim
  3. uht-clust
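
Each step reads from inputs/<step>/PROJECT/ and writes to results/<step>/PROJECT/, so FASTA output from one step has to reach the next step's input directory. This README does not say whether the tools do that automatically. Assuming they do not, a hand-off could look like the sketch below; the function name hand_off and the set of FASTA extensions are assumptions.

```python
import shutil
import tempfile
from pathlib import Path

FASTA_SUFFIXES = {".fasta", ".fa", ".faa"}  # assumed extensions


def hand_off(root, project, src_step, dst_step):
    """Copy FASTA files from results/<src_step>/<project> to inputs/<dst_step>/<project>."""
    src = Path(root) / "results" / src_step / project
    dst = Path(root) / "inputs" / dst_step / project
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.glob("*")):
        if path.is_file() and path.suffix.lower() in FASTA_SUFFIXES:
            shutil.copy2(path, dst / path.name)
            copied.append(path.name)
    return copied


# Demonstrate on a throwaway directory tree with made-up file names.
with tempfile.TemporaryDirectory() as root:
    out_dir = Path(root) / "results" / "blaster" / "demo"
    out_dir.mkdir(parents=True)
    (out_dir / "hits.fasta").write_text(">hit1\nMKTAYIAK\n")
    (out_dir / "report.txt").write_text("not a FASTA file\n")
    print(hand_off(root, "demo", "blaster", "trim"))
```

The same call with ("trim", "plmclustv2") would move filtered sequences on to the clustering step.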

Help

uht-blast --help
uht-trim --help
uht-clust --help
