
hf-tcr

HuggingFace-based inference and evaluation library for TCR-pMHC sequence translation models.

Installation

pip install hf-tcr

Or install from source:

git clone https://github.com/pirl-unc/hf-tcr.git
cd hf-tcr
pip install .

Quick Start

Loading Data

from hf_tcr import TCRpMHCdataset

# Create dataset for pMHC -> TCR translation
dataset = TCRpMHCdataset(
    source="pmhc",
    target="tcr",
    use_pseudo=True,
    use_cdr3=True
)

# Load from CSV file
dataset.load_data_from_file("path/to/data.csv")

Running Inference

from hf_tcr import HuggingFaceModelAdapter, TCRBartTokenizer
from transformers import BartForConditionalGeneration

# Load your trained model and tokenizer
tokenizer = TCRBartTokenizer()
model = BartForConditionalGeneration.from_pretrained("path/to/model")

# Create adapter
adapter = HuggingFaceModelAdapter(
    hf_tokenizer=tokenizer,
    hf_model=model,
    device="cuda"
)

# Get a source from your dataset
source = dataset[0][0]  # each example is a (source, target) pair; take the source

# Generate translations
translations = adapter.sample_translations(
    source=source,
    n=10,
    max_len=25,
    mode="top_k",
    top_k=50,
    temperature=1.0
)

Evaluating Models

from hf_tcr import ModelEvaluator

# Create evaluator (extends HuggingFaceModelAdapter)
evaluator = ModelEvaluator(
    hf_tokenizer=tokenizer,
    hf_model=model,
    device="cuda"
)

# Compute dataset-level metrics
metrics = evaluator.dataset_metrics_at_k(
    dataset=dataset,
    k=100,
    max_len=25,
    mode="top_k",
    top_k=50
)

print(f"BLEU: {metrics['char-bleu']:.4f}")
print(f"Precision@100: {metrics['precision']:.4f}")
print(f"Recall@100: {metrics['recall']:.4f}")
print(f"F1@100: {metrics['f1']:.4f}")
print(f"Mean Edit Distance: {metrics['d_edit']:.2f}")
print(f"Sequence Recovery: {metrics['seq_recovery']:.4f}")
print(f"Diversity: {metrics['diversity']:.4f}")
print(f"Perplexity: {metrics['perplexity']:.2f}")
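The `char-bleu` metric printed above scores generated sequences at the character level. As an illustration only (not the library's exact implementation, whose smoothing and n-gram order may differ), a minimal pure-Python character-level BLEU with add-one smoothing looks like this:

```python
import math
from collections import Counter

def char_bleu(hypothesis: str, references: list[str], max_n: int = 4) -> float:
    """Character-level BLEU: n-gram precision over characters with a
    brevity penalty. Add-one smoothing avoids zero precisions."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(hypothesis[i:i + n]
                             for i in range(len(hypothesis) - n + 1))
        # Clip counts against the maximum count seen in any reference.
        max_ref = Counter()
        for ref in references:
            ref_ngrams = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
            for gram, count in ref_ngrams.items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram])
                      for gram, count in hyp_ngrams.items())
        total = sum(hyp_ngrams.values())
        precisions.append((clipped + 1) / (total + 1))
    # Brevity penalty against the closest-length reference.
    closest = min(references, key=lambda r: abs(len(r) - len(hypothesis)))
    bp = 1.0 if len(hypothesis) >= len(closest) else \
        math.exp(1 - len(closest) / len(hypothesis))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

perfect = char_bleu("CASSLGQAYEQYF", ["CASSLGQAYEQYF"])   # identical -> 1.0
partial = char_bleu("CASSLGQAYEQYF", ["CASSPGQAYEQYF"])   # one mismatch -> < 1.0
```

The CDR3β strings here are illustrative placeholders, not data from the package.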

Available Decoding Strategies

The adapter supports multiple decoding strategies:

  • greedy: Deterministic greedy decoding
  • ancestral: Multinomial sampling
  • top_k: Top-k sampling with temperature
  • top_p: Nucleus (top-p) sampling
  • beam: Deterministic beam search
  • stochastic_beam: Stochastic beam search
  • diverse_beam: Diverse beam search
  • contrastive: Contrastive decoding
  • typical: Typical sampling

Metrics

The ModelEvaluator provides the following metrics:

  • Char-BLEU: Character-level BLEU score
  • Precision@K: Fraction of generated sequences that match references
  • Recall@K: Fraction of reference sequences recovered
  • F1@K: Harmonic mean of precision and recall
  • Mean Edit Distance: Average Levenshtein distance to closest reference
  • Sequence Recovery: Position-wise match percentage
  • Diversity: Ratio of unique to total generated sequences
  • Perplexity: Model perplexity on the dataset

Data Format

CSV files should contain the following columns:

Required:

  • CDR3b: CDR3 beta sequence
  • TRBV: TRBV gene (IMGT format)
  • TRBJ: TRBJ gene (IMGT format)
  • Epitope: Peptide sequence
  • Allele: HLA allele
  • Reference: Data source reference

Optional:

  • CDR3a, TRAV, TRAJ, TRAD, TRBD
  • TRA_stitched, TRB_stitched
  • Pseudo, MHC
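A minimal CSV with the required columns can be assembled with the standard library; the values below are made-up placeholders (an illustrative CDR3β, IMGT-style gene names, and the well-known influenza M1 epitope with HLA-A*02:01), not data shipped with the package:

```python
import csv
import io

required = ["CDR3b", "TRBV", "TRBJ", "Epitope", "Allele", "Reference"]
row = {
    "CDR3b": "CASSLGQAYEQYF",    # example CDR3 beta sequence (illustrative)
    "TRBV": "TRBV7-9",           # TRBV gene, IMGT format
    "TRBJ": "TRBJ2-7",           # TRBJ gene, IMGT format
    "Epitope": "GILGFVFTL",      # peptide sequence
    "Allele": "HLA-A*02:01",     # HLA allele
    "Reference": "example",      # data source reference
}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=required)
writer.writeheader()
writer.writerow(row)
csv_text = buf.getvalue()
```

Written to disk, a file like this should be loadable via `dataset.load_data_from_file(...)` as shown in the Quick Start.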

Dependencies

  • torch >= 2.0.0
  • transformers >= 4.30.0
  • numpy, pandas, tqdm
  • python-Levenshtein
  • nltk
  • einops
  • tidytcells >= 2.0.0
  • mhcgnomes >= 1.8.0
  • tcrpmhcdataset >= 0.2.0

License

Apache-2.0
