
hf-tcr

HuggingFace-based inference and evaluation library for TCR-pMHC sequence translation models.

Installation

pip install hf-tcr

Or install from source:

git clone https://github.com/pirl-unc/hf-tcr.git
cd hf-tcr
pip install .

Quick Start

Loading Data

from hf_tcr import TCRpMHCdataset

# Create dataset for pMHC -> TCR translation
dataset = TCRpMHCdataset(
    source="pmhc",
    target="tcr",
    use_pseudo=True,
    use_cdr3=True
)

# Load from CSV file
dataset.load_data_from_file("path/to/data.csv")

Running Inference

from hf_tcr import HuggingFaceModelAdapter, TCRBartTokenizer
from transformers import BartForConditionalGeneration

# Load your trained model and tokenizer
tokenizer = TCRBartTokenizer()
model = BartForConditionalGeneration.from_pretrained("path/to/model")

# Create adapter
adapter = HuggingFaceModelAdapter(
    hf_tokenizer=tokenizer,
    hf_model=model,
    device="cuda"
)

# Get a source from your dataset
source = dataset[0][0]  # Get source from first example

# Generate translations
translations = adapter.sample_translations(
    source=source,
    n=10,
    max_len=25,
    mode="top_k",
    top_k=50,
    temperature=1.0
)

Evaluating Models

from hf_tcr import ModelEvaluator

# Create evaluator (extends HuggingFaceModelAdapter)
evaluator = ModelEvaluator(
    hf_tokenizer=tokenizer,
    hf_model=model,
    device="cuda"
)

# Compute dataset-level metrics
metrics = evaluator.dataset_metrics_at_k(
    dataset=dataset,
    k=100,
    max_len=25,
    mode="top_k",
    top_k=50
)

print(f"BLEU: {metrics['char-bleu']:.4f}")
print(f"Precision@100: {metrics['precision']:.4f}")
print(f"Recall@100: {metrics['recall']:.4f}")
print(f"F1@100: {metrics['f1']:.4f}")
print(f"Mean Edit Distance: {metrics['d_edit']:.2f}")
print(f"Sequence Recovery: {metrics['seq_recovery']:.4f}")
print(f"Diversity: {metrics['diversity']:.4f}")
print(f"Perplexity: {metrics['perplexity']:.2f}")

Available Decoding Strategies

The adapter supports multiple decoding strategies, selected via the mode argument of sample_translations:

  • greedy: Deterministic greedy decoding
  • ancestral: Multinomial sampling
  • top_k: Top-k sampling with temperature
  • top_p: Nucleus (top-p) sampling
  • beam: Deterministic beam search
  • stochastic_beam: Stochastic beam search
  • diverse_beam: Diverse beam search
  • contrastive: Contrastive decoding
  • typical: Typical sampling
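
To make the difference between the two most common sampling modes concrete, here is a minimal, library-independent sketch of top-k and nucleus (top-p) filtering over a toy token distribution. This illustrates the general technique only; it is not hf-tcr's internal implementation, and the function names are hypothetical.

```python
def filter_top_k(probs, k):
    """Keep the k highest-probability tokens and renormalize."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return {i: probs[i] / total for i in top}

def filter_top_p(probs, p):
    """Keep the smallest high-probability set whose cumulative mass >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Toy next-token distribution over 5 tokens
probs = [0.5, 0.25, 0.15, 0.07, 0.03]
print(filter_top_k(probs, 2))    # tokens 0 and 1, renormalized
print(filter_top_p(probs, 0.9))  # tokens 0, 1, 2 (cumulative mass 0.90)
```

After filtering, a token is drawn from the renormalized distribution; the temperature parameter in sample_translations flattens or sharpens the distribution before this filtering step.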

Metrics

The ModelEvaluator provides the following metrics:

  • Char-BLEU: Character-level BLEU score
  • Precision@K: Fraction of generated sequences that match references
  • Recall@K: Fraction of reference sequences recovered
  • F1@K: Harmonic mean of precision and recall
  • Mean Edit Distance: Average Levenshtein distance to closest reference
  • Sequence Recovery: Position-wise match percentage
  • Diversity: Ratio of unique to total generated sequences
  • Perplexity: Model perplexity on the dataset
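
For intuition, several of these metrics can be sketched in a few lines of plain Python. The exact definitions used by ModelEvaluator may differ (e.g. how unequal lengths are handled in sequence recovery, or whether precision/recall are computed over unique sequences); the versions below are illustrative assumptions.

```python
def diversity(generated):
    """Ratio of unique to total generated sequences."""
    return len(set(generated)) / len(generated)

def seq_recovery(pred, ref):
    """Position-wise match fraction (here: over the longer length)."""
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches / max(len(pred), len(ref))

def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def precision_recall_f1(generated, references):
    """Exact-match overlap between unique generated and reference sets."""
    gen, ref = set(generated), set(references)
    hits = gen & ref
    p, r = len(hits) / len(gen), len(hits) / len(ref)
    f1 = 2 * p * r / (p + r) if hits else 0.0
    return p, r, f1
```

For example, diversity(["CASS", "CASS", "CAST"]) is 2/3, and edit_distance("CASS", "CAST") is 1.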

Data Format

CSV files should contain the following columns:

Required:

  • CDR3b: CDR3 beta sequence
  • TRBV: TRBV gene (IMGT format)
  • TRBJ: TRBJ gene (IMGT format)
  • Epitope: Peptide sequence
  • Allele: HLA allele
  • Reference: Data source reference

Optional:

  • CDR3a, TRAV, TRAJ, TRAD, TRBD
  • TRA_stitched, TRB_stitched
  • Pseudo, MHC
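
A minimal input file with the required columns can be written with the standard library. The values below are illustrative placeholders, not data shipped with hf-tcr:

```python
import csv

# One example row with the six required columns
row = {
    "CDR3b": "CASSLGQAYEQYF",
    "TRBV": "TRBV7-9*01",
    "TRBJ": "TRBJ2-7*01",
    "Epitope": "GILGFVFTL",
    "Allele": "HLA-A*02:01",
    "Reference": "example",
}

with open("data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(row))
    writer.writeheader()
    writer.writerow(row)
```

A file like this can then be passed to dataset.load_data_from_file("data.csv"); optional columns such as CDR3a or Pseudo are simply added as extra fields.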

Dependencies

  • torch >= 2.0.0
  • transformers >= 4.30.0
  • numpy, pandas, tqdm
  • python-Levenshtein
  • nltk
  • einops
  • tidytcells >= 2.0.0
  • mhcgnomes >= 1.8.0
  • tcrpmhcdataset >= 0.2.0

License

Apache-2.0
