hitlist

Curated mass spectrometry evidence for MHC ligand data from IEDB and CEDAR.

hitlist scans IEDB and CEDAR MHC ligand exports, classifies each observation by biological source context (cancer tissue, healthy tissue, cell line, tumor-adjacent, etc.), maps peptides to source proteins with flanking sequences, and produces data quality reports. PMID-level curation overrides and tissue classifications are stored as YAML data files rather than hardcoded in Python.

Install

pip install hitlist

Quick start

# Register your IEDB/CEDAR downloads
hitlist data register iedb /path/to/mhc_ligand_full.csv
hitlist data register cedar /path/to/cedar-mhc-ligand-full.csv

# Generate a data quality report
hitlist report
hitlist report --class I --output report.txt

from hitlist.scanner import scan
from hitlist.curation import classify_ms_row, is_cancer_specific
from hitlist.aggregate import aggregate_per_peptide

# Scan for specific peptides
hits = scan(
    peptides={"SLYNTVATL", "GILGFVFTL"},
    iedb_path="mhc_ligand_full.csv",
    mhc_class="I",
)

# Or profile the entire dataset
full = scan(peptides=None, iedb_path="mhc_ligand_full.csv")

# Per-peptide summary with cancer-specific classification
summary = aggregate_per_peptide(hits)

Source classification

Every IEDB/CEDAR mass spec observation is classified into one of these categories:

| Category | Flag | Rule |
|---|---|---|
| Cancer | src_cancer | Tumor tissue, cancer patient biofluids, or non-EBV cell lines |
| Adjacent to tumor | src_adjacent_to_tumor | Surgically resected "normal" tissue (per-PMID override) |
| Activated APC | src_activated_apc | Monocyte-derived DCs/macrophages with pharmacological activation |
| Healthy somatic | src_healthy_tissue | Direct ex vivo, healthy donor, non-reproductive, non-thymic |
| Healthy thymus | src_healthy_thymus | Direct ex vivo thymus (expected for CTAs, AIRE-mediated) |
| Healthy reproductive | src_healthy_reproductive | Direct ex vivo testis, ovary, etc. (expected for CTAs) |
| EBV-LCL | src_ebv_lcl | EBV-transformed B-cell lines |
| Cell line | src_cell_line | Any cultured cell line |

Key rule: all non-EBV cell lines are classified as cancer-derived, even when IEDB labels them "No immunization". This catches HeLa, THP-1, A549, and other cancer lines used in non-cancer studies.
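
The key rule can be read as a small truth table over two inputs. The sketch below is illustrative only: `cell_line_flags` and its two boolean parameters are hypothetical names, not part of hitlist's API, and it covers only the cell-line part of the classification (tumor tissue and biofluids would also set src_cancer).

```python
def cell_line_flags(is_cell_line: bool, is_ebv_transformed: bool) -> dict[str, bool]:
    """Hypothetical helper: derive the cell-line related flags for one observation.

    Assumes the caller has already decided whether the sample is a cultured
    cell line and whether it is an EBV-transformed B-cell line.
    """
    return {
        # Any cultured cell line, regardless of origin
        "src_cell_line": is_cell_line,
        # EBV-transformed B-cell lines are tracked separately
        "src_ebv_lcl": is_cell_line and is_ebv_transformed,
        # Every other cell line is treated as cancer-derived
        "src_cancer": is_cell_line and not is_ebv_transformed,
    }


# A HeLa-like line used in a non-cancer study still counts as cancer-derived
print(cell_line_flags(is_cell_line=True, is_ebv_transformed=False))
```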

Cancer-specific = found in cancer AND NOT found in healthy somatic tissue. Thymus, reproductive tissue, adjacent tissue, EBV-LCLs, and activated APCs do NOT disqualify.
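
As a rough illustration of this definition, the following sketch evaluates the rule over a list of per-observation flag dictionaries. The function name `is_cancer_specific_sketch` and the dictionary input shape are assumptions made for this example; they are not the signature of `hitlist.curation.is_cancer_specific`. Only the flag names come from the table above.

```python
from typing import Iterable, Mapping


def is_cancer_specific_sketch(observations: Iterable[Mapping[str, bool]]) -> bool:
    """Hypothetical helper: apply the cancer-specific rule to one peptide.

    True when at least one observation carries src_cancer and no observation
    carries src_healthy_tissue. Other flags (thymus, reproductive, adjacent,
    EBV-LCL, activated APC) are ignored, so they never disqualify.
    """
    found_in_cancer = False
    for obs in observations:
        if obs.get("src_healthy_tissue", False):
            return False  # a single healthy somatic hit disqualifies
        if obs.get("src_cancer", False):
            found_in_cancer = True
    return found_in_cancer


# Seen in tumor and in thymus: still cancer-specific under this rule
print(is_cancer_specific_sketch([{"src_cancer": True}, {"src_healthy_thymus": True}]))
```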

PMID curation overrides

Expert per-study overrides are stored in hitlist/data/pmid_overrides.yaml:

- pmid: 29557506
  label: "Neidert 2018  Tübingen/Zurich biobank"
  override: healthy
  tissue_overrides:
    Blood: healthy           # blood bank donors
    Bone Marrow: healthy     # hip arthroplasty
    Colon: adjacent          # Visceral Surgery dept → likely CRC margin
    Kidney: adjacent         # Urology → likely nephrectomy
    Liver: adjacent          # likely HCC/met adjacent

Tissue-level overrides take priority over study-level overrides.
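
The priority order can be sketched as a two-step lookup. This is a minimal illustration, assuming an override entry has already been loaded into a plain dict with the keys shown in the YAML above; `resolve_override` is a hypothetical helper, not a function exported by hitlist.

```python
from typing import Optional


def resolve_override(entry: dict, tissue: str) -> Optional[str]:
    """Hypothetical helper: pick the label that applies to one tissue.

    A tissue-level override wins when present; otherwise the study-level
    override is used; None means the entry has no opinion.
    """
    tissue_overrides = entry.get("tissue_overrides") or {}
    if tissue in tissue_overrides:
        return tissue_overrides[tissue]
    return entry.get("override")


# The entry from the YAML example above, written out as a dict
neidert = {
    "pmid": 29557506,
    "override": "healthy",
    "tissue_overrides": {
        "Blood": "healthy",
        "Bone Marrow": "healthy",
        "Colon": "adjacent",
        "Kidney": "adjacent",
        "Liver": "adjacent",
    },
}

print(resolve_override(neidert, "Colon"))  # tissue-level override applies
print(resolve_override(neidert, "Lung"))   # falls back to the study-level override
```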

Proteome mapping

Map peptides to source proteins with flanking context:

from hitlist.proteome import ProteomeIndex

# Human proteome (from pyensembl)
idx = ProteomeIndex.from_ensembl(release=112)

# Or combined human + viral
idx = ProteomeIndex.from_ensembl_plus_fastas(
    fasta_paths=["hpv16.fasta", "ebv.fasta"],
)

# Map peptides with 5-residue flanks
df = idx.map_peptides(["SLLMWITQC"], flank=5)
# → protein_id, gene_name, gene_id, position, n_flank, c_flank, n_sources, unique_n_flank, unique_c_flank
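
To make "flanking context" concrete, here is a standalone sketch of the idea using plain string search. It does not use ProteomeIndex and does not claim to reproduce its behavior: `find_with_flanks` is a hypothetical helper, the 0-based `position` convention is an assumption, and the protein sequence is a made-up toy string.

```python
def find_with_flanks(peptide: str, protein: str, flank: int = 5) -> list[dict]:
    """Hypothetical helper: locate every occurrence of a peptide in a protein.

    Returns the 0-based start position plus up to `flank` residues on each
    side. Flanks are shorter when the peptide sits near a protein terminus.
    """
    hits = []
    start = protein.find(peptide)
    while start != -1:
        end = start + len(peptide)
        hits.append(
            {
                "position": start,
                "n_flank": protein[max(0, start - flank):start],
                "c_flank": protein[end:end + flank],
            }
        )
        # Continue one residue further so overlapping occurrences are found
        start = protein.find(peptide, start + 1)
    return hits


toy_protein = "AAAAAAASLLMWITQCGGGGGGG"  # toy sequence, not a real protein
print(find_with_flanks("SLLMWITQC", toy_protein, flank=5))
```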

Per-sample peptidome context

The full peptidome context for each sample is critical for interpreting whether a peptide's presence is meaningful:

from hitlist.scanner import scan
from hitlist.samples import sample_peptidomes, overlay_targets

# Full scan (ALL peptides, not just targets)
full = scan(peptides=None, iedb_path="mhc_ligand_full.csv", mhc_class="I")

# Per-sample stats
samples = sample_peptidomes(full)

# Overlay CTA peptides for context fractions
# "1 CTA out of 762 peptides = 0.13% = stochastic noise"
# my_cta_set is a user-supplied set of CTA peptide sequences
context = overlay_targets(full, target_peptides=my_cta_set, label="cta")
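
The arithmetic behind the "1 CTA out of 762 peptides" comment can be sketched independently of hitlist. The helper `target_fraction` below is hypothetical and operates on plain sets of peptide strings with placeholder names; it is not how `overlay_targets` is implemented.

```python
def target_fraction(sample_peptides: set[str], target_peptides: set[str]) -> float:
    """Hypothetical helper: fraction of a sample's peptidome that is in the target set."""
    if not sample_peptides:
        return 0.0
    return len(sample_peptides & target_peptides) / len(sample_peptides)


# A toy sample with 762 distinct peptides, one of which is a target
sample = {f"PEP{i}" for i in range(761)} | {"SLLMWITQC"}
targets = {"SLLMWITQC", "SOMEOTHERPEPTIDE"}

fraction = target_fraction(sample, targets)
print(f"{fraction:.2%}")  # 0.13%
```

A single target hit in a large peptidome yields a very small fraction, which is why the per-sample denominator matters when deciding whether a detection is meaningful.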

Data management

hitlist data available          # show all 14 known datasets
hitlist data fetch hpv16        # auto-download viral proteome from UniProt
hitlist data register iedb /path/to/file  # register manual download
hitlist data list               # show registered datasets with size/date
hitlist data info iedb          # detailed JSON metadata
hitlist data path iedb          # resolve to file path
hitlist data refresh hpv16      # re-download
hitlist data remove iedb        # unregister

Storage: ~/.hitlist/ (override with HITLIST_DATA_DIR env var).
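
For scripts that need to find the same directory, the lookup could be mirrored as below. This is a sketch under stated assumptions: `resolve_data_dir` is a hypothetical helper, and treating an empty HITLIST_DATA_DIR as unset is a choice made here, not documented hitlist behavior.

```python
import os
from pathlib import Path


def resolve_data_dir() -> Path:
    """Hypothetical helper: return HITLIST_DATA_DIR if set, else ~/.hitlist."""
    override = os.environ.get("HITLIST_DATA_DIR")
    if override:  # assumption: an empty value counts as unset
        return Path(override).expanduser()
    return Path.home() / ".hitlist"


print(resolve_data_dir())
```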

Development

./develop.sh    # install in dev mode
./format.sh     # ruff format
./lint.sh       # ruff check + format check
./test.sh       # pytest with coverage
./deploy.sh     # lint + test + build + upload to PyPI
