Skip to main content

Curated mass spectrometry evidence for MHC ligand data from IEDB and CEDAR

Project description

hitlist

Tests PyPI

Curated mass spectrometry evidence for MHC ligand data from IEDB and CEDAR.

hitlist scans IEDB and CEDAR MHC ligand exports, classifies each observation by biological source context (cancer tissue, healthy tissue, cell line, tumor-adjacent, etc.), maps peptides to source proteins with flanking sequences, and produces data quality reports. PMID-level curation overrides and tissue classifications are stored as YAML data files, not hardcoded Python.

Install

pip install hitlist

Quick start

# Register your IEDB/CEDAR downloads
hitlist data register iedb /path/to/mhc_ligand_full.csv
hitlist data register cedar /path/to/cedar-mhc-ligand-full.csv

# Generate a data quality report
hitlist report
hitlist report --class I --output report.txt
from hitlist.scanner import scan
from hitlist.curation import classify_ms_row, is_cancer_specific
from hitlist.aggregate import aggregate_per_peptide

# Scan for specific peptides
hits = scan(
    peptides={"SLYNTVATL", "GILGFVFTL"},
    iedb_path="mhc_ligand_full.csv",
    mhc_class="I",
)

# Or profile the entire dataset
full = scan(peptides=None, iedb_path="mhc_ligand_full.csv")

# Per-peptide summary with cancer-specific classification
summary = aggregate_per_peptide(hits)

Source classification

Every IEDB/CEDAR mass spec observation is classified into one of these categories:

Category Flag Rule
Cancer src_cancer Tumor tissue, cancer patient biofluids, or non-EBV cell lines
Adjacent to tumor src_adjacent_to_tumor Surgically resected "normal" tissue (per-PMID override)
Activated APC src_activated_apc Monocyte-derived DCs/macrophages with pharmacological activation
Healthy somatic src_healthy_tissue Direct ex vivo, healthy donor, non-reproductive, non-thymic
Healthy thymus src_healthy_thymus Direct ex vivo thymus (expected for CTAs, AIRE-mediated)
Healthy reproductive src_healthy_reproductive Direct ex vivo testis, ovary, etc. (expected for CTAs)
EBV-LCL src_ebv_lcl EBV-transformed B-cell lines
Cell line src_cell_line Any cultured cell line

Key rule: all non-EBV cell lines are classified as cancer-derived, even when IEDB labels them "No immunization". This catches HeLa, THP-1, A549, and other cancer lines used in non-cancer studies.

Cancer-specific = found in cancer AND NOT found in healthy somatic tissue. Thymus, reproductive tissue, adjacent tissue, EBV-LCLs, and activated APCs do NOT disqualify.

PMID curation overrides

Expert per-study overrides are stored in hitlist/data/pmid_overrides.yaml:

- pmid: 29557506
  label: "Neidert 2018  Tübingen/Zurich biobank"
  override: healthy
  tissue_overrides:
    Blood: healthy           # blood bank donors
    Bone Marrow: healthy     # hip arthroplasty
    Colon: adjacent          # Visceral Surgery dept → likely CRC margin
    Kidney: adjacent         # Urology → likely nephrectomy
    Liver: adjacent          # likely HCC/met adjacent

Tissue-level overrides take priority over study-level overrides.

Proteome mapping

Map peptides to source proteins with flanking context:

from hitlist.proteome import ProteomeIndex

# Human proteome (from pyensembl)
idx = ProteomeIndex.from_ensembl(release=112)

# Or combined human + viral
idx = ProteomeIndex.from_ensembl_plus_fastas(
    fasta_paths=["hpv16.fasta", "ebv.fasta"],
)

# Map peptides with 5-residue flanks
df = idx.map_peptides(["SLLMWITQC"], flank=5)
# → protein_id, gene_name, gene_id, position, n_flank, c_flank, n_sources, unique_n_flank, unique_c_flank

Per-sample peptidome context

The full peptidome context for each sample is critical for interpreting whether a peptide's presence is meaningful:

from hitlist.scanner import scan
from hitlist.samples import sample_peptidomes, overlay_targets

# Full scan (ALL peptides, not just targets)
full = scan(peptides=None, iedb_path="mhc_ligand_full.csv", mhc_class="I")

# Per-sample stats
samples = sample_peptidomes(full)

# Overlay CTA peptides for context fractions
# "1 CTA out of 762 peptides = 0.13% = stochastic noise"
context = overlay_targets(full, target_peptides=my_cta_set, label="cta")

Data management

hitlist data available          # show all 14 known datasets
hitlist data fetch hpv16        # auto-download viral proteome from UniProt
hitlist data register iedb /path/to/file  # register manual download
hitlist data list               # show registered datasets with size/date
hitlist data info iedb          # detailed JSON metadata
hitlist data path iedb          # resolve to file path
hitlist data refresh hpv16      # re-download
hitlist data remove iedb        # unregister

Storage: ~/.hitlist/ (override with HITLIST_DATA_DIR env var).

Development

./develop.sh    # install in dev mode
./format.sh     # ruff format
./lint.sh       # ruff check + format check
./test.sh       # pytest with coverage
./deploy.sh     # lint + test + build + upload to PyPI

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hitlist-0.6.0.tar.gz (42.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

hitlist-0.6.0-py3-none-any.whl (41.9 kB view details)

Uploaded Python 3

File details

Details for the file hitlist-0.6.0.tar.gz.

File metadata

  • Download URL: hitlist-0.6.0.tar.gz
  • Upload date:
  • Size: 42.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for hitlist-0.6.0.tar.gz
Algorithm Hash digest
SHA256 0618139a67bf0925258327839731ae5c329f59db490a8bc4af8a7bdabf08467a
MD5 29d9bc55d06613927b6869e6dc814862
BLAKE2b-256 b95b46f372c0ab06a5814051e1466c373eda90c3da89c026d60d7f81319d97fc

See more details on using hashes here.

File details

Details for the file hitlist-0.6.0-py3-none-any.whl.

File metadata

  • Download URL: hitlist-0.6.0-py3-none-any.whl
  • Upload date:
  • Size: 41.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for hitlist-0.6.0-py3-none-any.whl
Algorithm Hash digest
SHA256 1d6e37e54cb6374136881f71e164278512a1928ea0fffbc951ad547a4dfa1694
MD5 a5bdeeeb5b0a1dac5004f89c7df80948
BLAKE2b-256 68592984e3449ee2a08b224d2cad2a72fb1b1146de1345e0fd4ec4e05a13a442

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page