Skip to main content

Curated mass spectrometry evidence for MHC ligand data from IEDB and CEDAR

Project description

hitlist

Tests PyPI

Curated mass spectrometry evidence for MHC ligand data from IEDB and CEDAR.

hitlist scans IEDB and CEDAR MHC ligand exports, classifies each observation by biological source context (cancer tissue, healthy tissue, cell line, tumor-adjacent, etc.), maps peptides to source proteins with flanking sequences, and produces data quality reports. PMID-level curation overrides and tissue classifications are stored as YAML data files, not hardcoded Python.

Install

pip install hitlist

Quick start

# Register your IEDB/CEDAR downloads
hitlist data register iedb /path/to/mhc_ligand_full.csv
hitlist data register cedar /path/to/cedar-mhc-ligand-full.csv

# Generate a data quality report
hitlist report
hitlist report --class I --output report.txt
from hitlist.scanner import scan
from hitlist.curation import classify_ms_row, is_cancer_specific
from hitlist.aggregate import aggregate_per_peptide

# Scan for specific peptides
hits = scan(
    peptides={"SLYNTVATL", "GILGFVFTL"},
    iedb_path="mhc_ligand_full.csv",
    mhc_class="I",
)

# Or profile the entire dataset
full = scan(peptides=None, iedb_path="mhc_ligand_full.csv")

# Per-peptide summary with cancer-specific classification
summary = aggregate_per_peptide(hits)

Source classification

Every IEDB/CEDAR mass spec observation is classified into one of these categories:

Category Flag Rule
Cancer src_cancer Tumor tissue, cancer patient biofluids, or non-EBV cell lines
Adjacent to tumor src_adjacent_to_tumor Surgically resected "normal" tissue (per-PMID override)
Activated APC src_activated_apc Monocyte-derived DCs/macrophages with pharmacological activation
Healthy somatic src_healthy_tissue Direct ex vivo, healthy donor, non-reproductive, non-thymic
Healthy thymus src_healthy_thymus Direct ex vivo thymus (expected for CTAs, AIRE-mediated)
Healthy reproductive src_healthy_reproductive Direct ex vivo testis, ovary, etc. (expected for CTAs)
EBV-LCL src_ebv_lcl EBV-transformed B-cell lines
Cell line src_cell_line Any cultured cell line

Key rule: all non-EBV cell lines are classified as cancer-derived, even when IEDB labels them "No immunization". This catches HeLa, THP-1, A549, and other cancer lines used in non-cancer studies.

Cancer-specific = found in cancer AND NOT found in healthy somatic tissue. Thymus, reproductive tissue, adjacent tissue, EBV-LCLs, and activated APCs do NOT disqualify.

PMID curation overrides

Expert per-study overrides are stored in hitlist/data/pmid_overrides.yaml:

- pmid: 29557506
  label: "Neidert 2018  Tübingen/Zurich biobank"
  override: healthy
  tissue_overrides:
    Blood: healthy           # blood bank donors
    Bone Marrow: healthy     # hip arthroplasty
    Colon: adjacent          # Visceral Surgery dept → likely CRC margin
    Kidney: adjacent         # Urology → likely nephrectomy
    Liver: adjacent          # likely HCC/met adjacent

Tissue-level overrides take priority over study-level overrides.

Proteome mapping

Map peptides to source proteins with flanking context:

from hitlist.proteome import ProteomeIndex

# Human proteome (from pyensembl)
idx = ProteomeIndex.from_ensembl(release=112)

# Or combined human + viral
idx = ProteomeIndex.from_ensembl_plus_fastas(
    fasta_paths=["hpv16.fasta", "ebv.fasta"],
)

# Map peptides with 5-residue flanks
df = idx.map_peptides(["SLLMWITQC"], flank=5)
# → protein_id, gene_name, gene_id, position, n_flank, c_flank, n_sources, unique_n_flank, unique_c_flank

Per-sample peptidome context

The full peptidome context for each sample is critical for interpreting whether a peptide's presence is meaningful:

from hitlist.scanner import scan
from hitlist.samples import sample_peptidomes, overlay_targets

# Full scan (ALL peptides, not just targets)
full = scan(peptides=None, iedb_path="mhc_ligand_full.csv", mhc_class="I")

# Per-sample stats
samples = sample_peptidomes(full)

# Overlay CTA peptides for context fractions
# "1 CTA out of 762 peptides = 0.13% = stochastic noise"
context = overlay_targets(full, target_peptides=my_cta_set, label="cta")

Data management

hitlist data available          # show all 14 known datasets
hitlist data fetch hpv16        # auto-download viral proteome from UniProt
hitlist data register iedb /path/to/file  # register manual download
hitlist data list               # show registered datasets with size/date
hitlist data info iedb          # detailed JSON metadata
hitlist data path iedb          # resolve to file path
hitlist data refresh hpv16      # re-download
hitlist data remove iedb        # unregister

Storage: ~/.hitlist/ (override with HITLIST_DATA_DIR env var).

Development

./develop.sh    # install in dev mode
./format.sh     # ruff format
./lint.sh       # ruff check + format check
./test.sh       # pytest with coverage
./deploy.sh     # lint + test + build + upload to PyPI

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hitlist-0.8.0.tar.gz (44.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

hitlist-0.8.0-py3-none-any.whl (43.4 kB view details)

Uploaded Python 3

File details

Details for the file hitlist-0.8.0.tar.gz.

File metadata

  • Download URL: hitlist-0.8.0.tar.gz
  • Upload date:
  • Size: 44.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for hitlist-0.8.0.tar.gz
Algorithm Hash digest
SHA256 8acf4fe0095f5d226ec520bf50402f43c905bf9fb0ca96d908b3c63858ecb0b5
MD5 ce6150a0263733918572241698e7bda7
BLAKE2b-256 5e2dd799385dc4b8afea24081a18941e9060694b1ac52af5097f0e075f965ff8

See more details on using hashes here.

File details

Details for the file hitlist-0.8.0-py3-none-any.whl.

File metadata

  • Download URL: hitlist-0.8.0-py3-none-any.whl
  • Upload date:
  • Size: 43.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for hitlist-0.8.0-py3-none-any.whl
Algorithm Hash digest
SHA256 de0e51c550db6e183b94283b0626326c13f0d3311fe70ea1ed15a5f6e0215607
MD5 70c64c0953649ff4fc6d114db288e08f
BLAKE2b-256 05a1be97059637099f1a7d5fa6bb1f23d80cef800d229c3c18c46d14ff58cb7a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page