Curated mass spectrometry evidence for MHC ligand data from IEDB and CEDAR
hitlist
hitlist scans IEDB and CEDAR MHC ligand exports, classifies each observation by biological source context (cancer tissue, healthy tissue, cell line, tumor-adjacent, etc.), maps peptides to source proteins with flanking sequences, and produces data quality reports. PMID-level curation overrides and tissue classifications are stored as YAML data files, not hardcoded Python.
Install
pip install hitlist
Quick start
# Register your IEDB/CEDAR downloads
hitlist data register iedb /path/to/mhc_ligand_full.csv
hitlist data register cedar /path/to/cedar-mhc-ligand-full.csv
# Generate a data quality report
hitlist report
hitlist report --class I --output report.txt
from hitlist.scanner import scan
from hitlist.curation import classify_ms_row, is_cancer_specific
from hitlist.aggregate import aggregate_per_peptide
# Scan for specific peptides
hits = scan(
    peptides={"SLYNTVATL", "GILGFVFTL"},
    iedb_path="mhc_ligand_full.csv",
    mhc_class="I",
)
# Or profile the entire dataset
full = scan(peptides=None, iedb_path="mhc_ligand_full.csv")
# Per-peptide summary with cancer-specific classification
summary = aggregate_per_peptide(hits)
Source classification
Every IEDB/CEDAR mass spec observation is classified into one of these categories:
| Category | Flag | Rule |
|---|---|---|
| Cancer | src_cancer | Tumor tissue, cancer patient biofluids, or non-EBV cell lines |
| Adjacent to tumor | src_adjacent_to_tumor | Surgically resected "normal" tissue (per-PMID override) |
| Activated APC | src_activated_apc | Monocyte-derived DCs/macrophages with pharmacological activation |
| Healthy somatic | src_healthy_tissue | Direct ex vivo, healthy donor, non-reproductive, non-thymic |
| Healthy thymus | src_healthy_thymus | Direct ex vivo thymus (expected for CTAs, AIRE-mediated) |
| Healthy reproductive | src_healthy_reproductive | Direct ex vivo testis, ovary, etc. (expected for CTAs) |
| EBV-LCL | src_ebv_lcl | EBV-transformed B-cell lines |
| Cell line | src_cell_line | Any cultured cell line |
Key rule: all non-EBV cell lines are classified as cancer-derived, even when IEDB labels them "No immunization". This catches HeLa, THP-1, A549, and other cancer lines used in non-cancer studies.
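The key rule can be sketched in a few lines of plain Python. This is only an illustration of the rule as stated, not hitlist's code: the function name and its two boolean inputs are hypothetical, and it assumes EBV-LCLs also carry the general src_cell_line flag.

```python
def cell_line_flags(is_cell_line: bool, is_ebv_lcl: bool) -> dict[str, bool]:
    """Flags a cultured-cell observation would receive under the key rule.

    Hypothetical helper: the inputs stand in for whatever hitlist derives
    from the IEDB/CEDAR row.
    """
    return {
        # Any cultured cell line (assumed to include EBV-LCLs).
        "src_cell_line": is_cell_line,
        # EBV-transformed B-cell lines get their own flag.
        "src_ebv_lcl": is_cell_line and is_ebv_lcl,
        # Every other cell line counts as cancer-derived.
        "src_cancer": is_cell_line and not is_ebv_lcl,
    }


hela_like = cell_line_flags(is_cell_line=True, is_ebv_lcl=False)
lcl_like = cell_line_flags(is_cell_line=True, is_ebv_lcl=True)
```

Here hela_like has src_cancer set regardless of the immunization label on the row, while lcl_like does not.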
Cancer-specific = found in cancer AND NOT found in healthy somatic tissue. Thymus, reproductive tissue, adjacent tissue, EBV-LCLs, and activated APCs do NOT disqualify.
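As a minimal sketch, the cancer-specific rule reduces to a two-flag check. The helper below is hypothetical and is not the is_cancer_specific function shipped in hitlist.curation; it assumes the per-peptide flags are available as a dict keyed by the flag names in the table above.

```python
def cancer_specific_from_flags(flags: dict[str, bool]) -> bool:
    """True when a peptide was seen in cancer and never in healthy somatic tissue.

    Hypothetical helper; only src_cancer and src_healthy_tissue matter,
    every other flag is ignored.
    """
    return flags.get("src_cancer", False) and not flags.get("src_healthy_tissue", False)


# Seen in tumor and in thymus: thymus does not disqualify.
thymus_case = {"src_cancer": True, "src_healthy_thymus": True}
# Seen in tumor and in healthy somatic tissue: disqualified.
somatic_case = {"src_cancer": True, "src_healthy_tissue": True}

print(cancer_specific_from_flags(thymus_case))
print(cancer_specific_from_flags(somatic_case))
```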
PMID curation overrides
Expert per-study overrides are stored in hitlist/data/pmid_overrides.yaml:
- pmid: 29557506
  label: "Neidert 2018 — Tübingen/Zurich biobank"
  override: healthy
  tissue_overrides:
    Blood: healthy        # blood bank donors
    Bone Marrow: healthy  # hip arthroplasty
    Colon: adjacent       # Visceral Surgery dept → likely CRC margin
    Kidney: adjacent      # Urology → likely nephrectomy
    Liver: adjacent       # likely HCC/met adjacent
Tissue-level overrides take priority over study-level overrides.
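The priority order can be illustrated with a small lookup. This is a sketch under assumptions: the entry is written as a plain dict standing in for the parsed YAML, and resolve_override is a hypothetical name, not part of hitlist's API.

```python
# Stand-in for one parsed entry from pmid_overrides.yaml (abbreviated).
entry = {
    "pmid": 29557506,
    "override": "healthy",
    "tissue_overrides": {"Blood": "healthy", "Colon": "adjacent"},
}


def resolve_override(entry: dict, tissue: str):
    """Return the tissue-level override if one exists, else the study-level one."""
    tissue_map = entry.get("tissue_overrides") or {}
    return tissue_map.get(tissue, entry.get("override"))


colon_label = resolve_override(entry, "Colon")    # tissue-level entry wins
spleen_label = resolve_override(entry, "Spleen")  # falls back to study level
```

A tissue without its own entry, such as the made-up "Spleen" lookup here, inherits the study-level value.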
Proteome mapping
Map peptides to source proteins with flanking context:
from hitlist.proteome import ProteomeIndex
# Human proteome (from pyensembl)
idx = ProteomeIndex.from_ensembl(release=112)
# Or combined human + viral
idx = ProteomeIndex.from_ensembl_plus_fastas(
    fasta_paths=["hpv16.fasta", "ebv.fasta"],
)
# Map peptides with 5-residue flanks
df = idx.map_peptides(["SLLMWITQC"], flank=5)
# → protein_id, gene_name, gene_id, position, n_flank, c_flank, n_sources, unique_n_flank, unique_c_flank
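To make the flank columns concrete, here is a sketch of flank extraction on a single sequence. It does not use ProteomeIndex and is not its implementation: find_with_flanks is a hypothetical helper, the protein string is a made-up toy sequence, and the 0-based position is an assumption about the convention.

```python
def find_with_flanks(protein: str, peptide: str, flank: int = 5):
    """Yield (position, n_flank, c_flank) for every occurrence of peptide.

    Flanks are truncated at the protein termini, so they may be shorter
    than `flank` residues.
    """
    start = protein.find(peptide)
    while start != -1:
        end = start + len(peptide)
        n_flank = protein[max(0, start - flank):start]
        c_flank = protein[end:end + flank]
        yield start, n_flank, c_flank
        # Continue one residue later so overlapping matches are found too.
        start = protein.find(peptide, start + 1)


toy_protein = "MAAAKKSLLMWITQCFLPVF"  # made-up sequence, not a real protein
matches = list(find_with_flanks(toy_protein, "SLLMWITQC", flank=5))
print(matches)  # [(6, 'AAAKK', 'FLPVF')]
```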
Per-sample peptidome context
The full peptidome context for each sample is critical for interpreting whether a peptide's presence is meaningful:
from hitlist.scanner import scan
from hitlist.samples import sample_peptidomes, overlay_targets
# Full scan (ALL peptides, not just targets)
full = scan(peptides=None, iedb_path="mhc_ligand_full.csv", mhc_class="I")
# Per-sample stats
samples = sample_peptidomes(full)
# Overlay CTA peptides for context fractions
# "1 CTA out of 762 peptides = 0.13% = stochastic noise"
context = overlay_targets(full, target_peptides=my_cta_set, label="cta")
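The context fraction quoted in the comment above is simple arithmetic. A minimal sketch, assuming only that the fraction is target peptides divided by all peptides in the sample (context_fraction is a hypothetical helper, not a hitlist function):

```python
def context_fraction(n_target: int, n_total: int) -> float:
    """Share of a sample's peptidome made up of target peptides."""
    if n_total == 0:
        return 0.0  # empty sample: no meaningful fraction
    return n_target / n_total


frac = context_fraction(1, 762)
print(f"{frac:.2%}")  # 0.13%
```

A single target peptide in a sample of several hundred gives a fraction this small, which is why the per-sample total matters when reading a hit.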
Data management
hitlist data available # show all 14 known datasets
hitlist data fetch hpv16 # auto-download viral proteome from UniProt
hitlist data register iedb /path/to/file # register manual download
hitlist data list # show registered datasets with size/date
hitlist data info iedb # detailed JSON metadata
hitlist data path iedb # resolve to file path
hitlist data refresh hpv16 # re-download
hitlist data remove iedb # unregister
Storage: ~/.hitlist/ (override with HITLIST_DATA_DIR env var).
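The directory lookup presumably amounts to an environment check with a home-directory fallback. The sketch below shows that pattern with the standard library only; resolve_data_dir is a hypothetical name and hitlist's own resolution logic may differ.

```python
import os
from pathlib import Path


def resolve_data_dir(env=None) -> Path:
    """Return HITLIST_DATA_DIR if set and non-empty, else ~/.hitlist."""
    env = os.environ if env is None else env
    override = env.get("HITLIST_DATA_DIR")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".hitlist"


print(resolve_data_dir())
```

Passing a dict as env makes the function easy to exercise without touching the real environment.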
Development
./develop.sh # install in dev mode
./format.sh # ruff format
./lint.sh # ruff check + format check
./test.sh # pytest with coverage
./deploy.sh # lint + test + build + upload to PyPI