hitlist

Curated mass spectrometry evidence for MHC ligand data from IEDB and CEDAR
A curated, harmonized, ML-training-ready MHC ligand mass-spectrometry dataset.
hitlist ingests immunopeptidome data from IEDB, CEDAR, and paper supplementary tables (PRIDE/jPOSTrepo); filters out binding-assay data; joins every observation to expert-curated sample metadata (HLA genotype, tissue, disease, perturbation, instrument); and ships the result as a single Parquet file plus a pandas-friendly Python API.
As of v1.2.0 (human MHC-I, MS-eluted only):
| Metric | Value |
|---|---|
| Unique peptides | 748,386 |
| Observations | 2,672,046 |
| Mono-allelic (exact allele) | 579,096 obs / 300K peptides / 119 alleles |
| Multi-allelic with allele match | 450K obs |
| Multi-allelic with class-pool (N of M alleles) | 784K obs |
| Curated PMIDs | 155 (89.5% of all observations) |
| Allele-resolved sample_mhc coverage | 74.8% |
Across all species and classes: 4.05M observations, 1.29M peptides, 21 species.
Install
```
pip install hitlist
```
Quick start for ML training
```
# One-time: register IEDB + CEDAR downloads and build the observations table
hitlist data register iedb /path/to/mhc_ligand_full.csv
hitlist data register cedar /path/to/cedar-mhc-ligand-full.csv
hitlist data build   # ~3 min, writes ~/.hitlist/observations.parquet

# Export training-ready CSVs
hitlist export observations --class I --species "Homo sapiens" --mono-allelic \
    --min-allele-resolution four_digit -o mono_allelic_classI.csv
hitlist export observations --class II --species "Homo sapiens" \
    -o multi_allelic_classII.csv
```
Binding-assay data (peptide microarrays, refolding, MEDi display) is excluded by default — the observations table contains only MS-eluted immunopeptidome data.
Python API
```python
from hitlist.export import generate_observations_table

# Mono-allelic human class I: 579K observations with ground-truth allele
mono = generate_observations_table(
    mhc_class="I",
    species="Homo sapiens",
    is_mono_allelic=True,
    min_allele_resolution="four_digit",
)

# Multi-allelic with at least allele-pool info (74.8% of all rows)
multi = generate_observations_table(mhc_class="I", species="Homo sapiens")
multi_with_alleles = multi[multi["sample_mhc"].str.strip() != ""]
```
Species filters accept any variant — "Homo sapiens", "human", "homo_sapiens", "Homo sapiens (human)" all work.
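Under the hood this amounts to canonicalizing the species string before filtering. A minimal sketch of that idea (hitlist delegates the real work to mhcgnomes; `canonical_species` and the alias table here are invented for illustration):

```python
# Sketch of species-variant normalization (hypothetical helper; the real
# package resolves species names via mhcgnomes).
_SPECIES_ALIASES = {
    "homo sapiens": "Homo sapiens",
    "human": "Homo sapiens",
    "homo sapiens (human)": "Homo sapiens",
}

def canonical_species(name: str) -> str:
    """Map any accepted species variant to its canonical form."""
    key = name.strip().lower().replace("_", " ")
    return _SPECIES_ALIASES.get(key, name)
```

Unknown species fall through unchanged, so the filter still works for names with no registered alias.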
Output schema
Each row of generate_observations_table() has (among others):
| Column | Meaning |
|---|---|
| `peptide` | Amino acid sequence |
| `mhc_restriction` | Allele from IEDB (may be "HLA class I" for multi-allelic studies) |
| `sample_mhc` | Allele(s) known for the sample the peptide came from; the useful field for training |
| `mhc_class` | I, II, or non classical |
| `mhc_species` | Canonical species (normalized via mhcgnomes) |
| `is_monoallelic` | True if sample has a single transfected allele (721.221, C1R, K562, MAPTAC…) |
| `has_peptide_level_allele` | True if mhc_restriction is a specific allele (not "HLA class I") |
| `is_potential_contaminant` | True for MS-eluted peptides that failed NetMHCpan binding prediction (supplementary only) |
| `sample_match_type` | How sample_mhc was populated (see below) |
| `matched_sample_count` | Number of curated samples for this PMID |
| `src_cancer`, `src_healthy_tissue`, `src_ebv_lcl`, ... | Mutually exclusive biological source categories |
| `source` | iedb, cedar, or supplement |
| `source_organism`, `reference_title`, `cell_name`, `source_tissue`, `disease` | IEDB sample context |
| `instrument`, `instrument_type`, `acquisition_mode`, `fragmentation`, `labeling`, `ip_antibody` | MS acquisition from ms_samples curation |
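The `has_peptide_level_allele` distinction reduces to checking whether the recorded restriction is a specific allele or a class-only label. A minimal sketch (illustrative only; the actual implementation inside hitlist may differ, and the label set here is an assumption):

```python
# Class-only restriction labels as they appear in IEDB exports (assumed set
# for illustration; the real package may recognize more variants).
CLASS_ONLY_LABELS = {"HLA class I", "HLA class II", "MHC class I", "MHC class II"}

def has_peptide_level_allele(mhc_restriction: str) -> bool:
    """True when a specific allele was recorded, not just a class label."""
    return bool(mhc_restriction) and mhc_restriction not in CLASS_ONLY_LABELS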
sample_match_type — join provenance
| Value | Meaning | Training-grade? |
|---|---|---|
allele_match |
IEDB recorded a specific allele and it matched a curated sample genotype | Yes — high confidence |
single_sample_fallback |
IEDB class-only but study has exactly 1 sample, so sample_mhc = that sample's full genotype |
Yes (for deconvolution) |
pmid_class_pool |
IEDB class-only and study has multiple samples — sample_mhc = union of all class-matching alleles across all samples |
Yes (for deconvolution), lower precision |
unmatched |
No curated sample for this PMID, or all samples have mhc: unknown |
No — sample_mhc empty |
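Downstream, this provenance column makes it easy to keep only training-grade joins. A minimal sketch over plain dict rows (real usage would apply the same filter to the pandas DataFrame; `TRAINING_GRADE` is a name invented here):

```python
# Keep only rows whose sample_mhc join is usable for training, per the
# sample_match_type provenance values (sketch, not the package API).
TRAINING_GRADE = {"allele_match", "single_sample_fallback", "pmid_class_pool"}

def training_rows(rows):
    return [r for r in rows if r["sample_match_type"] in TRAINING_GRADE]

rows = [
    {"peptide": "SLLMWITQC", "sample_match_type": "allele_match"},
    {"peptide": "GILGFVFTL", "sample_match_type": "unmatched"},
]
kept = training_rows(rows)  # the unmatched row is dropped
```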
Curation layer
155 PMIDs are curated in `hitlist/data/pmid_overrides.yaml` with per-sample HLA typing, tissue, perturbation, and instrument metadata. Supplementary data (PRIDE / paper tables) is ingested via `hitlist/data/supplementary.yaml`; currently this covers the full Gomez-Zepeda 2024 panel (JY, Raji, HeLa, SK-MEL-37, plasma).
Every observation is classified by mutually-exclusive biological source category:
| Category | Flag | Rule |
|---|---|---|
| Cancer | `src_cancer` | Tumor tissue, cancer patient biofluids, or non-EBV cell lines |
| Adjacent to tumor | `src_adjacent_to_tumor` | Surgically resected "normal" tissue (per-PMID override) |
| Activated APC | `src_activated_apc` | Monocyte-derived DCs/macrophages with pharmacological activation |
| Healthy somatic | `src_healthy_tissue` | Direct ex vivo, healthy donor, non-reproductive, non-thymic |
| Healthy thymus | `src_healthy_thymus` | Direct ex vivo thymus (expected for CTAs, AIRE-mediated) |
| Healthy reproductive | `src_healthy_reproductive` | Direct ex vivo testis or ovary (expected for CTAs) |
| EBV-LCL | `src_ebv_lcl` | EBV-transformed B-cell lines |
| Cell line | `src_cell_line` | Any cultured cell line |
Cancer-specific = src_cancer AND NOT src_healthy_tissue. Thymus, reproductive tissue, adjacent tissue, EBV-LCLs, and activated APCs do NOT disqualify a peptide from being cancer-specific.
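That rule can be written as a predicate over the source flags (a sketch; `is_cancer_specific` is a hypothetical helper, not part of the package API):

```python
def is_cancer_specific(row: dict) -> bool:
    """Cancer-specific = src_cancer AND NOT src_healthy_tissue.

    Other categories (thymus, reproductive, adjacent-to-tumor, EBV-LCL,
    activated APC) deliberately do not disqualify the peptide."""
    return bool(row.get("src_cancer")) and not row.get("src_healthy_tissue")
```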
Flanking context (for cleavage models)
```python
from hitlist.proteome import ProteomeIndex

# Human proteome, 10aa flanks
idx = ProteomeIndex.from_ensembl(release=112)
flanking = idx.map_peptides(["SLLMWITQC", "GILGFVFTL"], flank=10)

# Or include viral/custom FASTAs
idx = ProteomeIndex.from_ensembl_plus_fastas(
    release=112,
    fasta_paths=["hpv16.fasta", "ebv.fasta", "influenza_a.fasta"],
)
```
Alternatively, build the observations table with flanking pre-computed:
```
# Auto-fetches reference proteomes for the species in the data (12 curated species
# + 14 curated viral proteomes). Each peptide maps against its own organism's
# proteome, not a pooled union.
hitlist data build --with-flanking

# For broader coverage (bacteria, parasites, rare viruses, plants) use
# --use-uniprot — resolved organisms are cached in the manifest.
hitlist data build --with-flanking --use-uniprot

# Just fetch proteomes without rebuilding observations
hitlist data fetch-proteomes --min-observations 100 --use-uniprot
hitlist data list-proteomes
```
This adds `gene_name`, `gene_id`, `protein_id`, `position`, `n_flank`, `c_flank`,
and flanking_species columns. A peptide recorded under a strain name
("Mycobacterium tuberculosis H37Rv") is resolved to its parent species' reference
proteome (UP000001584) and cached there — repeat lookups skip the network.
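The `n_flank`/`c_flank` semantics can be illustrated with a toy lookup against a single protein string (a sketch assuming flanks are truncated at the protein termini; `find_flanks` is invented for illustration, the real mapping runs against the species-resolved reference proteome):

```python
def find_flanks(peptide: str, protein_seq: str, flank: int = 10):
    """Locate a peptide in a protein; return (position, n_flank, c_flank).

    Position is 0-based; flanks shorter than `flank` are returned as-is
    when the peptide sits near a protein terminus (sketch)."""
    pos = protein_seq.find(peptide)
    if pos == -1:
        return None
    n_flank = protein_seq[max(0, pos - flank):pos]
    c_flank = protein_seq[pos + len(peptide):pos + len(peptide) + flank]
    return pos, n_flank, c_flank

protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
hit = find_flanks("QRQISFVKS", protein, flank=5)  # (8, "AYIAK", "HFSRQ")
```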
CLI reference
```
hitlist data register iedb /path/to/file.csv      # register source
hitlist data build [--force] [--with-flanking]    # build observations.parquet
hitlist data list                                 # inventory + index cache status
hitlist data available                            # known datasets
hitlist data fetch hpv16                          # auto-download viral proteome

hitlist export observations --class I --species human --mono-allelic \
    --min-allele-resolution four_digit -o train.csv
hitlist export observations -o all.parquet        # parquet output supported
hitlist export samples --class I                  # per-sample conditions (YAML curation only)
hitlist export summary                            # species x class summary
hitlist export counts --source merged             # peptide counts per study
hitlist export alleles                            # validate YAML alleles with mhcgnomes
```
Filter flags on hitlist export observations
| Flag | Values |
|---|---|
| `--class` | I, II, non classical |
| `--species` | Any species variant (normalized via mhcgnomes) |
| `--mono-allelic` / `--multi-allelic` | Filter on is_monoallelic |
| `--instrument-type` | Orbitrap, timsTOF, TOF, QqQ, ... |
| `--acquisition-mode` | DDA, DIA, PRM |
| `--min-allele-resolution` | four_digit, two_digit, serological, class_only |
| `--output` / `-o` | .csv or .parquet |
Development
```
./develop.sh   # install in dev mode
./format.sh    # ruff format
./lint.sh      # ruff check + format check
./test.sh      # pytest with coverage
./deploy.sh    # lint + test + build + upload to PyPI
```
See docs/pmid-curation.md for the curation YAML format and per-study overrides.