Curated mass spectrometry evidence for MHC ligand data from IEDB and CEDAR

Development Status
- 4 - Beta
Environment
- Console
Intended Audience
- Science/Research
License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Scientific/Engineering :: Bio-Informatics

Project description

hitlist

A carefully curated and harmonized source of truth for MHC ligand mass spectrometry data, built for pMHC target selection and model training.

hitlist ingests immunopeptidome data from IEDB, CEDAR, and supplementary sources (PRIDE), normalizes it into a unified schema, and annotates every observation with expert-curated sample metadata — biological source context, perturbation conditions, MHC class, species, cell line identity, and disease state. The goal is a single, auditable dataset that downstream tools (binding predictors, antigen prioritization pipelines, cleavage models) can consume without re-curating the same papers.

34 studies are curated across 7 species, with per-sample perturbation conditions, MHC class I and II coverage, and allele-level HLA typing. All MHC allele strings are checked with mhcgnomes.

From local IEDB + CEDAR data:

- 1.88M unique human class I peptides across 1,449 studies
- 790K unique human class II peptides across 805 studies
- 24 species with MHC ligand MS data
- 790 unique MHC allele strings, 789/790 valid in mhcgnomes

Install

pip install hitlist

Quick start

# Register your IEDB/CEDAR downloads
hitlist data register iedb /path/to/mhc_ligand_full.csv
hitlist data register cedar /path/to/cedar-mhc-ligand-full.csv

# Build the search index (one-time, ~90s per file, cached as parquet)
hitlist data index

# Export curated sample metadata
hitlist export samples --class I -o class_i_samples.csv
hitlist export counts --source merged -o peptide_counts.csv
hitlist export summary

# Generate a data quality report
hitlist report --class I --output report.txt

What hitlist curates

Every IEDB/CEDAR mass spec observation is classified by:

Biological source context (mutually exclusive):

Category	Flag	Rule
Cancer	`src_cancer`	Tumor tissue, cancer patient biofluids, or non-EBV cell lines
Adjacent to tumor	`src_adjacent_to_tumor`	Surgically resected "normal" tissue (per-PMID override)
Activated APC	`src_activated_apc`	Monocyte-derived DCs/macrophages with pharmacological activation
Healthy somatic	`src_healthy_tissue`	Direct ex vivo, healthy donor, non-reproductive, non-thymic
Healthy thymus	`src_healthy_thymus`	Direct ex vivo thymus (expected for CTAs, AIRE-mediated)
Healthy reproductive	`src_healthy_reproductive`	Direct ex vivo testis, ovary, etc. (expected for CTAs)
EBV-LCL	`src_ebv_lcl`	EBV-transformed B-cell lines
Cell line	`src_cell_line`	Any cultured cell line

Per-sample metadata (from YAML overrides):

Field	Examples
MHC class	I, II, I+II, non-classical
Species	Human, mouse, cattle, dog, chicken, rhesus, pig, fish
Perturbation	CRISPR KO (13 genes), decitabine, IFN-gamma, viral infection, HLA-DM editing, SILAC cross-presentation, TIL expansion
HLA alleles	Per-donor or per-cell-line 4-digit typing
Cell line identity	721.221, HAP1, THP-1, etc. with mono-allelic detection
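The per-sample fields above can be pictured as one record per curated sample. The following is only an illustrative sketch: `SampleRecord` and `matches_class` are hypothetical names, not part of hitlist, and treating an "I+II" sample as matching both class filters is an assumption rather than documented behavior.

```python
from dataclasses import dataclass, field


@dataclass
class SampleRecord:
    """Hypothetical container for one curated sample; field names are illustrative."""

    mhc_class: str  # "I", "II", "I+II", or "non-classical"
    species: str
    perturbation: str = "unperturbed"
    hla_alleles: list = field(default_factory=list)


def matches_class(sample: SampleRecord, wanted: str) -> bool:
    """Assumption: an "I+II" sample satisfies both a class I and a class II filter."""
    return wanted in sample.mhc_class.split("+")


samples = [
    SampleRecord("I", "Human", hla_alleles=["HLA-A*02:01"]),
    SampleRecord("I+II", "Human", perturbation="IFN-gamma"),
    SampleRecord("II", "mouse"),
]

# Keep only samples that carry class I data.
class_i = [s for s in samples if matches_class(s, "I")]
print(len(class_i))
```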

Key design rules

- All non-EBV cell lines are cancer-derived, even when IEDB labels them "No immunization"
- Cancer-specific = found in cancer AND NOT found in healthy somatic tissue
- Thymus, reproductive tissue, adjacent tissue, EBV-LCLs, and activated APCs do NOT disqualify a peptide from being cancer-specific
- Perturbation conditions are tracked per sample: gene KOs, drug treatments, cytokine stimulation, viral infection, and other modifications that alter the immunopeptidome
- MHC class I and II are tracked separately; filter with --class I or --class II
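The cancer-specific rule above reduces to a check on the set of source flags under which a peptide was observed. Here is a minimal sketch, assuming each peptide has already been summarized as a set of `src_*` flags. The helper `is_cancer_specific_from_flags` is hypothetical and is not the same thing as hitlist's own `is_cancer_specific`.

```python
# Hypothetical helper illustrating the rule; not part of hitlist's API.
# Only healthy somatic tissue disqualifies a peptide.
DISQUALIFYING = {"src_healthy_tissue"}


def is_cancer_specific_from_flags(flags: set) -> bool:
    """True when a peptide was seen in cancer and never in healthy somatic tissue."""
    return "src_cancer" in flags and not (flags & DISQUALIFYING)


# Seen in tumor and thymus: thymus does not disqualify.
print(is_cancer_specific_from_flags({"src_cancer", "src_healthy_thymus"}))
# Seen in tumor and healthy somatic tissue: not cancer-specific.
print(is_cancer_specific_from_flags({"src_cancer", "src_healthy_tissue"}))
```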

PMID curation overrides

Expert per-study overrides are defined in hitlist/data/pmid_overrides.yaml:

- pmid: 29557506
  label: "Neidert 2018 — Tübingen/Zurich biobank"
  override: healthy
  rules:
    - condition:
        Source Tissue: [Blood, Bone Marrow, Cerebellum]
      override: healthy
      reason: "Blood bank donors and autopsy CNS material"
    - condition:
        Source Tissue: [Colon, Kidney, Liver, Lung]
      override: adjacent
      reason: "Visceral Surgery — likely cancer resection margins"
  ms_samples:
    - type: "blood (buffy coat)"
      condition: "unperturbed"
      mhc_class: "I"
      classification: healthy

See docs/pmid-curation.md for the full list of 34 curated studies, perturbation categories, and export commands.
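To make the rule structure concrete, here is a sketch of how one row might be matched against the YAML entry above. It assumes rules are tried in order and that the study-level `override` is the fallback; hitlist's real precedence may differ. `resolve_override` is a hypothetical name, and the dict is a trimmed, hand-written equivalent of the parsed YAML.

```python
# Trimmed, hand-written equivalent of the parsed YAML entry for PMID 29557506.
STUDY = {
    "pmid": 29557506,
    "override": "healthy",
    "rules": [
        {
            "condition": {"Source Tissue": ["Blood", "Bone Marrow", "Cerebellum"]},
            "override": "healthy",
        },
        {
            "condition": {"Source Tissue": ["Colon", "Kidney", "Liver", "Lung"]},
            "override": "adjacent",
        },
    ],
}


def resolve_override(study: dict, row: dict) -> str:
    """Return the override of the first rule whose conditions all match the row.

    Assumption: first match wins, and the study-level override applies when
    no rule matches.
    """
    for rule in study.get("rules", []):
        conditions = rule["condition"]
        if all(row.get(col) in allowed for col, allowed in conditions.items()):
            return rule["override"]
    return study["override"]


print(resolve_override(STUDY, {"Source Tissue": "Liver"}))
```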

Proteome mapping

Map peptides to source proteins with flanking context:

from hitlist.proteome import ProteomeIndex

# Human + viral proteomes, 10aa flanking
idx = ProteomeIndex.from_ensembl_plus_fastas(
    release=112,
    fasta_paths=["hpv16.fasta", "ebv.fasta", "influenza_a.fasta"],
)
df = idx.map_peptides(["SLLMWITQC", "GILGFVFTL"], flank=10)
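For readers unfamiliar with the term, "flanking context" means the residues immediately upstream and downstream of a peptide within its source protein. The sketch below shows the idea on a toy sequence using plain string search. It does not reproduce how `ProteomeIndex` works internally, and the `flanks` function is hypothetical.

```python
def flanks(protein: str, peptide: str, flank: int = 10):
    """Yield (start, n_flank, c_flank) for every occurrence of peptide in protein."""
    start = protein.find(peptide)
    while start != -1:
        end = start + len(peptide)
        # Flanks are truncated at the protein termini.
        n_flank = protein[max(0, start - flank):start]
        c_flank = protein[end:end + flank]
        yield start, n_flank, c_flank
        start = protein.find(peptide, start + 1)


# Toy sequence, not a real protein.
toy = "MKTAAAGILGFVFTLCCCDDD"
print(list(flanks(toy, "GILGFVFTL", flank=3)))
```

Flanks come out shorter than requested when the peptide sits near either end of the protein, which matters for cleavage models that expect fixed-width context.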

Data management and indexing

hitlist data available            # show all known datasets
hitlist data fetch hpv16          # auto-download viral proteome
hitlist data register iedb /path  # register manual download
hitlist data list                 # show datasets + index cache status
hitlist data index                # build/rebuild parquet index
hitlist data index --force        # force re-index
hitlist data info iedb            # detailed metadata

The index is cached as parquet in ~/.hitlist/index/ and reused when the source CSV hasn't changed. First index: ~90s for 7.7 GB. Subsequent reads: <1s.
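The README does not say how hitlist decides that a source CSV is unchanged. One common approach, shown here purely as an assumption, is to record a cheap fingerprint (file size and modification time) when the index is built and compare it on later runs. `fingerprint` and `cache_is_fresh` are hypothetical names.

```python
import os
import tempfile
from typing import Optional, Tuple


def fingerprint(path: str) -> Tuple[int, int]:
    """Cheap change detector: (size in bytes, mtime in nanoseconds)."""
    st = os.stat(path)
    return st.st_size, st.st_mtime_ns


def cache_is_fresh(source_csv: str, recorded: Optional[Tuple[int, int]]) -> bool:
    """True when the fingerprint recorded at index time still matches the file."""
    return recorded is not None and recorded == fingerprint(source_csv)


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "mhc_ligand_full.csv")
    with open(csv_path, "w") as fh:
        fh.write("peptide\nSLYNTVATL\n")
    recorded = fingerprint(csv_path)  # stored alongside the cached index
    fresh_before = cache_is_fresh(csv_path, recorded)
    with open(csv_path, "a") as fh:  # the source file grows
        fh.write("GILGFVFTL\n")
    fresh_after = cache_is_fresh(csv_path, recorded)

print(fresh_before, fresh_after)
```

A size-plus-mtime check avoids rereading a multi-gigabyte CSV just to hash it, at the cost of missing edits that preserve both values.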

Export commands

hitlist export samples                     # per-sample conditions table
hitlist export samples --class I           # MHC class I only
hitlist export counts --source merged      # real peptide counts from IEDB+CEDAR
hitlist export counts --source all         # IEDB vs CEDAR side-by-side
hitlist export summary                     # species x class summary
hitlist export alleles                     # validate YAML alleles with mhcgnomes
hitlist export data-alleles                # validate all IEDB/CEDAR alleles

Python API

from hitlist.scanner import scan
from hitlist.curation import classify_ms_row, is_cancer_specific
from hitlist.aggregate import aggregate_per_peptide
from hitlist.indexer import get_index
from hitlist.export import generate_ms_samples_table, count_peptides_by_study

# Scan for specific peptides with source classification
hits = scan(peptides={"SLYNTVATL"}, iedb_path="mhc_ligand_full.csv", mhc_class="I")

# Per-peptide summary
summary = aggregate_per_peptide(hits)

# Cached index for fast counts (parquet)
study_df, allele_df = get_index("merged")

# Curated sample metadata from YAML
samples = generate_ms_samples_table(mhc_class="I")

Development

./develop.sh    # install in dev mode
./format.sh     # ruff format
./lint.sh       # ruff check + format check
./test.sh       # pytest with coverage
./deploy.sh     # lint + test + build + upload to PyPI

Release history

Version	Release date
1.10.0	Apr 18, 2026
1.9.0	Apr 18, 2026
1.8.9	Apr 17, 2026
1.8.4	Apr 16, 2026
1.8.3	Apr 16, 2026
1.8.2	Apr 16, 2026
1.8.1	Apr 16, 2026
1.8.0	Apr 15, 2026
1.7.5	Apr 15, 2026
1.7.3	Apr 15, 2026
1.7.2	Apr 15, 2026
1.7.1	Apr 15, 2026
1.7.0	Apr 14, 2026
1.6.0	Apr 13, 2026
1.5.1	Apr 13, 2026
1.5.0	Apr 13, 2026
1.4.5	Apr 13, 2026
1.4.1	Apr 13, 2026
1.4.0	Apr 12, 2026
1.3.0	Apr 12, 2026
1.2.0	Apr 12, 2026
1.1.0	Apr 11, 2026
1.0.3	Apr 11, 2026
1.0.2	Apr 11, 2026
1.0.1	Apr 11, 2026
1.0.0	Apr 11, 2026
0.9.8	Apr 9, 2026
0.9.7	Apr 9, 2026
0.9.6 (this version)	Apr 9, 2026
0.9.5	Apr 9, 2026
0.9.4	Apr 8, 2026
0.9.3	Apr 8, 2026
0.9.2	Apr 8, 2026
0.9.1	Apr 8, 2026
0.9.0	Apr 7, 2026
0.8.0	Apr 6, 2026
0.7.0	Apr 5, 2026
0.6.0	Apr 5, 2026
0.5.1	Apr 5, 2026
0.5.0	Apr 1, 2026
0.4.0	Mar 30, 2026
0.3.0	Mar 30, 2026
0.2.0	Mar 30, 2026
0.1.0	Mar 30, 2026

Download files

Source Distribution

hitlist-0.9.6.tar.gz (84.4 kB)

Uploaded Apr 9, 2026 Source

Built Distribution

hitlist-0.9.6-py3-none-any.whl (83.6 kB)

Uploaded Apr 9, 2026 Python 3

File details

Details for the file hitlist-0.9.6.tar.gz.

File metadata

Download URL: hitlist-0.9.6.tar.gz
Upload date: Apr 9, 2026
Size: 84.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for hitlist-0.9.6.tar.gz
Algorithm	Hash digest
SHA256	`77388d440e18631c32730e97246a66ab83dc06a7379f3ccf7e45ac0761b8523d`
MD5	`cf305c0789bc57ae47f9e8c7d7b31fff`
BLAKE2b-256	`8b7c6b094544973c86f624dcc2b054635020ed29ad935a28985134b2432d826f`

File details

Details for the file hitlist-0.9.6-py3-none-any.whl.

File metadata

Download URL: hitlist-0.9.6-py3-none-any.whl
Upload date: Apr 9, 2026
Size: 83.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for hitlist-0.9.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`96bf8092f5c5cb338099afc6af6f4d12577385d9430a6b1e08efb259b31607c5`
MD5	`a92b2bf60dd54179b570d9c0f664c918`
BLAKE2b-256	`e3ffe645243c759f586bca26d141b216a73bd1afd2e7d8eb129c424cce656f79`
