pepmatch

Search tool for peptides and epitopes within a proteome, while considering potential residue substitutions.

Author: Daniel Marrama

PEPMatch is a high-performance Python tool designed to find short peptide sequences within a reference proteome or other large protein sets. It is optimized for speed and flexibility, supporting exact matches, searches with a defined number of residue substitutions (mismatches), and a "best match" mode to find the most likely hit.

To encourage competition on tool performance, we created a benchmarking framework, with instructions here.

Key Features

Versatile Searching: Find exact matches, matches with a specified tolerance for mismatches, or the single best match for each query peptide.
Discontinuous Epitope Support: Search for non-contiguous residues in the format "R377, Q408, Q432, ...".
High Performance: Uses a k-mer indexing strategy for rapid searching. Mismatch detection relies on a C-based Hamming distance calculation.
Optimized Preprocessing: Employs a two-step process. Proteomes are preprocessed once into a format optimized for the search type (SQLite for exact matching, Pickle for mismatching), making subsequent searches extremely fast.
Parallel Processing: Built-in support for multicore processing to handle large query sets efficiently.
Flexible I/O: Accepts queries from FASTA files or Python lists and can output results to multiple formats, including CSV, TSV, XLSX, JSON, or directly as a Polars DataFrame.
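
The two building blocks named above, k-mer splitting and Hamming distance, can be illustrated in a few lines of plain Python. This is only a sketch for intuition: the function names are hypothetical, and PEPMatch's own Hamming distance is implemented in C rather than as shown here.

```python
from typing import List, Tuple


def split_kmers(sequence: str, k: int) -> List[Tuple[str, int]]:
    """Return every overlapping k-mer with its 0-based start position."""
    return [(sequence[i:i + k], i) for i in range(len(sequence) - k + 1)]


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Toy inputs, chosen only for illustration
kmers = split_kmers("MKTAYIAK", 3)
distance = hamming_distance("YLLDLHSYL", "YLLDAHSYL")
print(kmers)
print(distance)
```

A sequence of length n yields n - k + 1 overlapping k-mers, which is why smaller k values produce larger indexes.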

Requirements

Python 3.7+
Polars
Biopython

Installation

pip install pepmatch

Core Engine

PEPMatch operates using a two-step workflow:

Preprocessing: First, the target proteome is processed into an indexed format. This step only needs to be performed once per proteome and k-mer size. PEPMatch uses SQLite databases for the speed of indexed lookups in exact matching and serialized Python objects (pickle) for the flexibility needed in mismatch searching.
Matching: The user's query peptides are then searched against the preprocessed proteome.

This design ensures that the time-intensive task of parsing and indexing the proteome is separated from the search itself, allowing for rapid and repeated querying.
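
As a rough illustration of the two-step workflow, the sketch below builds a toy k-mer index in both storage styles using only the standard library, then queries each. The table layout, column names, and toy sequences are assumptions made for this example and do not reflect the files PEPMatch actually writes.

```python
import pickle
import sqlite3
from collections import defaultdict

# Toy "proteome" (hypothetical IDs and sequences)
proteome = {"P1": "MKTAYIAKQR", "P2": "GSHMKTAYLL"}
k = 5

# Step 1a: exact-match style index, stored as rows in an indexed SQLite table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kmers (kmer TEXT, protein TEXT, position INTEGER)")
rows = [
    (seq[i:i + k], pid, i)
    for pid, seq in proteome.items()
    for i in range(len(seq) - k + 1)
]
conn.executemany("INSERT INTO kmers VALUES (?, ?, ?)", rows)
conn.execute("CREATE INDEX idx_kmer ON kmers (kmer)")

# Step 1b: mismatch style index, a plain dict serialized with pickle
kmer_dict = defaultdict(list)
for pid, seq in proteome.items():
    for i in range(len(seq) - k + 1):
        kmer_dict[seq[i:i + k]].append((pid, i))
blob = pickle.dumps(dict(kmer_dict))

# Step 2: matching only touches the prebuilt index, never the raw FASTA
sql_hits = conn.execute(
    "SELECT protein, position FROM kmers WHERE kmer = ? ORDER BY protein, position",
    ("MKTAY",),
).fetchall()
pickle_hits = pickle.loads(blob)["MKTAY"]
print(sql_hits, pickle_hits)
```

Because the index depends on k, a new preprocessing run is needed whenever the k-mer size changes.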

Command-Line Usage

The tool provides two CLI commands: pepmatch-preprocess and pepmatch-match.

1. Preprocessing

The pepmatch-preprocess command builds the necessary database from your proteome FASTA file.

For exact matching (0 mismatches), use the sql format.
For mismatch matching, use the pickle format.

# Preprocess for an exact match search using 5-mers
pepmatch-preprocess -p human.fasta -k 5 -f sql

# Preprocess for a mismatch search using 3-mers
pepmatch-preprocess -p human.fasta -k 3 -f pickle

Flags

-p, --proteome (Required): Path to the proteome FASTA file.
-k, --kmer_size (Required): The k-mer size to use for indexing.
-f, --preprocess_format (Required): The format for the preprocessed database (sql or pickle).
-n: A custom name for the proteome.
-P: Path to the directory to save preprocessed files.
-g: Path to a gene priority proteome file (a UniProt-specific file with one protein per gene, used to prioritize matches later).
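
When preprocessing is scripted, for example for several proteomes or k-mer sizes, the command can be assembled in Python. The helper names below are hypothetical; only the three required flags documented above are used, and the format is picked from the rule that exact searches use sql and mismatch searches use pickle.

```python
from typing import List


def preprocess_format(max_mismatches: int) -> str:
    """Exact searches (0 mismatches) use 'sql'; mismatch searches use 'pickle'."""
    return "sql" if max_mismatches == 0 else "pickle"


def preprocess_command(proteome: str, k: int, max_mismatches: int) -> List[str]:
    """Build the pepmatch-preprocess argument list from the documented flags."""
    return [
        "pepmatch-preprocess",
        "-p", proteome,
        "-k", str(k),
        "-f", preprocess_format(max_mismatches),
    ]


exact_cmd = preprocess_command("human.fasta", 5, 0)
mismatch_cmd = preprocess_command("human.fasta", 3, 3)
print(" ".join(exact_cmd))
print(" ".join(mismatch_cmd))
# To execute, pass the list to subprocess.run(exact_cmd, check=True)
# on a machine where pepmatch is installed.
```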

2. Matching

The pepmatch-match command runs the search against a preprocessed proteome.

# Find exact matches (-m 0) using the preprocessed 5-mer database
pepmatch-match -q peptides.fasta -p human.fasta -m 0 -k 5

# Find matches with up to 3 mismatches (-m 3) using the 3-mer database
pepmatch-match -q neoepitopes.fasta -p human.fasta -m 3 -k 3

Flags

-q, --query (Required): Path to the query peptide FASTA file.
-p, --proteome_file (Required): Path to the original proteome FASTA file.
-m: Maximum number of mismatches allowed (e.g., 0 for exact).
-k: The k-mer size to use (must match the preprocessed file).
-P: Path to the directory containing preprocessed files.
-b: Enable "best match" mode.
-f: Output format (csv, tsv, xlsx, json). Defaults to csv.
-o: Name of the output file (do not include the file extension, e.g. .csv).
-v: Disable sequence versioning (e.g., for protein ID P05067.1, the ".1" suffix is removed).
-n: Number of parallel processing jobs (CPU cores) to use.
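
To make the effect of the -v flag concrete, here is a minimal sketch of stripping a version suffix from a protein ID. The function is hypothetical and only mirrors the behavior described above; it is not taken from the PEPMatch source.

```python
def strip_version(protein_id: str) -> str:
    """Drop a trailing numeric version such as '.1'; leave other IDs untouched."""
    base, dot, suffix = protein_id.rpartition(".")
    return base if dot and suffix.isdigit() else protein_id


print(strip_version("P05067.1"))
print(strip_version("P05067"))
```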

Python API Usage

For more control and integration into other workflows, PEPMatch provides a simple Python API.

1. Exact Matching

from pepmatch import Preprocessor, Matcher

# Preprocess the proteome into a SQLite DB for exact matching
Preprocessor('proteomes/human.fasta').sql_proteome(k=5)

# Initialize the Matcher for an exact search (0 mismatches)
matcher = Matcher(
  query='queries/mhc-ligands-test.fasta',
  proteome_file='proteomes/human.fasta',
  max_mismatches=0,
  k=5
)

# Run the search and get results
results_df = matcher.match()
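
For intuition about what an exact search does with a k-mer index, the sketch below looks up the first k-mer of the query as a seed and then verifies the full peptide at each candidate position. It is a simplified, assumed model with hypothetical names and toy sequences, not the library's implementation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_index(proteome: Dict[str, str], k: int) -> Dict[str, List[Tuple[str, int]]]:
    """Map each k-mer to the (protein ID, 0-based position) pairs where it occurs."""
    index = defaultdict(list)
    for pid, seq in proteome.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((pid, i))
    return dict(index)


def exact_match(peptide: str, proteome: Dict[str, str],
                index: Dict[str, List[Tuple[str, int]]], k: int) -> List[Tuple[str, int]]:
    """Use the peptide's first k-mer as a seed, then verify the whole peptide."""
    hits = []
    for pid, pos in index.get(peptide[:k], []):
        if proteome[pid][pos:pos + len(peptide)] == peptide:
            hits.append((pid, pos))
    return hits


# Toy data: both proteins contain the seed "MKTAY", only P1 contains the peptide
toy_proteome = {"P1": "MKTAYIAKQRQISFVK", "P2": "GSHMKTAYLLDLHSYL"}
toy_index = build_index(toy_proteome, 5)
hits = exact_match("MKTAYIAKQ", toy_proteome, toy_index, 5)
print(hits)
```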

2. Mismatching

from pepmatch import Preprocessor, Matcher

# Preprocess the proteome into pickle files for mismatching
Preprocessor('proteomes/human.fasta').pickle_proteome(k=3)

# Initialize the Matcher to allow up to 3 mismatches
matcher = Matcher(
  query='queries/neoepitopes-test.fasta',
  proteome_file='proteomes/human.fasta',
  max_mismatches=3,
  k=3
)

results_df = matcher.match()
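
Pairing a small k with a higher mismatch count can be motivated by the pigeonhole principle: if a peptide of length L is cut into m + 1 non-overlapping k-mers, at most m of them can contain a mismatch, so at least one must match exactly. That requires k <= L // (m + 1); for a 15-residue peptide with 3 mismatches, k can be at most 3. The sketch below applies this idea to a single toy sequence. The function names are hypothetical and the code is a simplified illustration, not PEPMatch's actual search routine.

```python
from collections import defaultdict
from typing import List, Tuple


def max_seed_k(peptide_length: int, max_mismatches: int) -> int:
    """Largest k that guarantees one exact seed k-mer (pigeonhole bound)."""
    return peptide_length // (max_mismatches + 1)


def mismatch_search(peptide: str, sequence: str, k: int,
                    max_mismatches: int) -> List[Tuple[int, str, int]]:
    """Seed with non-overlapping k-mers, then verify by Hamming distance."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)

    hits = set()
    for offset in range(0, len(peptide) - k + 1, k):
        for pos in index.get(peptide[offset:offset + k], []):
            start = pos - offset
            if start < 0 or start + len(peptide) > len(sequence):
                continue  # peptide would run off either end of the sequence
            window = sequence[start:start + len(peptide)]
            distance = sum(a != b for a, b in zip(peptide, window))
            if distance <= max_mismatches:
                hits.add((start, window, distance))
    return sorted(hits)


# Toy example: the query differs from the target window at two positions
sequence = "GSHMKTAYLLDLHSYLAAQR"
query = "KTAYLADLHTYL"
k = max_seed_k(len(query), 2)
found = mismatch_search(query, sequence, k, 2)
print(k, found)
```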

3. Best Match

The best_match mode automatically finds the optimal match for each peptide, trying different k-mer sizes and mismatch thresholds. No manual preprocessing is required.

from pepmatch import Matcher

matcher = Matcher(
  query='queries/milk-peptides-test.fasta',
  proteome_file='proteomes/human.fasta',
  best_match=True
)

results_df = matcher.match()
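
Conceptually, a best-match search can be pictured as raising the mismatch threshold step by step and stopping at the first threshold that yields a hit. The brute-force sketch below shows that idea on a toy sequence; it is an assumption-laden illustration with a hypothetical function name, and it does not reproduce how PEPMatch selects k-mer sizes internally.

```python
from typing import List, Tuple


def best_match(peptide: str, sequence: str) -> List[Tuple[int, str, int]]:
    """Return all windows at the lowest mismatch count found for the peptide."""
    n = len(peptide)
    windows = [(i, sequence[i:i + n]) for i in range(len(sequence) - n + 1)]
    for threshold in range(n + 1):  # 0 mismatches first, then 1, 2, ...
        hits = [
            (i, w, threshold)
            for i, w in windows
            if sum(a != b for a, b in zip(peptide, w)) == threshold
        ]
        if hits:
            return hits
    return []  # peptide is longer than the sequence


toy_sequence = "GSHMKTAYLLDLHSYL"
exact = best_match("MKTAYLL", toy_sequence)
one_off = best_match("MKTWYLL", toy_sequence)
print(exact, one_off)
```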

4. Parallel Processing

Use the ParallelMatcher class to run searches on multiple CPU cores. The n_jobs parameter specifies the number of cores to use.

from pepmatch import Preprocessor, ParallelMatcher

# Preprocessing is the same
Preprocessor('proteomes/betacoronaviruses.fasta').pickle_proteome(k=3)

# Use ParallelMatcher to search with 4 jobs
parallel_matcher = ParallelMatcher(
  query='queries/coronavirus-test.fasta',
  proteome_file='proteomes/betacoronaviruses.fasta',
  max_mismatches=3,
  k=3,
  n_jobs=4
)

results_df = parallel_matcher.match()
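
The general pattern behind splitting work across n_jobs is to divide the query list into chunks, search each chunk independently, and concatenate the results. The sketch below shows that pattern with hypothetical helpers. It uses a thread pool purely as a lightweight stand-in so the example runs anywhere; CPU-bound searching would normally use separate processes, and nothing here reflects ParallelMatcher's internals.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def chunk(items: List[str], n_jobs: int) -> List[List[str]]:
    """Split items into at most n_jobs contiguous chunks of near-equal size."""
    if not items:
        return []
    size = -(-len(items) // n_jobs)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def search_chunk(peptides: List[str], sequence: str) -> List[Tuple[str, int]]:
    """Toy exact search: 0-based position of each peptide, or -1 if absent."""
    return [(p, sequence.find(p)) for p in peptides]


def parallel_search(peptides: List[str], sequence: str, n_jobs: int) -> List[Tuple[str, int]]:
    chunks = chunk(peptides, n_jobs)
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        parts = pool.map(search_chunk, chunks, [sequence] * len(chunks))
        return [hit for part in parts for hit in part]


queries = ["MKTA", "LLDL", "HSYL", "AAAA", "GSHM"]
results = parallel_search(queries, "GSHMKTAYLLDLHSYL", n_jobs=2)
print(results)
```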

5. Discontinuous Epitope Searching

PEPMatch can search for epitopes defined by non-contiguous residues and their positions. Simply provide a query list where each item is a string in the format "A1, B10, C15".

from pepmatch import Matcher

# A list of discontinuous epitopes to find
discontinuous_query = [
  "R377, Q408, Q432, H433, F436",
  "S2760, V2763, E2773, D2805, T2819"
]

matcher = Matcher(
  query=discontinuous_query,
  proteome_file='proteomes/sars-cov-2.fasta',
  max_mismatches=1  # Allow 1 mismatch among the specified residues
)

results_df = matcher.match()
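
To show how a query string in this format can be interpreted, the sketch below parses each "residue + position" token and counts how many of the listed residues disagree with a given sequence. It assumes positions are 1-based, which is the usual convention for protein numbering; the helper names and the toy sequence are hypothetical.

```python
from typing import List, Tuple


def parse_discontinuous(epitope: str) -> List[Tuple[str, int]]:
    """Turn 'R377, Q408' into [('R', 377), ('Q', 408)]."""
    residues = []
    for token in epitope.split(","):
        token = token.strip()
        residues.append((token[0], int(token[1:])))
    return residues


def count_mismatches(residues: List[Tuple[str, int]], sequence: str) -> int:
    """Count listed residues that are absent at their 1-based position."""
    mismatches = 0
    for amino_acid, position in residues:
        out_of_range = position < 1 or position > len(sequence)
        if out_of_range or sequence[position - 1] != amino_acid:
            mismatches += 1
    return mismatches


toy_sequence = "MKTAYIAKQR"
parsed = parse_discontinuous("K2, Y5, Q9")
print(parsed, count_mismatches(parsed, toy_sequence))
print(count_mismatches(parse_discontinuous("K2, W5, Q9"), toy_sequence))
```

With max_mismatches=1, the second toy epitope would still count as a match under this model, while one with two disagreeing residues would not.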

Output Formats

You can specify the output format using the output_format parameter in the Matcher or ParallelMatcher.

dataframe (default for API): Returns a Polars DataFrame.
csv (default for CLI): Saves results to a CSV file.
tsv: Saves results to a TSV file.
xlsx: Saves results to an Excel file.
json: Saves results to a JSON file.

To receive a DataFrame from the API, you can either omit the output_format parameter or set it explicitly:

# The match() method will return a Polars DataFrame
df = Matcher(
  'queries/neoepitopes-test.fasta',
  'proteomes/human.fasta',
  max_mismatches=3,
  k=3,
  output_format='dataframe' # Explicitly request a DataFrame
).match()

print(df.head())
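
The text-based formats differ only in how rows are serialized. The standard-library sketch below makes that concrete for csv, tsv, and json. The column names and the serialize helper are hypothetical and are not the columns PEPMatch reports.

```python
import csv
import io
import json
from typing import Dict, List


def serialize(rows: List[Dict[str, object]], output_format: str) -> str:
    """Render result rows as CSV, TSV, or JSON text."""
    if output_format == "json":
        return json.dumps(rows, indent=2)
    delimiters = {"csv": ",", "tsv": "\t"}
    if output_format not in delimiters:
        raise ValueError(f"unsupported format: {output_format}")
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(rows[0]),
        delimiter=delimiters[output_format],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical result row, for illustration only
rows = [{"query": "MKTAY", "protein_id": "P1", "mismatches": 0}]
print(serialize(rows, "csv"))
print(serialize(rows, "tsv"))
print(serialize(rows, "json"))
```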

Citation

If you use PEPMatch in your research, please cite the following paper:

Marrama D, Chronister WD, Westernberg L, et al. PEPMatch: a tool to identify short peptide sequence matches in large sets of proteins. BMC Bioinformatics. 2023;24(1):485. Published 2023 Dec 18. doi:10.1186/s12859-023-05606-4

