
Search tool for peptides and epitopes within a proteome, while considering potential residue substitutions.

Author: Daniel Marrama

PEPMatch is a high-performance Python tool designed to find short peptide sequences within a reference proteome or other large protein sets. It is optimized for speed and flexibility, supporting exact matches, searches with a defined number of residue substitutions (mismatches), and a "best match" mode to find the most likely hit.

We also created a benchmarking framework, with accompanying instructions, as a competition to improve tool performance.

Key Features

  • Versatile Searching: Find exact matches, matches with a specified tolerance for mismatches, or the single best match for each query peptide.
  • Discontinuous Epitope Support: Search for non-contiguous residues in the format "R377, Q408, Q432, ...".
  • High Performance: Utilizes an efficient k-mer indexing strategy for rapid searching. The backend is powered by a C-based Hamming distance calculation for optimized mismatch detection.
  • Optimized Preprocessing: Employs a two-step process. Proteomes are preprocessed once into a format optimized for the search type (SQLite for exact matching, Pickle for mismatching), making subsequent searches extremely fast.
  • Parallel Processing: Built-in support for multicore processing to handle large query sets efficiently.
  • Flexible I/O: Accepts queries from FASTA files or Python lists and can output results to multiple formats, including CSV, TSV, XLSX, JSON, or directly as a Polars DataFrame.
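
To make the mismatch idea concrete, here is a minimal pure-Python sketch of Hamming distance and a sliding-window search built on it. It is an illustration only: PEPMatch's own backend is C-based and uses k-mer indexing, and the function names `hamming_distance` and `find_with_mismatches` are hypothetical, not part of the PEPMatch API.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def find_with_mismatches(peptide: str, protein: str, max_mismatches: int):
    """Return (0-based start, mismatch count) for every window within tolerance.

    Brute-force scan over all windows; an illustrative stand-in, not
    PEPMatch's indexed search.
    """
    n = len(peptide)
    hits = []
    for start in range(len(protein) - n + 1):
        distance = hamming_distance(peptide, protein[start:start + n])
        if distance <= max_mismatches:
            hits.append((start, distance))
    return hits


# Toy sequences, made up for illustration
hits = find_with_mismatches("ACD", "MACDEFACE", max_mismatches=1)
```

With `max_mismatches=0` the same scan reduces to exact matching; raising the tolerance admits windows that differ by substitutions only, since Hamming distance does not model insertions or deletions.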

Requirements

Installation

pip install pepmatch

Core Engine

PEPMatch operates using a two-step workflow:

  1. Preprocessing: First, the target proteome is processed into an indexed format. This step only needs to be performed once per proteome and k-mer size. PEPMatch uses SQLite databases for the speed of indexed lookups in exact matching and serialized Python objects (pickle) for the flexibility needed in mismatch searching.
  2. Matching: The user's query peptides are then searched against the preprocessed proteome.

This design ensures that the time-intensive task of parsing and indexing the proteome is separated from the search itself, allowing for rapid and repeated querying.
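
The two-step idea can be sketched with an in-memory dictionary standing in for the preprocessed database. This is a simplified illustration under stated assumptions, not PEPMatch's actual storage layout: `build_kmer_index` and `exact_matches` are hypothetical names, and the toy sequences are made up.

```python
from collections import defaultdict


def build_kmer_index(proteome: dict, k: int) -> dict:
    """Step 1 (done once): map every k-mer to its (protein ID, 0-based position) sites."""
    index = defaultdict(list)
    for protein_id, seq in proteome.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((protein_id, i))
    return dict(index)


def exact_matches(peptide: str, proteome: dict, index: dict, k: int) -> list:
    """Step 2 (repeated): look up the peptide's first k-mer, then verify the
    full peptide at each candidate site."""
    if len(peptide) < k:
        return []
    hits = []
    for protein_id, pos in index.get(peptide[:k], []):
        if proteome[protein_id].startswith(peptide, pos):
            hits.append((protein_id, pos))
    return hits


# Toy proteome, made up for illustration
proteome = {"P1": "MKTAYIAKQR", "P2": "GGAYIAKWW"}
index = build_kmer_index(proteome, k=3)          # expensive step, run once
hits = exact_matches("AYIAK", proteome, index, k=3)  # cheap step, run per query
```

Because the index depends on `k`, a different k-mer size requires rebuilding it, which mirrors the "once per proteome and k-mer size" note above.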

Command-Line Usage

The tool provides two CLI commands: pepmatch-preprocess and pepmatch-match.

1. Preprocessing

The pepmatch-preprocess command builds the necessary database from your proteome FASTA file.

  • For exact matching (0 mismatches), use the sql format.
  • For mismatch matching, use the pickle format.

# Preprocess for an exact match search using 5-mers
pepmatch-preprocess -p human.fasta -k 5 -f sql

# Preprocess for a mismatch search using 3-mers
pepmatch-preprocess -p human.fasta -k 3 -f pickle

Flags
  • -p, --proteome (Required): Path to the proteome FASTA file.
  • -k, --kmer_size (Required): The k-mer size to use for indexing.
  • -f, --preprocess_format (Required): The format for the preprocessed database (sql or pickle).
  • -n: A custom name for the proteome.
  • -P: Path to the directory to save preprocessed files.
  • -g: Path to a gene priority proteome file (a UniProt-specific file with one protein per gene, used later to prioritize matches).

2. Matching

The pepmatch-match command runs the search against a preprocessed proteome.

# Find exact matches (-m 0) using the preprocessed 5-mer database
pepmatch-match -q peptides.fasta -p human.fasta -m 0 -k 5

# Find matches with up to 3 mismatches (-m 3) using the 3-mer database
pepmatch-match -q neoepitopes.fasta -p human.fasta -m 3 -k 3

Flags
  • -q, --query (Required): Path to the query peptide FASTA file.
  • -p, --proteome_file (Required): Path to the original proteome FASTA file.
  • -m: Maximum number of mismatches allowed (e.g., 0 for exact).
  • -k: The k-mer size to use (must match the preprocessed file).
  • -P: Path to the directory containing preprocessed files.
  • -b: Enable "best match" mode.
  • -f: Output format (csv, tsv, xlsx, json). Defaults to csv.
  • -o: Name of the output file (do not include the file extension, e.g. .csv).
  • -v: Disable sequence versioning (e.g. for protein ID P05067.1, the ".1" will be removed).
  • -n: Number of parallel processing jobs (CPU cores) to use.
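
The effect described for the -v flag can be pictured with a short sketch. The helper `strip_version` is a hypothetical name used for illustration; it only mirrors the documented example (P05067.1 becoming P05067) and is not taken from PEPMatch's source.

```python
def strip_version(protein_id: str) -> str:
    """Drop a trailing numeric version suffix such as '.1' from a protein ID."""
    base, sep, suffix = protein_id.rpartition(".")
    # Only strip when there is a dot followed by digits; otherwise keep the ID as-is
    return base if sep and suffix.isdigit() else protein_id


unversioned = strip_version("P05067.1")
```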

Python API Usage

For more control and integration into other workflows, PEPMatch provides a simple Python API.

1. Exact Matching

from pepmatch import Preprocessor, Matcher

# Preprocess the proteome into a SQLite DB for exact matching
Preprocessor('proteomes/human.fasta').sql_proteome(k=5)

# Initialize the Matcher for an exact search (0 mismatches)
matcher = Matcher(
  query='queries/mhc-ligands-test.fasta',
  proteome_file='proteomes/human.fasta',
  max_mismatches=0,
  k=5
)

# Run the search and get results
results_df = matcher.match()

2. Mismatching

from pepmatch import Preprocessor, Matcher

# Preprocess the proteome into pickle files for mismatching
Preprocessor('proteomes/human.fasta').pickle_proteome(k=3)

# Initialize the Matcher to allow up to 3 mismatches
matcher = Matcher(
  query='queries/neoepitopes-test.fasta',
  proteome_file='proteomes/human.fasta',
  max_mismatches=3,
  k=3
)

results_df = matcher.match()

3. Best Match

The best_match mode automatically finds the optimal match for each peptide, trying different k-mer sizes and mismatch thresholds. No manual preprocessing is required.

from pepmatch import Matcher

matcher = Matcher(
  query='queries/milk-peptides-test.fasta',
  proteome_file='proteomes/human.fasta',
  best_match=True
)

results_df = matcher.match()
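
Conceptually, a best match is the proteome window that differs from the query at the fewest positions. The sketch below finds it by brute force. It is a deliberately simple stand-in, assuming substitutions only, and does not reproduce PEPMatch's strategy of trying different k-mer sizes and mismatch thresholds. The name `best_match_bruteforce` and the toy sequences are hypothetical.

```python
def best_match_bruteforce(peptide: str, proteome: dict):
    """Return (protein ID, 0-based start, mismatches) for the closest window,
    or None if no protein is long enough to hold the peptide."""
    n = len(peptide)
    best = None
    for protein_id, seq in proteome.items():
        for start in range(len(seq) - n + 1):
            window = seq[start:start + n]
            mismatches = sum(a != b for a, b in zip(peptide, window))
            # Strict '<' keeps the first window found when several tie
            if best is None or mismatches < best[2]:
                best = (protein_id, start, mismatches)
    return best


# Toy proteome, made up for illustration
proteome = {"P1": "MKTAYIAKQR", "P2": "GGAYLAKWW"}
exact_hit = best_match_bruteforce("AYLAK", proteome)
near_hit = best_match_bruteforce("AYIAQ", proteome)
```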

4. Parallel Processing

Use the ParallelMatcher class to run searches on multiple CPU cores. The n_jobs parameter specifies the number of cores to use.

from pepmatch import Preprocessor, ParallelMatcher

# Preprocessing is the same
Preprocessor('proteomes/betacoronaviruses.fasta').pickle_proteome(k=3)

# Use ParallelMatcher to search with 4 jobs
parallel_matcher = ParallelMatcher(
  query='queries/coronavirus-test.fasta',
  proteome_file='proteomes/betacoronaviruses.fasta',
  max_mismatches=3,
  k=3,
  n_jobs=4
)

results_df = parallel_matcher.match()

5. Discontinuous Epitope Searching

PEPMatch can search for epitopes defined by non-contiguous residues and their positions. Simply provide a query list where each item is a string in the format "A1, B10, C15".

from pepmatch import Matcher

# A list of discontinuous epitopes to find
discontinuous_query = [
  "R377, Q408, Q432, H433, F436",
  "S2760, V2763, E2773, D2805, T2819"
]

matcher = Matcher(
  query=discontinuous_query,
  proteome_file='proteomes/sars-cov-2.fasta',
  max_mismatches=1  # Allow 1 mismatch among the specified residues
)

results_df = matcher.match()
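
To show what a discontinuous query encodes, the sketch below parses a string such as "R377, Q408" into (residue, position) pairs and counts how many of them disagree with a given protein sequence. It assumes positions are 1-based, which the text above does not state explicitly. The helpers `parse_discontinuous` and `count_mismatches` are hypothetical and not part of the PEPMatch API.

```python
def parse_discontinuous(epitope: str) -> list:
    """Parse 'R377, Q408' into [('R', 377), ('Q', 408)]."""
    residues = []
    for token in epitope.split(","):
        token = token.strip()
        residues.append((token[0], int(token[1:])))
    return residues


def count_mismatches(residues: list, protein_seq: str) -> int:
    """Count residues that differ from the sequence at their stated position.

    Assumes 1-based positions; a position beyond the end of the sequence
    is counted as a mismatch.
    """
    mismatches = 0
    for amino_acid, position in residues:
        if position > len(protein_seq) or protein_seq[position - 1] != amino_acid:
            mismatches += 1
    return mismatches


parsed = parse_discontinuous("R377, Q408, Q432, H433, F436")

# Toy sequence, made up for illustration
toy_epitope = parse_discontinuous("A1, C3, E5")
n_mismatches = count_mismatches(toy_epitope, "ABCDF")
```

A query would then count as a hit for a protein when the mismatch count does not exceed max_mismatches.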

Output Formats

You can specify the output format using the output_format parameter in the Matcher or ParallelMatcher.

  • dataframe (default for API): Returns a Polars DataFrame.
  • csv (default for CLI): Saves results to a CSV file.
  • tsv: Saves results to a TSV file.
  • xlsx: Saves results to an Excel file.
  • json: Saves results to a JSON file.

To receive a DataFrame from the API, you can either omit the output_format parameter or set it explicitly:

# The match() method will return a Polars DataFrame
df = Matcher(
  'queries/neoepitopes-test.fasta',
  'proteomes/human.fasta',
  max_mismatches=3,
  k=3,
  output_format='dataframe' # Explicitly request a DataFrame
).match()

print(df.head())

Citation

If you use PEPMatch in your research, please cite the following paper:

Marrama D, Chronister WD, Westernberg L, et al. PEPMatch: a tool to identify short peptide sequence matches in large sets of proteins. BMC Bioinformatics. 2023;24(1):485. Published 2023 Dec 18. doi:10.1186/s12859-023-05606-4
