DarkProfiler: Alignment and Classification of Peptides from Reference-Independent De Novo Peptide Sequencing Experiments.

DarkProfiler takes peptide sequences (e.g., from reference‑independent de novo peptide sequencing) and classifies them into distinct categories using reference genomes and optional sample‑specific SNVs:

  • Canonical proteome
  • Alternative splicing
  • Neoantigens (SNV‑derived mutanome)
  • Alternative reading frame peptides
  • Unknown / unaligned

DarkProfiler is intended to be the post‑processing / annotation step after de novo peptide calling or customized peptide discovery in proteomics or immunopeptidomics experiments.

Supported reference assemblies:

  • Human: hg19 (GENCODE release 19), hg38 (GENCODE release 37)
  • Mouse: mm10 (GENCODE release M19), mm39 (GENCODE release M37)

The same logic is available both as a command‑line tool and as a Python API.


Table of contents

  1. Installation
  2. Reference genome data
  3. Input data
  4. Command‑line usage
  5. Python API
  6. Classification pipeline details
  7. Outputs
  8. Database reuse and performance tips
  9. Troubleshooting
  10. License
  11. Citation

Installation

Requirements

  • Python: 3.7+ (tested on modern CPython versions)
  • Operating systems: Linux, macOS, and other UNIX‑like systems should work. On Windows, running under WSL is recommended.
  • Python dependencies (installed automatically via pip/conda):
    • Biopython (FASTA parsing and sequence utilities)
    • matplotlib (for pieChart.pdf)
    • Standard library modules only otherwise

You also need sufficient disk space to store:

  • A reference genome bundle per assembly (hundreds of MB)
  • The database directory (translated proteomes + fast indices) per output folder
  • The final classification FASTA files and plots

Install with pip (PyPI)

pip install darkprofiler

This installs:

  • The Python package darkprofiler
  • The command‑line entry point darkprofiler

You should then be able to run:

darkprofiler --help

Install with conda (bioconda)

conda install bioconda::darkprofiler

This will install DarkProfiler together with all dependencies into the active conda environment.


Reference genome data

Supported references

DarkProfiler currently supports human and mouse reference assemblies, each paired with a specific GENCODE release:

  • hg19 (GENCODE release 19)
  • hg38 (GENCODE release 37)
  • mm10 (GENCODE release M19)
  • mm39 (GENCODE release M37)

The reference is always specified by one of the lower‑case strings:

  • hg19
  • hg38
  • mm10
  • mm39

Internally the reference is normalized to lower case, so HG38 and hg38 are treated the same in the Python API, but the CLI restricts choices to the canonical lower‑case names.
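
As an illustration of the normalization described above, here is a minimal sketch. The helper name normalize_reference and the exact error message are assumptions made for this example and are not taken from the DarkProfiler source.

```python
# Hypothetical helper: illustrates case-insensitive reference handling.
SUPPORTED_REFERENCES = ("hg19", "hg38", "mm10", "mm39")


def normalize_reference(reference: str) -> str:
    """Lower-case a reference name and check it against the supported set."""
    name = reference.lower()
    if name not in SUPPORTED_REFERENCES:
        supported = ", ".join(SUPPORTED_REFERENCES)
        raise ValueError(f"Unsupported reference {reference!r}; expected one of: {supported}")
    return name


if __name__ == "__main__":
    print(normalize_reference("HG38"))
```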

What gets downloaded

Reference data are distributed as versioned ZIP bundles hosted online. You do not need to download or unpack them manually. Use:

darkprofiler download hg38

This will:

  1. Check that the requested reference is supported.

  2. Download a file named like darkprofiler_hg38.zip to the installed package directory under darkprofiler/genome/.

  3. Extract the contents to:

    <python-site-packages>/darkprofiler/genome/hg38/
    
  4. Print progress messages such as:

    [darkprofiler] Downloading ...
    [darkprofiler] Extracting to ...
    [darkprofiler] Finished. Reference 'hg38' is now available.
    

The extracted directory contains at least the following files (names may include version tags):

  • transcriptome.<reference>.fa – all reference transcripts (FASTA)
  • transcriptome.<reference>.cds.bed – CDS segments per transcript
  • knownCanonical.<reference>.list – list of canonical transcript IDs
  • gencode.<reference>.gff – GENCODE annotation (GFF/GTF‑like)
  • exome.<reference>.bed – exome intervals used to filter SNVs

These files are used internally by the pipeline; you normally don’t need to interact with them directly.

Note: If the download step has not been run for a given reference, darkprofiler run will fail with an error such as “Could not find file ... in genome root”.


Input data

Peptide FASTA

The primary input is a FASTA file containing peptide sequences to classify:

>peptide_1
LLLLGIGGTFK
>peptide_2
EAVAEQAALR
...

Requirements and recommendations:

  • Each record is interpreted as a peptide (amino‑acid sequence).
  • FASTA IDs are kept as‑is and propagated to the output files.
  • Sequences are upper‑cased internally; non‑standard characters are not specially treated.
  • Empty sequences are silently ignored.
  • There is no hard limit on peptide length, but short peptides may match many locations, and very long peptides rarely match at all.
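
The rules above (IDs kept as-is, sequences upper-cased, empty records skipped) can be sketched with the standard library alone. This is only an illustration of the stated behaviour; the function name read_peptides is hypothetical, and DarkProfiler itself uses Biopython for FASTA parsing.

```python
from typing import Iterable, List, Optional, Tuple


def _store(records: List[Tuple[str, str]], record_id: Optional[str], chunks: List[str]) -> None:
    """Append (id, SEQUENCE) unless there is no record yet or the sequence is empty."""
    sequence = "".join(chunks).upper()
    if record_id is not None and sequence:
        records.append((record_id, sequence))


def read_peptides(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse FASTA lines into (id, peptide) pairs, skipping empty sequences."""
    records: List[Tuple[str, str]] = []
    record_id: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            _store(records, record_id, chunks)  # close the previous record
            record_id, chunks = line[1:], []    # header text is kept unchanged
        elif line:
            chunks.append(line)                 # sequences may span several lines
    _store(records, record_id, chunks)          # close the last record
    return records


if __name__ == "__main__":
    example = [">peptide_1", "llllgiggtfk", ">empty", "", ">peptide_2", "EAVAEQAALR"]
    for record_id, peptide in read_peptides(example):
        print(record_id, peptide)
```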

Each peptide sequence is assigned to at most one output category at the chosen Hamming distance: the first category that matches in pipeline order (canonical → alternative splicing → neoantigen → alternative reading frame → unknown).

VCF with SNVs (optional)

To classify neoantigens (peptides derived from sample‑specific single nucleotide variants), you can provide a VCF file via --vcf-path / vcf_path:

  • Accepts plain or gzipped VCF: *.vcf or *.vcf.gz.
  • Only SNVs (single‑base reference and single‑base alternate) are used.
  • Multi‑allelic entries are expanded and processed per ALT allele.
  • Non‑SNV variants (indels, MNVs, etc.) are ignored.
  • Chromosome names are normalized by stripping the chr prefix (e.g., chr11 becomes 11) before coordinates are matched to the reference.
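
The SNV selection rules listed above can be sketched as follows. This is a hedged illustration rather than DarkProfiler's own parser: the names open_vcf and iter_snvs are hypothetical, and restricting alleles to A, C, G, and T is an assumption of this example.

```python
import gzip
from typing import Iterable, Iterator, Tuple

Snv = Tuple[str, int, str, str]  # (chromosome, 1-based position, ref, alt)


def open_vcf(path: str):
    """Open a plain or gzipped VCF in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def iter_snvs(lines: Iterable[str]) -> Iterator[Snv]:
    """Yield one SNV per ALT allele, ignoring indels, MNVs, and symbolic alleles."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines, and the column header
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a complete VCF data line
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3].upper(), fields[4].upper()
        if chrom.lower().startswith("chr"):
            chrom = chrom[3:]  # chr11 -> 11
        if len(ref) != 1 or ref not in "ACGT":
            continue  # the reference allele must be a single base
        for alt in alts.split(","):  # multi-allelic entries: one SNV per ALT
            if len(alt) == 1 and alt in "ACGT" and alt != ref:
                yield chrom, pos, ref, alt


if __name__ == "__main__":
    demo = ["#CHROM\tPOS\tID\tREF\tALT\n", "chr11\t100\t.\tA\tG,T\n", "2\t200\t.\tAT\tA\n"]
    print(list(iter_snvs(demo)))
```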

DarkProfiler additionally filters SNVs to the coding exome using the exome.<reference>.bed file if present:

  • Only SNVs whose positions overlap the exome intervals are retained.
  • If no exome BED is available, all SNVs are accepted.
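
A minimal sketch of such an exome filter is shown below. It assumes standard BED conventions (0-based, half-open intervals) against 1-based VCF positions; the function names are hypothetical and the real implementation may differ.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Intervals = Dict[str, List[Tuple[int, int]]]


def load_bed_intervals(lines: Iterable[str]) -> Intervals:
    """Read BED lines into sorted, merged (start, end) intervals per chromosome."""
    by_chrom = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or fields[0].startswith(("#", "track", "browser")):
            continue
        chrom = fields[0][3:] if fields[0].lower().startswith("chr") else fields[0]
        by_chrom[chrom].append((int(fields[1]), int(fields[2])))
    merged: Intervals = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        out: List[Tuple[int, int]] = []
        for start, end in intervals:
            if out and start <= out[-1][1]:
                out[-1] = (out[-1][0], max(out[-1][1], end))  # extend overlapping interval
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


def in_exome(merged: Intervals, chrom: str, pos: int) -> bool:
    """True if the 1-based position falls inside a 0-based, half-open BED interval."""
    intervals = merged.get(chrom, [])
    i = bisect_right(intervals, (pos - 1, float("inf"))) - 1  # last interval starting at or before pos-1
    return i >= 0 and intervals[i][0] <= pos - 1 < intervals[i][1]


def filter_snvs(snvs, merged: Intervals):
    """Keep SNVs overlapping the exome; with no intervals at all, accept everything."""
    if not merged:
        return list(snvs)
    return [snv for snv in snvs if in_exome(merged, snv[0], snv[1])]
```

Merging overlapping intervals first means a single binary search per SNV is enough to decide membership.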

If vcf_path is omitted or points to a non‑existing file:

  • The SNV list is empty.
  • The mutanome and neoantigen steps still run, but the mutanome then represents the unmodified reference sequence.

Precomputed database directory (optional)

By default, each darkprofiler run invocation builds a database in:

<output_dir>/database/

The database contains translated and derived proteomes as FASTA files:

  • canonicalProteome.fa
  • alternativeSplicing.fa
  • mutanome.fa
  • mutatedCanonicalTranscriptome.fa
  • mutatedAlternativeTranslatome.fa

DarkProfiler also creates persistent fast indices under the same database directory to accelerate peptide search with Hamming distance, for example:

  • canonicalProteome.idx/
  • alternativeSplicing.idx/
  • mutanome.idx/
  • mutatedAlternativeORFeome.idx/

If you run DarkProfiler repeatedly with the same reference and SNV set, you can re‑use a prebuilt database to avoid recomputation by passing --database-path / database_path:

darkprofiler run hg38 peptides.fa out --database-path prebuilt_db/

The directory is accepted only if all required files are present. Otherwise:

  • DarkProfiler prints a warning that the directory is missing files or is invalid.
  • The directory is ignored.
  • A new database is built from scratch under <output_dir>/database.
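
The acceptance rule above can be sketched as a simple pre-flight check. This example assumes the required set is exactly the five FASTA files listed earlier; the helper names and the warning text are hypothetical.

```python
from pathlib import Path
from typing import List, Optional, Union

PathLike = Union[str, Path]

# Assumption: the required files are the five database FASTA files listed above.
REQUIRED_FASTA = (
    "canonicalProteome.fa",
    "alternativeSplicing.fa",
    "mutanome.fa",
    "mutatedCanonicalTranscriptome.fa",
    "mutatedAlternativeTranslatome.fa",
)


def missing_database_files(database_dir: PathLike) -> List[str]:
    """Return the required FASTA names that are absent from database_dir."""
    root = Path(database_dir)
    return [name for name in REQUIRED_FASTA if not (root / name).is_file()]


def choose_database_dir(database_path: Optional[PathLike], output_dir: PathLike) -> Path:
    """Use a prebuilt database only if it is complete; otherwise fall back to <output_dir>/database."""
    fallback = Path(output_dir) / "database"
    if database_path is None:
        return fallback
    missing = missing_database_files(database_path)
    if missing:
        print(f"warning: {database_path} is missing {', '.join(missing)}; rebuilding under {fallback}")
        return fallback
    return Path(database_path)
```

Running a check like this before a long batch job shows early whether a prebuilt directory will be reused or silently rebuilt.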

Command‑line usage

The installed CLI is called darkprofiler.

Run darkprofiler --help to see the top‑level usage:

usage: darkprofiler [-h] {download,run} ...

Two subcommands are available:

download subcommand

darkprofiler download hg38

run subcommand

darkprofiler run hg38 peptides.fa output_dir \
  --vcf-path sample.vcf.gz \
  --database-path /path/to/database \
  --num-threads 8 \
  --hamming 2

Optional arguments

  • --vcf-path FILE

    Optional path to a VCF or VCF.GZ file with SNVs.

  • --database-path DIR

    Optional path to an existing database directory containing the required FASTA files listed above.

  • --num-threads N (default: 1)

    Number of worker threads used during peptide search / verification.

  • -k, --hamming {0,1,2} (default: 0)

    Maximum Hamming distance allowed for peptide matching.
    0 performs exact matches only; 1 and 2 allow up to one or two amino‑acid substitutions.
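
To make the meaning of -k concrete, the following sketch counts substitutions between a peptide and every same-length window of a reference protein. It is a naive scan written for illustration only (the function names are hypothetical); DarkProfiler uses indexed search instead.

```python
from typing import List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def find_matches(peptide: str, protein: str, max_mismatches: int = 0) -> List[Tuple[int, str]]:
    """Return (0-based start, reference window) for every window within max_mismatches."""
    n = len(peptide)
    hits = []
    if n == 0:
        return hits
    for start in range(len(protein) - n + 1):
        window = protein[start:start + n]
        if hamming(peptide, window) <= max_mismatches:
            hits.append((start, window))
    return hits


if __name__ == "__main__":
    protein = "MKTGILGFVFTLAA"
    print(find_matches("GILGAVFTL", protein, max_mismatches=0))
    print(find_matches("GILGAVFTL", protein, max_mismatches=1))
```

Insertions and deletions are not covered by Hamming distance, so a peptide that differs from the reference by a length change will not match at any value of -k.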


Python API

from darkprofiler.run import classify_peptides

classify_peptides(
    reference="hg38",
    peptide_fasta="peptides.fa",
    output_dir="output",
    vcf_path=None,
    database_path=None,
    num_threads=4,
    hamming_distance=0,
)
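
For several samples, the call above can be wrapped in a small loop. The sketch below is an assumption-laden convenience wrapper, not part of DarkProfiler: run_batch is a hypothetical name, and the classification function is passed in as an argument so that the wrapper only relies on the keyword arguments shown above.

```python
from pathlib import Path
from typing import Callable, Dict, Optional, Tuple


def run_batch(
    classify: Callable[..., object],
    samples: Dict[str, Tuple[str, Optional[str]]],
    reference: str,
    out_root: str,
    num_threads: int = 1,
    hamming_distance: int = 0,
) -> Dict[str, str]:
    """Call classify once per sample and return {sample name: output directory}."""
    output_dirs = {}
    for name, (peptide_fasta, vcf_path) in samples.items():
        output_dir = str(Path(out_root) / name)
        classify(
            reference=reference,
            peptide_fasta=peptide_fasta,
            output_dir=output_dir,
            vcf_path=vcf_path,
            database_path=None,  # SNVs differ per sample, so each run builds its own database
            num_threads=num_threads,
            hamming_distance=hamming_distance,
        )
        output_dirs[name] = output_dir
    return output_dirs


# Example usage (requires DarkProfiler and a downloaded reference):
#   from darkprofiler.run import classify_peptides
#   run_batch(classify_peptides, {"s1": ("s1.fa", "s1.vcf.gz")}, "hg38", "results")
```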

Classification pipeline details

Overview of steps

  1. Filter VCF to exome
  2. Load transcriptome, CDS annotations, canonical transcript list
  3. Build canonical / non‑canonical transcript sets
  4. Build canonical proteome (CDS must start with ATG) and classify peptides
  5. Build alternative splicing proteome (CDS must start with ATG) and classify peptides
  6. Apply SNVs, build mutanome (CDS must start with ATG) and classify peptides
  7. Build alternative ORFs (3 frames) and classify peptides
  8. Write unaligned peptides and summary plots
  9. Finalize
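
The first-match-wins logic of steps 4 to 8 can be sketched as a cascade in which each category only sees peptides that earlier categories did not claim. The matcher callables here are placeholders supplied by the caller; how DarkProfiler actually searches each proteome is not shown.

```python
from typing import Callable, Dict, Iterable, List

# Category order as described in the pipeline; "unknown" collects what is left.
CATEGORY_ORDER = (
    "canonicalProteome",
    "alternativeSplicing",
    "neoantigen",
    "alternativeReadingFrame",
)


def classify_cascade(
    peptides: Iterable[str],
    matchers: Dict[str, Callable[[str], bool]],
) -> Dict[str, List[str]]:
    """Assign each peptide to the first category whose matcher accepts it."""
    assigned: Dict[str, List[str]] = {name: [] for name in CATEGORY_ORDER}
    remaining = list(peptides)
    for name in CATEGORY_ORDER:
        matcher = matchers[name]
        unmatched = []
        for peptide in remaining:
            if matcher(peptide):
                assigned[name].append(peptide)
            else:
                unmatched.append(peptide)  # passed on to the next category
        remaining = unmatched
    assigned["unknown"] = remaining
    return assigned
```

Because matched peptides are removed before the next step, a peptide found in both the canonical proteome and an alternative reading frame is reported only as canonical.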

Category definitions

  • CDS translation filter (ATG)
    For CDS‑based proteomes (canonical proteome, alternative splicing, mutanome), CDS translations are included only when the CDS begins with ATG. This reduces false positives from incomplete or mis‑annotated CDS records.

  • ORF region labels
    For alternative ORF hits, DarkProfiler labels the peptide start as:

    • uORF (upstream of CDS start)
    • intORF (out-of-frame peptides from inside the annotated CDS span)
    • dORF (downstream of CDS end)
    • lncRNA (no CDS annotation)
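
Both definitions above can be sketched in a few lines. This is a hedged illustration: the function names are hypothetical, translation stopping at the first stop codon is an assumption, and the exact boundary handling of the region labels (1-based, inclusive CDS coordinates) is also assumed rather than taken from the source.

```python
from itertools import product
from typing import Optional

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product("TCAG", repeat=3), _AMINO_ACIDS)}


def translate(nucleotides: str) -> str:
    """Translate in frame 0 until the first stop codon; unknown codons become X."""
    seq = nucleotides.upper()
    residues = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")
        if aa == "*":
            break
        residues.append(aa)
    return "".join(residues)


def translate_cds(cds: str) -> Optional[str]:
    """Return the protein for a CDS that begins with ATG, otherwise None."""
    if not cds.upper().startswith("ATG"):
        return None  # incomplete or mis-annotated CDS records are skipped
    return translate(cds)


def label_orf_region(start: int, cds_start: Optional[int] = None, cds_end: Optional[int] = None) -> str:
    """Label a peptide start position on a transcript relative to its annotated CDS."""
    if cds_start is None or cds_end is None:
        return "lncRNA"  # transcript has no CDS annotation
    if start < cds_start:
        return "uORF"
    if start > cds_end:
        return "dORF"
    return "intORF"
```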

Outputs

All outputs live in the specified output_dir.

FASTA category files

Each category is represented by a separate FASTA file in output_dir:

  • canonicalProteome.fa
  • alternativeSplicing.fa
  • neoantigen.fa
  • alternativeReadingFrame.fa
  • unknown.fa

For classification FASTAs (all except unknown.fa), each record uses:

> referencePeptide | TranscriptID | nucleotide coordinate on transcript | uORF/intORF/dORF/lncRNA/CDS
queryPeptide

  • referencePeptide: matched reference peptide sequence (substring from the reference proteome/ORF; same length as the query)
  • TranscriptID: transcript identifier (for alternative ORFs, this is the underlying transcript)
  • nucleotide coordinate on transcript: 1‑based transcript coordinate of the peptide start codon (frame‑aware for alternative ORFs)
  • uORF/intORF/dORF/lncRNA/CDS:
    • CDS for canonical proteome / alternative splicing / neoantigen hits
    • uORF, intORF, dORF, lncRNA for alternative ORF hits

Example:

> GILGFVFTL | ENST00000335137.4 | 1234 | CDS
GILGFVFTL
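
For downstream analysis, a record in this layout can be split back into its fields. The sketch below assumes exactly four pipe-separated header fields as described above; the Hit type and parse_hit function are hypothetical names for this example.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    reference_peptide: str
    transcript_id: str
    position: int  # 1-based transcript coordinate of the peptide start codon
    region: str    # CDS, uORF, intORF, dORF, or lncRNA
    query_peptide: str
    mismatches: int


def parse_hit(header: str, sequence: str) -> Hit:
    """Parse one classification record (header line plus sequence line)."""
    fields = [field.strip() for field in header.lstrip(">").split("|")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 header fields, found {len(fields)}: {header!r}")
    reference_peptide, transcript_id, position, region = fields
    query = sequence.strip()
    # Reference and query have the same length, so differences are substitutions.
    mismatches = sum(a != b for a, b in zip(reference_peptide, query))
    return Hit(reference_peptide, transcript_id, int(position), region, query, mismatches)


if __name__ == "__main__":
    print(parse_hit("> GILGFVFTL | ENST00000335137.4 | 1234 | CDS", "GILGFVFTL"))
```

Comparing referencePeptide with queryPeptide in this way recovers how many substitutions were needed for each hit when running with -k 1 or -k 2.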

unknown.fa uses the original peptide IDs and sequences without additional fields.

pieChart.tsv

A tab‑separated summary file with one line per category:

Category    Count
canonical   123
alternativeSplicing 45
neoantigen  7
alternativeReadingFrame 32
unknown     83
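
The summary table is easy to post-process. The following sketch converts the counts to fractions, assuming a single header line and tab-separated columns as described; the function name is hypothetical.

```python
import csv
from typing import Dict, Iterable


def category_fractions(lines: Iterable[str]) -> Dict[str, float]:
    """Read 'Category<TAB>Count' lines and return each category's share of the total."""
    reader = csv.reader(lines, delimiter="\t")
    next(reader, None)  # skip the header line
    counts = {row[0]: int(row[1]) for row in reader if len(row) >= 2}
    total = sum(counts.values())
    return {name: (count / total if total else 0.0) for name, count in counts.items()}


if __name__ == "__main__":
    # In practice the lines would come from open("output/pieChart.tsv").
    demo = ["Category\tCount\n", "canonical\t123\n", "unknown\t83\n"]
    for name, fraction in category_fractions(demo).items():
        print(f"{name}\t{fraction:.1%}")
```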

pieChart.pdf

A pie chart illustrating the fraction of peptides in each category is saved as pieChart.pdf.


Database reuse and performance tips

  • Reuse databases
    Use --database-path to reuse a database directory containing the required FASTA files.

  • Persistent fast indices
    DarkProfiler builds on‑disk indices (*.idx/) for fast peptide lookup with Hamming distance ≤ 2 using a pigeonhole (seed‑and‑verify) strategy. When an index directory exists, it is reused automatically.

  • Multi‑threading
    Increase --num-threads to speed up peptide search / verification on multi‑core machines.
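
The pigeonhole idea behind seed-and-verify search can be sketched as follows: if a peptide matches a reference window with at most k substitutions, then splitting the peptide into k + 1 pieces guarantees that at least one piece matches exactly. The example uses str.find in place of an on-disk index and is not DarkProfiler's implementation; all names are hypothetical.

```python
from typing import List, Tuple


def split_seeds(peptide: str, k: int) -> List[Tuple[int, str]]:
    """Split a peptide into k + 1 contiguous pieces, returned as (offset, piece)."""
    n, parts = len(peptide), k + 1
    bounds = [i * n // parts for i in range(parts + 1)]
    return [(bounds[i], peptide[bounds[i]:bounds[i + 1]]) for i in range(parts)]


def seed_and_verify(peptide: str, protein: str, k: int) -> List[Tuple[int, str]]:
    """Find windows of protein within Hamming distance k of peptide."""
    n = len(peptide)
    if n < k + 1:
        raise ValueError("peptide is too short to be split into k + 1 non-empty seeds")
    candidates = set()
    for offset, seed in split_seeds(peptide, k):
        pos = protein.find(seed)  # seed step: exact lookup of one piece
        while pos != -1:
            start = pos - offset  # implied start of the full peptide
            if 0 <= start <= len(protein) - n:
                candidates.add(start)
            pos = protein.find(seed, pos + 1)
    hits = []
    for start in sorted(candidates):  # verify step: full comparison
        window = protein[start:start + n]
        if sum(a != b for a, b in zip(peptide, window)) <= k:
            hits.append((start, window))
    return hits
```

Only windows that share an exact seed are fully compared, which is why larger values of k (shorter seeds, more candidates) cost more time than exact matching.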


Troubleshooting

Unsupported reference

  • The reference must be one of hg19, hg38, mm10, mm39.

Missing genome files

  • Run darkprofiler download <reference> in the same environment.

Large runtime

  • Increase --num-threads.
  • Use -k/--hamming 0 (exact matching only) when mismatches do not need to be tolerated.
  • Reuse databases and indices between runs.

License

DarkProfiler is released under the MIT License.


Citation

If you use DarkProfiler in a scientific publication, please cite it as:

(Updated citation information will be provided once an associated preprint or manuscript is available.)
