DarkProfiler: Alignment and Classification of Peptides from Reference-Independent De Novo Peptide Sequencing Experiments.

DarkProfiler takes peptide sequences (e.g., from reference‑independent de novo peptide sequencing) and classifies them into distinct categories using reference genomes and optional sample‑specific SNVs:

Canonical proteome
Alternative splicing
Neoantigens (SNV‑derived mutanome)
Alternative reading frame peptides
Unknown / unaligned

DarkProfiler also provides a separate spliced peptide mode for classifying peptides as contiguous reference matches, proteasome‑catalyzed spliced peptides, or unknown peptides against a canonical proteome or user‑provided protein FASTA.

DarkProfiler is intended to be the post‑processing / annotation step after de novo peptide calling or customized peptide discovery in proteomics or immunopeptidomics experiments.

Supported reference assemblies:

Human: hg19 (GENCODE release 19), hg38 (GENCODE release 37)
Mouse: mm10 (GENCODE release M19), mm39 (GENCODE release M37)

The same logic is available both as a command‑line tool and as a Python API.

Installation
Reference genome data
- Supported references
- What gets downloaded
Input data
Command‑line usage
Python API
- Function reference
- Python examples
Classification pipeline details
- Overview of steps
- Category definitions
Outputs
Database reuse and performance tips
Troubleshooting
License
Citation

Installation

Requirements

Python: 3.7+ (tested on modern CPython versions)
Operating systems: Linux, macOS, and other UNIX‑like systems should work. On Windows, running under WSL is recommended.
Python dependencies (installed automatically via pip/conda):
- Biopython (FASTA parsing and sequence utilities)
- matplotlib (for pieChart.pdf)
- Otherwise, only standard library modules

You also need sufficient disk space to store:

A reference genome bundle per assembly (hundreds of MB)
The database directory (translated proteomes + fast indices) per output folder
The final classification FASTA files and plots

Install with pip (PyPI)

pip install darkprofiler

This installs:

The Python package darkprofiler
The command‑line entry point darkprofiler

You should then be able to run:

darkprofiler --help

Install with conda (bioconda)

conda install bioconda::darkprofiler

This will install DarkProfiler together with all dependencies into the active conda environment.

Reference genome data

Supported references

DarkProfiler currently supports human and mouse reference assemblies, each paired with a specific GENCODE release:

hg19 (GENCODE release 19)
hg38 (GENCODE release 37)
mm10 (GENCODE release M19)
mm39 (GENCODE release M37)

The reference is always specified by one of the lower‑case strings:

hg19
hg38
mm10
mm39

Internally the reference is normalized to lower case, so HG38 and hg38 are treated the same in the Python API, but the CLI restricts choices to the canonical lower‑case names.
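
The normalization described above can be pictured with a small helper. This is only a sketch: normalize_reference and SUPPORTED_REFERENCES are hypothetical names chosen for illustration, not part of the DarkProfiler API.

```python
# Hypothetical helper illustrating case-insensitive reference handling.
SUPPORTED_REFERENCES = ("hg19", "hg38", "mm10", "mm39")

def normalize_reference(name: str) -> str:
    """Lower-case a reference name and reject anything unsupported."""
    key = name.strip().lower()
    if key not in SUPPORTED_REFERENCES:
        raise ValueError(
            f"Unsupported reference {name!r}; expected one of {SUPPORTED_REFERENCES}"
        )
    return key
```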

What gets downloaded

Reference data are distributed as versioned ZIP bundles hosted online. You do not need to download or unpack them manually. Use:

darkprofiler download hg38

This will:

Check that the requested reference is supported.
Download a file named like darkprofiler_hg38.zip to the installed package directory under darkprofiler/genome/.

Extract the contents to:

<python-site-packages>/darkprofiler/genome/hg38/

Print progress messages such as:

[darkprofiler] Downloading ...
[darkprofiler] Extracting to ...
[darkprofiler] Finished. Reference 'hg38' is now available.

The extracted directory contains at least the following files (names may include version tags):

transcriptome.<reference>.fa – all reference transcripts (FASTA)
transcriptome.<reference>.cds.bed – CDS segments per transcript
knownCanonical.<reference>.list – list of canonical transcript IDs
gencode.<reference>.gff – GENCODE annotation (GFF/GTF‑like)
exome.<reference>.bed – exome intervals used to filter SNVs

These files are used internally by the pipeline; you normally don’t need to interact with them directly.

Note: If the download step has not been run for a given reference, darkprofiler run will fail with an error such as “Could not find file ... in genome root”.

Input data

Peptide FASTA

The primary input is a FASTA file containing peptide sequences to classify:

>peptide_1
LLLLGIGGTFK
>peptide_2
EAVAEQAALR
...

Requirements and recommendations:

Each record is interpreted as a peptide (amino‑acid sequence).
FASTA IDs are kept as‑is and propagated to the output files.
Sequences are upper‑cased internally; non‑standard characters are not specially treated.
Empty sequences are silently ignored.
There is no hard limit on peptide length, but short peptides may match many locations and very long peptides are unlikely to match.
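
The handling rules above (IDs kept as‑is, sequences upper‑cased, empty records skipped) can be illustrated with a minimal reader. This is a sketch written with the standard library only; read_peptides is a hypothetical helper and not the parser DarkProfiler uses internally.

```python
import io
from typing import Iterator, TextIO, Tuple

def read_peptides(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (id, SEQUENCE) pairs, upper-casing sequences and skipping empty records."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # Emit the previous record only if it had a sequence.
            if name is not None and chunks:
                yield name, "".join(chunks).upper()
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None and chunks:
        yield name, "".join(chunks).upper()

example = io.StringIO(">peptide_1\nllllgiggtfk\n>empty\n>peptide_2\nEAVAEQAALR\n")
records = list(read_peptides(example))
```

Here the record named empty has no sequence and is dropped, while peptide_1 is returned in upper case.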

A peptide sequence is assigned to at most one output category for a given Hamming distance: the first category that matches in the pipeline order (canonical → alternative splicing → neoantigen → alternative reading frame → unknown).
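
The first‑match‑wins ordering can be sketched as a simple cascade. The example below is illustrative only: it uses exact substring matching (Hamming distance 0) on toy sequences, and classify and CATEGORY_ORDER are hypothetical names rather than DarkProfiler internals.

```python
from typing import Dict, List

# Pipeline order as described above; anything unmatched falls through to "unknown".
CATEGORY_ORDER = [
    "canonical",
    "alternativeSplicing",
    "neoantigen",
    "alternativeReadingFrame",
]

def classify(peptide: str, proteomes: Dict[str, List[str]]) -> str:
    """Return the first category whose sequences contain the peptide exactly."""
    for category in CATEGORY_ORDER:
        if any(peptide in protein for protein in proteomes.get(category, [])):
            return category
    return "unknown"

# Toy data, invented for this example.
toy = {
    "canonical": ["MKTAYIAKQR"],
    "alternativeSplicing": ["MKTAYLLLGIGG", "MTAYIAK"],
    "neoantigen": ["MKTAYIVKQR"],
}
```

A peptide such as TAYIAK occurs in both the canonical and the alternative splicing toy sets, but is reported only as canonical because that category is checked first.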

For darkprofiler splice, peptides shorter than 7 or longer than 12 amino acids are written to unknown.fa. Peptides of 7-12 amino acids are searched with two exact pigeonhole seed blocks and no Hamming mismatches. With two blocks over this length range, every pigeonhole seed block is at least 3 amino acids long.
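
The minimum block size follows from simple arithmetic. The sketch below assumes the peptide is cut at the midpoint (floor of half the length); the actual cut position used by DarkProfiler is not documented here, so two_blocks is a hypothetical illustration.

```python
from typing import Tuple

def two_blocks(peptide: str) -> Tuple[str, str]:
    """Split a peptide into two seed blocks at the midpoint (assumed cut position)."""
    mid = len(peptide) // 2
    return peptide[:mid], peptide[mid:]

# Smallest block produced for any peptide length in the supported 7-12 range.
min_block = min(len(block) for n in range(7, 13) for block in two_blocks("A" * n))
```

For a 7‑residue peptide the blocks have lengths 3 and 4, which is the smallest case in the supported range.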

Reference FASTA for spliced peptide mode

The splice subcommand accepts its reference as either:

One of the downloaded reference names: hg19, hg38, mm10, or mm39
A path to an existing protein FASTA file

When a reference name is provided, DarkProfiler builds canonicalProteome.fa using the same canonical transcript and CDS translation strategy used by darkprofiler run: canonical transcript IDs are read from knownCanonical.<reference>.list, CDS intervals are read from transcriptome.<reference>.cds.bed, and translated CDS records are retained only when the CDS starts with ATG.

When a FASTA path is provided, that FASTA is copied into the splice database as canonicalProteome.fa and used directly as the reference proteome.

VCF with SNVs (optional)

To classify neoantigens (peptides derived from sample‑specific single nucleotide variants), you can provide a VCF file via --vcf-path / vcf_path:

Accepts plain or gzipped VCF: *.vcf or *.vcf.gz.
Only SNVs (single‑base reference and single‑base alternate) are used.
Multi‑allelic entries are expanded and processed per ALT allele.
Non‑SNV variants (indels, MNVs, etc.) are ignored.
Coordinates are matched to the reference via chromosome names that are normalized to strip the chr prefix (chr1 → 1).
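
The SNV selection rules above can be sketched as follows. This is a simplified standard‑library illustration, not DarkProfiler's own VCF reader: open_vcf and iter_snvs are hypothetical names, and restricting alleles to A/C/G/T is an assumption of this sketch.

```python
import gzip
from typing import Iterable, Iterator, Tuple

def open_vcf(path: str):
    """Open a plain or gzipped VCF in text mode."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")

def iter_snvs(lines: Iterable[str]) -> Iterator[Tuple[str, int, str, str]]:
    """Yield (chrom, pos, ref, alt) for single-base substitutions only."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3].upper(), fields[4]
        if chrom.startswith("chr"):
            chrom = chrom[3:]  # chr1 -> 1
        if len(ref) != 1 or ref not in "ACGT":
            continue  # indels, MNVs, and other non-SNV records
        for alt in alts.upper().split(","):  # expand multi-allelic entries
            if len(alt) == 1 and alt in "ACGT":
                yield chrom, pos, ref, alt

vcf_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "chr1\t100\t.\tA\tG\n",
    "chr2\t200\t.\tC\tT,G\n",
    "chr3\t300\t.\tAT\tA\n",
    "chr4\t400\t.\tG\tGA\n",
]
snvs = list(iter_snvs(vcf_lines))
```

In this toy input the deletion on chr3 and the insertion on chr4 are skipped, and the multi‑allelic record on chr2 yields two SNVs.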

DarkProfiler additionally filters SNVs to the coding exome using the exome.<reference>.bed file if present:

Only SNVs whose positions overlap the exome intervals are retained.
If no exome BED is available, all SNVs are accepted.
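
A minimal sketch of the exome filter, assuming standard BED conventions (0‑based, half‑open intervals) and 1‑based VCF positions. load_bed and in_exome are hypothetical helpers; stripping the chr prefix from BED chromosome names is also an assumption of this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Intervals = Dict[str, List[Tuple[int, int]]]

def load_bed(lines: Iterable[str]) -> Intervals:
    """Read BED intervals keyed by chromosome name without the 'chr' prefix."""
    intervals = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        if chrom.startswith("chr"):
            chrom = chrom[3:]
        intervals[chrom].append((int(start), int(end)))
    return dict(intervals)

def in_exome(chrom: str, pos: int, intervals: Intervals) -> bool:
    """True if a 1-based position falls inside any 0-based, half-open BED interval."""
    if not intervals:
        return True  # no exome BED available: accept every SNV
    # A linear scan keeps the sketch short; a real implementation would index intervals.
    return any(start < pos <= end for start, end in intervals.get(chrom, []))

exome = load_bed(["chr1\t99\t200\n"])
```

With the single toy interval above, positions 100 through 200 on chromosome 1 are retained and everything else is dropped.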

If vcf_path is omitted or points to a nonexistent file:

The SNV list is empty.
The mutanome and neoantigen steps still run, but they operate on the unmodified reference sequence.

Precomputed database directory (optional)

By default, each darkprofiler run invocation builds a database in:

<output_dir>/database/

The database contains translated and derived proteomes as FASTA files:

canonicalProteome.fa
alternativeSplicing.fa
mutanome.fa
mutatedCanonicalTranscriptome.fa
mutatedAlternativeTranslatome.fa

DarkProfiler also creates persistent fast indices under the same database directory to accelerate Hamming‑distance peptide search, for example:

canonicalProteome.idx/
alternativeSplicing.idx/
mutanome.idx/
mutatedAlternativeORFeome.idx/

If you run DarkProfiler repeatedly with the same reference and SNV set, you can re‑use a prebuilt database to avoid recomputation by passing --database-path / database_path:

darkprofiler run hg38 peptides.fa out --database-path prebuilt_db/

The directory is accepted only if all required files are present. Otherwise:

DarkProfiler prints a warning that the directory is missing files or is invalid.
The directory is ignored.
A new database is built from scratch under <output_dir>/database.
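
Before passing a directory to --database-path, it can be handy to check it yourself. The sketch below assumes the required files are the five FASTA files listed above; missing_database_files is a hypothetical helper, not a DarkProfiler function.

```python
import tempfile
from pathlib import Path
from typing import List

# FASTA files listed in this section; assumed here to be the full required set.
REQUIRED_FASTAS = (
    "canonicalProteome.fa",
    "alternativeSplicing.fa",
    "mutanome.fa",
    "mutatedCanonicalTranscriptome.fa",
    "mutatedAlternativeTranslatome.fa",
)

def missing_database_files(database_dir: str) -> List[str]:
    """Return the required FASTA names that are absent from the directory."""
    root = Path(database_dir)
    return [name for name in REQUIRED_FASTAS if not (root / name).is_file()]

# Demonstration on a temporary directory holding only one of the files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "canonicalProteome.fa").write_text(">x\nMK\n")
    missing = missing_database_files(tmp)
```

An empty return value suggests the directory is complete with respect to this list.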

For darkprofiler splice, the reusable database directory must contain canonicalProteome.fa. If it also contains canonicalProteome.splice.idx/, DarkProfiler reuses that two‑block exact splice index. If the splice index is missing or incomplete and --database-path was supplied, DarkProfiler uses a scan fallback rather than rebuilding the index.

Command‑line usage

The installed CLI is called darkprofiler.

Run darkprofiler --help to see the top‑level usage:

usage: darkprofiler [-h] {download,run,splice} ...

Three subcommands are available:

darkprofiler download – download reference genome bundles.
darkprofiler run – run the classification pipeline.
darkprofiler splice – run spliced peptide classification.

`download` subcommand

darkprofiler download hg38

`run` subcommand

darkprofiler run hg38 peptides.fa output_dir \
  --vcf-path sample.vcf.gz \
  --database-path /path/to/database \
  --num-threads 8 \
  --hamming 2

Optional arguments

--vcf-path FILE
Optional path to a VCF or VCF.GZ file with SNVs.

--database-path DIR
Optional path to an existing database directory containing the required FASTA files listed above.

--num-threads N (default: 1)
Number of worker threads used during peptide search / verification.

-k, --hamming {0,1,2} (default: 0)
Maximum Hamming distance allowed for peptide matching.
0 performs exact matches only; 1 and 2 allow up to one or two amino‑acid substitutions.
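
To make the meaning of the Hamming setting concrete, here is a brute‑force sketch of substitution‑only matching. It is for illustration and is not the indexed search DarkProfiler performs; hamming and find_within are hypothetical helper names.

```python
from typing import List

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))

def find_within(peptide: str, protein: str, k: int) -> List[int]:
    """0-based start offsets where the peptide matches with at most k substitutions."""
    n = len(peptide)
    return [
        i
        for i in range(len(protein) - n + 1)
        if hamming(peptide, protein[i:i + n]) <= k
    ]
```

For example, GILGAVFTL differs from the window GILGFVFTL at one position, so it is found with k = 1 but not with k = 0.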

`splice` subcommand

darkprofiler splice peptides.fa hg38 output_dir \
  --database-path /path/to/splice_database \
  --num-threads 8

The positional arguments are:

peptide_fasta
Peptide FASTA file to classify.

reference_fasta
Either hg19, hg38, mm10, or mm39, or a path to an existing protein FASTA file.

output_dir
Output directory.

Optional arguments

--database-path DIR
Optional path to an existing splice database directory containing canonicalProteome.fa.

--num-threads N (default: 1)
Number of worker threads used during peptide search / verification.

Python API

from darkprofiler.run import classify_peptides
from darkprofiler.splice import classify_spliced_peptides

classify_peptides(
    reference="hg38",
    peptide_fasta="peptides.fa",
    output_dir="output",
    vcf_path=None,
    database_path=None,
    num_threads=4,
    hamming_distance=0,
)

classify_spliced_peptides(
    peptide_fasta="peptides.fa",
    reference_fasta="hg38",
    output_dir="splice_output",
    database_path=None,
    num_threads=4,
)

Classification pipeline details

Overview of steps

Filter VCF to exome
Load transcriptome, CDS annotations, canonical transcript list
Build canonical / non‑canonical transcript sets
Build canonical proteome (CDS must start with ATG) and classify peptides
Build alternative splicing proteome (CDS must start with ATG) and classify peptides
Apply SNVs, build mutanome (CDS must start with ATG) and classify peptides
Build alternative ORFs (3 frames) and classify peptides
Write unaligned peptides and summary plots
Finalize

For darkprofiler splice, progress is reported in the same stderr progress‑bar format, through these steps:

Prepare canonical proteome/database
Build or load splice index
Load reference proteome
Classify spliced peptides
Write FASTA outputs and pie chart
Finalize

Category definitions

CDS translation filter (ATG)
For CDS‑based proteomes (canonical proteome, alternative splicing, mutanome), CDS translations are included only when the CDS begins with ATG. This reduces false positives from incomplete or mis‑annotated CDS records.
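
A minimal sketch of the ATG filter, using the standard genetic code built from the conventional TCAG codon ordering. translate_cds is a hypothetical helper; stopping at the first stop codon and writing unknown codons as X are assumptions of this sketch, not documented DarkProfiler behavior.

```python
from itertools import product
from typing import Optional

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate_cds(cds: str) -> Optional[str]:
    """Translate a CDS only if it starts with ATG; otherwise return None."""
    cds = cds.upper()
    if not cds.startswith("ATG"):
        return None  # filtered out: incomplete or mis-annotated CDS
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X for codons with ambiguous bases
        if aa == "*":
            break  # assumption: translation stops at the first stop codon
        protein.append(aa)
    return "".join(protein)
```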
ORF region labels
For alternative ORF hits, DarkProfiler labels the peptide start as:
- uORF (upstream of CDS start)
- intORF (out-of-frame peptides from inside the annotated CDS span)
- dORF (downstream of CDS end)
- lncRNA (no CDS annotation)
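
The labeling rule can be sketched as a comparison of the peptide start against the CDS boundaries. orf_region is a hypothetical helper; 1‑based inclusive transcript coordinates and the use of None for transcripts without a CDS are assumptions of this sketch.

```python
from typing import Optional

def orf_region(
    peptide_start: int, cds_start: Optional[int], cds_end: Optional[int]
) -> str:
    """Label an alternative-ORF hit by where its start lies relative to the CDS."""
    if cds_start is None or cds_end is None:
        return "lncRNA"  # transcript has no CDS annotation
    if peptide_start < cds_start:
        return "uORF"    # upstream of the CDS start
    if peptide_start > cds_end:
        return "dORF"    # downstream of the CDS end
    return "intORF"      # inside the annotated CDS span
```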
Spliced peptide seed strategy
The splice subcommand always uses two pigeonhole blocks and exact matching (hamming = 0) for peptides of length 7-12 amino acids. A peptide of length L is split into two blocks, and the minimum block size is 3 amino acids. If the whole peptide is a contiguous reference match, at least one of those exact blocks anchors the alignment and the peptide is written to non-spliced.fa. If the anchored alignment breaks at the first mismatch, the matched side is treated as one splice fragment and the remaining side is searched elsewhere in the same reference entry. The second fragment must match exactly, must not overlap the first fragment, and must have at least one amino acid separating the two fragments in the reference. When several second-fragment matches exist, the nearest valid match is selected.
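
The strategy above can be approximated in a few lines for a single reference entry. This is a simplified sketch under stated assumptions, not DarkProfiler's implementation: it scans the protein with str.find instead of using the seed index, cuts the peptide at the midpoint, imposes no minimum length on the second fragment, and returns the first anchor that yields a valid partner. All function names are hypothetical.

```python
from typing import List, Optional, Tuple

def find_all(text: str, pattern: str) -> List[int]:
    """All 0-based start positions of pattern in text (overlaps allowed)."""
    hits, pos = [], text.find(pattern)
    while pos != -1:
        hits.append(pos)
        pos = text.find(pattern, pos + 1)
    return hits

def nearest_partner(protein: str, fragment: str, a_start: int, a_end: int) -> Optional[int]:
    """Closest exact hit of fragment separated by >= 1 residue from [a_start, a_end)."""
    best = None
    for t in find_all(protein, fragment):
        t_end = t + len(fragment)
        if t >= a_end + 1:
            gap = t - a_end
        elif t_end <= a_start - 1:
            gap = a_start - t_end
        else:
            continue  # overlapping or directly adjacent
        if best is None or gap < best[0]:
            best = (gap, t)
    return None if best is None else best[1]

def classify_splice(peptide: str, protein: str) -> Tuple[str, str, str]:
    """Return (label, header peptide field, 1-based coordinate field)."""
    n, mid = len(peptide), len(peptide) // 2
    if not 7 <= n <= 12:
        return ("unknown", peptide, "")
    pos = protein.find(peptide)
    if pos != -1:
        return ("non-spliced", peptide, str(pos + 1))
    # Anchor on the left block, extend right until the first mismatch.
    for s in find_all(protein, peptide[:mid]):
        j = mid
        while j < n and s + j < len(protein) and protein[s + j] == peptide[j]:
            j += 1
        t = nearest_partner(protein, peptide[j:], s, s + j)
        if t is not None:
            return ("spliced", f"{peptide[:j]}_{peptide[j:]}", f"{s + 1}_{t + 1}")
    # Anchor on the right block, extend left until the first mismatch.
    for s in find_all(protein, peptide[mid:]):
        i = mid
        while i > 0 and s - mid + i - 1 >= 0 and protein[s - mid + i - 1] == peptide[i - 1]:
            i -= 1
        start = s - mid + i
        t = nearest_partner(protein, peptide[:i], start, start + n - i)
        if t is not None:
            return ("spliced", f"{peptide[:i]}_{peptide[i:]}", f"{t + 1}_{start + 1}")
    return ("unknown", peptide, "")

toy_protein = "MRLKGGGGGADDFWWW"  # invented sequence for the example
```

On the toy protein, ADDFRLK is anchored by ADD, extended to ADDF, and completed by the RLK fragment found upstream, mirroring the reverse‑order coordinates seen in spliced.fa headers.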

Outputs

All outputs live in the specified output_dir.

FASTA category files

Each category is represented by a separate FASTA file in output_dir:

canonicalProteome.fa
alternativeSplicing.fa
neoantigen.fa
alternativeReadingFrame.fa
unknown.fa

For classification FASTAs (all except unknown.fa), each record uses:

> referencePeptide | TranscriptID | nucleotide coordinate on transcript | uORF/intORF/dORF/lncRNA/CDS
queryPeptide

referencePeptide: matched reference peptide sequence (substring from the reference proteome/ORF; same length as the query)
TranscriptID: transcript identifier (for alternative ORFs, this is the underlying transcript)
nucleotide coordinate on transcript: 1‑based transcript coordinate of the peptide start codon (frame‑aware for alternative ORFs)
uORF/intORF/dORF/lncRNA/CDS:
- CDS for canonical proteome / alternative splicing / neoantigen hits
- uORF, intORF, dORF, lncRNA for alternative ORF hits

Example:

> GILGFVFTL | ENST00000335137.4 | 1234 | CDS
GILGFVFTL

unknown.fa uses the original peptide IDs and sequences without additional fields.
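
For downstream analysis, the classification headers can be split back into their four fields. The parser below is a sketch based only on the header layout shown above; Hit and parse_header are hypothetical names.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    reference_peptide: str
    transcript_id: str
    coordinate: int  # 1-based transcript coordinate of the peptide start codon
    region: str      # CDS, uORF, intORF, dORF, or lncRNA

def parse_header(line: str) -> Hit:
    """Parse a '> peptide | transcript | coordinate | region' header line."""
    fields = [field.strip() for field in line.strip().lstrip(">").split("|")]
    if len(fields) != 4:
        raise ValueError(f"Expected 4 fields, found {len(fields)}: {line!r}")
    return Hit(fields[0], fields[1], int(fields[2]), fields[3])

hit = parse_header("> GILGFVFTL | ENST00000335137.4 | 1234 | CDS")
```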

For darkprofiler splice, the output FASTA files are:

non-spliced.fa
spliced.fa
unknown.fa

non-spliced.fa records use amino-acid coordinates on the reference protein:

>ADDFRLK | originalEntryIDFromReferenceFASTA | 606 | non-spliced
ADDFRLK

spliced.fa records use two amino-acid coordinates, one for each peptide fragment in peptide order:

>ADDF_RLK | originalEntryIDFromReferenceFASTA | 606_303 | spliced
ADDFRLK

In this example, 606 is the 1-based reference coordinate for ADDF, and 303 is the 1-based reference coordinate for RLK. The underscore marks the splice junction in the FASTA header peptide field and coordinate field; it is not present in the peptide sequence.
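
The spliced header can be unpacked into fragment and coordinate pairs as sketched below. parse_spliced_header is a hypothetical helper; it splits from both ends of the header because entry IDs copied from a user‑provided FASTA may themselves contain the | character (an assumption about such inputs, for example UniProt‑style IDs).

```python
from typing import List, Tuple

def parse_spliced_header(line: str) -> Tuple[List[Tuple[str, int]], str, str]:
    """Return ([(fragment, 1-based coordinate), ...], entry ID, label)."""
    body = line.strip().lstrip(">")
    peptide_field, rest = body.split("|", 1)
    entry_id, coord_field, label = rest.rsplit("|", 2)
    fragments = peptide_field.strip().split("_")
    coords = [int(c) for c in coord_field.strip().split("_")]
    if len(fragments) != len(coords):
        raise ValueError(f"Fragment/coordinate count mismatch: {line!r}")
    return list(zip(fragments, coords)), entry_id.strip(), label.strip()

parsed = parse_spliced_header(
    ">ADDF_RLK | originalEntryIDFromReferenceFASTA | 606_303 | spliced"
)
```

The same function also handles non-spliced.fa headers, which yield a single fragment and coordinate.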

`pieChart.tsv`

A tab‑separated summary file with one line per category:

Category    Count
canonical   123
alternativeSplicing 45
neoantigen  7
alternativeReadingFrame 32
unknown     83

For darkprofiler splice, the categories are:

Category    Count
non-spliced 123
spliced     45
unknown     83
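
The summary file is easy to post‑process. The sketch below assumes the two‑column header shown above (Category, Count) and converts counts to fractions; category_fractions is a hypothetical helper.

```python
import csv
import io
from typing import Dict, TextIO

def category_fractions(handle: TextIO) -> Dict[str, float]:
    """Read a Category/Count TSV and return each category's share of the total."""
    reader = csv.DictReader(handle, delimiter="\t")
    counts = {row["Category"]: int(row["Count"]) for row in reader}
    total = sum(counts.values())
    return {name: (count / total if total else 0.0) for name, count in counts.items()}

# Same numbers as the splice example above, written with real tab separators.
example = io.StringIO("Category\tCount\nnon-spliced\t123\nspliced\t45\nunknown\t83\n")
fractions = category_fractions(example)
```

To use it on a real run, open the file with open("output_dir/pieChart.tsv", newline="") and pass the handle in place of the in‑memory example.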

`pieChart.pdf`

A pie chart illustrating the fraction of peptides in each category is saved as pieChart.pdf.

Database reuse and performance tips

Reuse databases
Use --database-path to reuse a database directory containing the required FASTA files.

Persistent fast indices
DarkProfiler builds on‑disk indices (*.idx/) for fast peptide lookup with Hamming distance ≤ 2 using a pigeonhole (seed‑and‑verify) strategy. When an index directory exists, it is reused automatically.
The splice subcommand builds canonicalProteome.splice.idx/, a separate two‑block exact seed index for peptide lengths 7 through 12.

Multi‑threading
Increase --num-threads to speed up peptide search / verification on multi‑core machines.
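
The pigeonhole idea behind seed‑and‑verify is that a peptide with at most k substitutions, cut into k + 1 blocks, must contain at least one block that matches exactly. The sketch below illustrates this on a single in‑memory sequence using str.find in place of an on‑disk index; split_blocks and seed_and_verify are hypothetical names, and the even block sizes are an assumption.

```python
from typing import List, Tuple

def split_blocks(peptide: str, n_blocks: int) -> List[Tuple[int, str]]:
    """Cut a peptide into n_blocks near-equal pieces, returned as (offset, block)."""
    base, extra = divmod(len(peptide), n_blocks)
    blocks, start = [], 0
    for i in range(n_blocks):
        size = base + (1 if i < extra else 0)
        blocks.append((start, peptide[start:start + size]))
        start += size
    return blocks

def seed_and_verify(peptide: str, protein: str, k: int) -> List[int]:
    """0-based starts where the peptide matches with at most k substitutions."""
    n, hits = len(peptide), set()
    for offset, block in split_blocks(peptide, k + 1):
        pos = protein.find(block)  # seed: exact block hit
        while pos != -1:
            start = pos - offset
            if 0 <= start <= len(protein) - n:
                window = protein[start:start + n]
                # verify: count substitutions over the full peptide length
                if sum(a != b for a, b in zip(peptide, window)) <= k:
                    hits.add(start)
            pos = protein.find(block, pos + 1)
    return sorted(hits)
```

Only windows that share an exact block with the peptide are verified, which is what makes the approach faster than comparing against every window.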

Troubleshooting

Unsupported reference

The reference must be one of hg19, hg38, mm10, mm39.

Missing genome files

Run darkprofiler download <reference> in the same environment.

Long runtime

Increase --num-threads.
Use -k/--hamming 0 (exact matching only) when mismatches are not needed.
Reuse databases and indices between runs.

License

DarkProfiler is released under the MIT License.

Citation

If you use DarkProfiler in a scientific publication, please cite it as:

(Updated citation information will be provided once an associated preprint or manuscript is available.)
