DarkProfiler: Alignment and Classification of Peptides from Reference-Independent De Novo Peptide Sequencing Experiments.

DarkProfiler takes peptide sequences (e.g., from reference‑independent de novo peptide sequencing) and classifies them into distinct categories using reference genomes and optional sample‑specific SNVs:

Canonical proteome
Alternative splicing
Neoantigens (SNV‑derived mutanome)
Alternative reading frame peptides
Unknown / unaligned

DarkProfiler is intended to be the post‑processing / annotation step after de novo peptide calling or customized peptide discovery in proteomics or immunopeptidomics experiments.

Supported reference assemblies:

Human: hg19 (GENCODE release 19), hg38 (GENCODE release 37)
Mouse: mm10 (GENCODE release M19), mm39 (GENCODE release M37)

The same logic is available both as a command‑line tool and as a Python API.

Installation
Reference genome data
- Supported references
- What gets downloaded
Input data
Command‑line usage
Python API
- Function reference
- Python examples
Classification pipeline details
- Overview of steps
- Category definitions
Outputs
Database reuse and performance tips
Troubleshooting
License
Citation

Installation

Requirements

Python: 3.7+ (tested on modern CPython versions)
Operating systems: Linux, macOS, and other UNIX‑like systems should work. On Windows, running under WSL is recommended.
Python dependencies (installed automatically via pip/conda):
- Biopython (FASTA parsing and sequence utilities)
- matplotlib (for pieChart.pdf)
- Standard library modules only otherwise

You also need sufficient disk space to store:

A reference genome bundle per assembly (hundreds of MB)
The database directory (translated proteomes + fast indices) per output folder
The final classification FASTA files and plots

Install with pip (PyPI)

pip install darkprofiler

This installs:

The Python package darkprofiler
The command‑line entry point darkprofiler

You should then be able to run:

darkprofiler --help

Install with conda (bioconda)

conda install bioconda::darkprofiler

This will install DarkProfiler together with all dependencies into the active conda environment.

Reference genome data

Supported references

DarkProfiler currently supports human and mouse reference assemblies that are aligned to GENCODE releases:

hg19 (GENCODE release 19)
hg38 (GENCODE release 37)
mm10 (GENCODE release M19)
mm39 (GENCODE release M37)

The reference is always specified by one of the lower‑case strings:

hg19
hg38
mm10
mm39

Internally the reference is normalized to lower case, so HG38 and hg38 are treated the same in the Python API, but the CLI restricts choices to the canonical lower‑case names.
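The normalization described above can be pictured with a small sketch. The function name `normalize_reference` and the constant `SUPPORTED_REFERENCES` are hypothetical and are not taken from the DarkProfiler source; only the four assembly names and the lower-casing behaviour come from this README.

```python
# Hypothetical helper: mirrors the documented behaviour, not the package's real code.
SUPPORTED_REFERENCES = ("hg19", "hg38", "mm10", "mm39")


def normalize_reference(name: str) -> str:
    """Lower-case a reference name and reject unsupported assemblies."""
    normalized = name.lower()
    if normalized not in SUPPORTED_REFERENCES:
        supported = ", ".join(SUPPORTED_REFERENCES)
        raise ValueError(f"Unsupported reference {name!r}; expected one of: {supported}")
    return normalized


if __name__ == "__main__":
    print(normalize_reference("HG38"))  # same result as passing "hg38"
```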

What gets downloaded

Reference data are distributed as versioned ZIP bundles hosted online. You do not need to download or unpack them manually. Use:

darkprofiler download hg38

This will:

Check that the requested reference is supported.
Download a ZIP bundle (for example, darkprofiler_hg38.zip) into the installed package directory under darkprofiler/genome/.

Extract the contents to:

<python-site-packages>/darkprofiler/genome/hg38/

Print progress messages such as:

[darkprofiler] Downloading ...
[darkprofiler] Extracting to ...
[darkprofiler] Finished. Reference 'hg38' is now available.

The extracted directory contains at least the following files (names may include version tags):

transcriptome.<reference>.fa – all reference transcripts (FASTA)
transcriptome.<reference>.cds.bed – CDS segments per transcript
knownCanonical.<reference>.list – list of canonical transcript IDs
gencode.<reference>.gff – GENCODE annotation (GFF/GTF‑like)
exome.<reference>.bed – exome intervals used to filter SNVs

These files are used internally by the pipeline; you normally don’t need to interact with them directly.

Note: If the download step has not been run for a given reference, darkprofiler run will fail with an error such as “Could not find file ... in genome root”.
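To check in advance whether a reference looks complete, something like the following sketch may help. It assumes the files carry exactly the names listed above with no version tags, which may not hold for every bundle; `missing_reference_files` is a hypothetical helper, not part of the DarkProfiler API.

```python
from pathlib import Path

# File name templates taken from the list above; real bundles may add version tags.
EXPECTED_TEMPLATES = (
    "transcriptome.{ref}.fa",
    "transcriptome.{ref}.cds.bed",
    "knownCanonical.{ref}.list",
    "gencode.{ref}.gff",
    "exome.{ref}.bed",
)


def missing_reference_files(genome_dir, reference):
    """Return the expected file names that are absent from genome_dir."""
    root = Path(genome_dir)
    expected = [template.format(ref=reference) for template in EXPECTED_TEMPLATES]
    return [name for name in expected if not (root / name).is_file()]
```

An empty return value suggests the bundle was extracted; a non-empty one lists what to look for before re-running darkprofiler download.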

Input data

Peptide FASTA

The primary input is a FASTA file containing peptide sequences to classify:

>peptide_1
LLLLGIGGTFK
>peptide_2
EAVAEQAALR
...

Requirements and recommendations:

Each record is interpreted as a peptide (amino‑acid sequence).
FASTA IDs are kept as‑is and propagated to the output files.
Sequences are upper‑cased internally; non‑standard characters are not specially treated.
Empty sequences are silently ignored.
There is no hard limit on peptide length, but short peptides may match many locations, and very long peptides will rarely find a match.
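The input rules above (IDs kept as-is, sequences upper-cased, empty records skipped) can be sketched with a dependency-free reader. DarkProfiler itself lists Biopython for FASTA parsing, so this is only an illustration of the documented behaviour; `read_peptides` is a hypothetical name.

```python
def read_peptides(lines):
    """Yield (record_id, sequence) pairs from FASTA-formatted lines.

    Sequences are upper-cased; records without any sequence are skipped.
    """
    record_id, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            # Emit the previous record before starting a new one.
            if record_id is not None and chunks:
                yield record_id, "".join(chunks).upper()
            record_id, chunks = line[1:].strip(), []
        elif line and record_id is not None:
            chunks.append(line)
    if record_id is not None and chunks:
        yield record_id, "".join(chunks).upper()


if __name__ == "__main__":
    example = [">peptide_1", "LLLLGIGGTFK", ">peptide_2", "EAVAEQAALR"]
    for peptide_id, sequence in read_peptides(example):
        print(peptide_id, sequence)
```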

Within a given Hamming distance, a peptide sequence is assigned to at most one output category: the first category that matches in pipeline order (canonical → alternative splicing → neoantigen → alternative reading frame → unknown).
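The first-match rule can be sketched as a simple cascade. The category labels follow the names used in pieChart.tsv; the `matchers` mapping and `assign_category` are hypothetical stand-ins for the real search steps.

```python
# Order in which categories are tried; "unknown" is the fallback.
CATEGORY_ORDER = (
    "canonical",
    "alternativeSplicing",
    "neoantigen",
    "alternativeReadingFrame",
)


def assign_category(peptide, matchers):
    """Return the first category whose matcher accepts the peptide.

    matchers maps a category name to a callable taking a peptide and
    returning True when the peptide is found in that category.
    """
    for category in CATEGORY_ORDER:
        if matchers[category](peptide):
            return category
    return "unknown"
```

Because the cascade stops at the first hit, a peptide that would match both the canonical proteome and an alternative reading frame is reported only as canonical.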

VCF with SNVs (optional)

To classify neoantigens (peptides derived from sample‑specific single nucleotide variants), you can provide a VCF file via --vcf-path / vcf_path:

Accepts plain or gzipped VCF: *.vcf or *.vcf.gz.
Only SNVs (single‑base reference and single‑base alternate) are used.
Multi‑allelic entries are expanded and processed per ALT allele.
Non‑SNV variants (indels, MNVs, etc.) are ignored.
Coordinates are matched to the reference via chromosome names that are normalized to strip the chr prefix (chr1 → 1).

DarkProfiler additionally filters SNVs to the coding exome using the exome.<reference>.bed file if present:

Only SNVs whose positions overlap the exome intervals are retained.
If no exome BED is available, all SNVs are accepted.
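A minimal sketch of the SNV selection rules above follows. It assumes standard conventions that this README does not spell out: VCF positions are 1-based, BED intervals are 0-based and half-open, and only A/C/G/T alleles count as single bases. The function names are hypothetical, and the linear interval scan is for illustration only.

```python
BASES = {"A", "C", "G", "T"}


def normalize_chrom(chrom):
    """Strip a leading 'chr' prefix (chr1 -> 1)."""
    return chrom[3:] if chrom.startswith("chr") else chrom


def iter_snvs(vcf_lines):
    """Yield (chrom, pos, ref, alt) for every single-base REF/ALT pair."""
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref = normalize_chrom(fields[0]), int(fields[1]), fields[3].upper()
        for alt in fields[4].upper().split(","):  # expand multi-allelic entries
            if ref in BASES and alt in BASES:     # drops indels, MNVs, symbolic alleles
                yield chrom, pos, ref, alt


def in_exome(chrom, pos, exome):
    """True when a 1-based position overlaps a 0-based half-open BED interval.

    exome maps chromosome name -> list of (start, end); None accepts everything.
    """
    if exome is None:
        return True
    return any(start < pos <= end for start, end in exome.get(chrom, ()))
```

For a gzipped VCF, the lines could be read with `gzip.open(path, "rt")` from the standard library before being passed to `iter_snvs`.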

If vcf_path is omitted or points to a non‑existing file:

The SNV list is empty.
The mutanome and neoantigen steps still run, but represent the unmodified reference sequence.

Precomputed database directory (optional)

By default, each darkprofiler run invocation builds a database in:

<output_dir>/database/

The database contains translated and derived proteomes as FASTA files:

canonicalProteome.fa
alternativeSplicing.fa
mutanome.fa
mutatedCanonicalTranscriptome.fa
mutatedAlternativeTranslatome.fa

DarkProfiler also creates persistent fast indices under the same database directory to accelerate peptide search with Hamming distance, for example:

canonicalProteome.idx/
alternativeSplicing.idx/
mutanome.idx/
mutatedAlternativeORFeome.idx/

If you run DarkProfiler repeatedly with the same reference and SNV set, you can re‑use a prebuilt database to avoid recomputation by passing --database-path / database_path:

darkprofiler run hg38 peptides.fa out --database-path prebuilt_db/

The directory is accepted only if all required files are present. Otherwise:

DarkProfiler prints a warning that the directory is missing files or is invalid.
The directory is ignored.
A new database is built from scratch under <output_dir>/database.
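The accept-or-rebuild decision can be sketched as below. It assumes that the "required files" are exactly the five FASTA files listed earlier in this section; `resolve_database_dir` is a hypothetical helper and the warning text is illustrative, not DarkProfiler's actual message.

```python
from pathlib import Path

# Assumed to be the required set, based on the FASTA list in this section.
REQUIRED_FASTAS = (
    "canonicalProteome.fa",
    "alternativeSplicing.fa",
    "mutanome.fa",
    "mutatedCanonicalTranscriptome.fa",
    "mutatedAlternativeTranslatome.fa",
)


def resolve_database_dir(database_path, output_dir):
    """Return (directory, needs_build) following the fallback rule above."""
    if database_path is not None:
        candidate = Path(database_path)
        missing = [name for name in REQUIRED_FASTAS if not (candidate / name).is_file()]
        if not missing:
            return candidate, False
        print(f"warning: ignoring {candidate}; missing files: {', '.join(missing)}")
    # Fall back to a fresh database under the output directory.
    return Path(output_dir) / "database", True
```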

Command‑line usage

The installed CLI is called darkprofiler.

Run darkprofiler --help to see the top‑level usage:

usage: darkprofiler [-h] {download,run} ...

Two subcommands are available:

darkprofiler download – download reference genome bundles.
darkprofiler run – run the classification pipeline.

`download` subcommand

darkprofiler download hg38

`run` subcommand

darkprofiler run hg38 peptides.fa output_dir \
  --vcf-path sample.vcf.gz \
  --database-path /path/to/database \
  --num-threads 8 \
  --hamming 2

Optional arguments

--vcf-path FILE
Optional path to a VCF or VCF.GZ file with SNVs.

--database-path DIR
Optional path to an existing database directory containing the required FASTA files listed above.

--num-threads N (default: 1)
Number of worker threads used during peptide search / verification.

-k, --hamming {0,1,2} (default: 0)
Maximum Hamming distance allowed for peptide matching.
0 performs exact matches only; 1 and 2 allow up to one or two amino‑acid substitutions.
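To make the effect of -k/--hamming concrete, here is a brute-force sketch of Hamming-distance matching against a single protein sequence. It is not how DarkProfiler searches (the tool uses indices); `hamming_distance` and `find_matches` are hypothetical names used only for illustration.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def find_matches(peptide, protein, max_mismatches=0):
    """Return (offset, reference_window) for every window within the allowed distance."""
    n = len(peptide)
    matches = []
    for offset in range(len(protein) - n + 1):
        window = protein[offset:offset + n]
        if hamming_distance(peptide, window) <= max_mismatches:
            matches.append((offset, window))
    return matches
```

With the toy protein MAGILGFVFTLK, the query GILGFVATL has no exact match, but it is found at offset 2 once one substitution is allowed.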

Python API

from darkprofiler.run import classify_peptides

classify_peptides(
    reference="hg38",
    peptide_fasta="peptides.fa",
    output_dir="output",
    vcf_path=None,
    database_path=None,
    num_threads=4,
    hamming_distance=0,
)

Classification pipeline details

Overview of steps

Filter VCF to exome
Load transcriptome, CDS annotations, canonical transcript list
Build canonical / non‑canonical transcript sets
Build canonical proteome (CDS must start with ATG) and classify peptides
Build alternative splicing proteome (CDS must start with ATG) and classify peptides
Apply SNVs, build mutanome (CDS must start with ATG) and classify peptides
Build alternative ORFs (3 frames) and classify peptides
Write unaligned peptides and summary plots
Finalize
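The translation steps in the list above can be sketched as follows: CDS translation gated on an ATG start, and three forward frames for alternative ORFs. This uses the standard genetic code and keeps stop codons as `*`; how DarkProfiler handles stops, ambiguous bases, or ORF boundaries is not stated here, so treat those details as assumptions. All function names are hypothetical.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def translate(nucleotides):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    seq = nucleotides.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def translate_cds(cds):
    """Return the CDS translation, or None when the CDS does not start with ATG."""
    if not cds.upper().startswith("ATG"):
        return None
    return translate(cds)


def three_frame_translations(transcript):
    """Translate a transcript in forward frames 0, 1 and 2."""
    return {frame: translate(transcript[frame:]) for frame in range(3)}
```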

Category definitions

CDS translation filter (ATG)
For CDS‑based proteomes (canonical proteome, alternative splicing, mutanome), CDS translations are included only when the CDS begins with ATG. This reduces false positives from incomplete or mis‑annotated CDS records.
ORF region labels
For alternative ORF hits, DarkProfiler labels the peptide start as:
- uORF (upstream of CDS start)
- intORF (out-of-frame peptides from inside the annotated CDS span)
- dORF (downstream of CDS end)
- lncRNA (no CDS annotation)
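The labelling rule can be sketched as a comparison of the peptide start against the CDS boundaries. The sketch assumes 1-based, inclusive transcript coordinates for both the peptide start and the CDS span; `orf_region_label` is a hypothetical name.

```python
def orf_region_label(peptide_start, cds_start=None, cds_end=None):
    """Label an alternative-ORF hit by where its start falls on the transcript.

    Coordinates are assumed to be 1-based and inclusive. Transcripts without
    a CDS annotation (cds_start or cds_end is None) are labelled lncRNA.
    """
    if cds_start is None or cds_end is None:
        return "lncRNA"
    if peptide_start < cds_start:
        return "uORF"
    if peptide_start > cds_end:
        return "dORF"
    return "intORF"
```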

Outputs

All outputs live in the specified output_dir.

FASTA category files

Each category is represented by a separate FASTA file in output_dir:

canonicalProteome.fa
alternativeSplicing.fa
neoantigen.fa
alternativeReadingFrame.fa
unknown.fa

For classification FASTAs (all except unknown.fa), each record uses:

> referencePeptide | TranscriptID | nucleotide coordinate on transcript | uORF/intORF/dORF/lncRNA/CDS
queryPeptide

referencePeptide: matched reference peptide sequence (substring from the reference proteome/ORF; same length as the query)
TranscriptID: transcript identifier (for alternative ORFs, this is the underlying transcript)
nucleotide coordinate on transcript: 1‑based transcript coordinate of the peptide start codon (frame‑aware for alternative ORFs)
uORF/intORF/dORF/lncRNA/CDS:
- CDS for canonical proteome / alternative splicing / neoantigen hits
- uORF, intORF, dORF, lncRNA for alternative ORF hits

Example:

> GILGFVFTL | ENST00000335137.4 | 1234 | CDS
GILGFVFTL

unknown.fa uses the original peptide IDs and sequences without additional fields.
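For downstream analysis, the header layout described above can be split into fields with a few lines of Python. This is a sketch based only on the documented format; `Hit` and `parse_hit` are hypothetical names, not part of the DarkProfiler API.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    reference_peptide: str
    transcript_id: str
    coordinate: int      # 1-based transcript coordinate of the peptide start codon
    region: str          # CDS, uORF, intORF, dORF or lncRNA
    query_peptide: str


def parse_hit(header, sequence):
    """Parse one classification FASTA record into a Hit."""
    fields = [field.strip() for field in header.lstrip(">").split("|")]
    if len(fields) != 4:
        raise ValueError(f"Expected 4 '|'-separated fields, got {len(fields)}: {header!r}")
    reference_peptide, transcript_id, coordinate, region = fields
    return Hit(reference_peptide, transcript_id, int(coordinate), region, sequence.strip())


if __name__ == "__main__":
    hit = parse_hit("> GILGFVFTL | ENST00000335137.4 | 1234 | CDS", "GILGFVFTL")
    print(hit.transcript_id, hit.coordinate, hit.region)
```

Comparing `reference_peptide` with `query_peptide` position by position shows which residues differ when a non-zero Hamming distance was used.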

`pieChart.tsv`

A tab‑separated summary file with one line per category:

Category    Count
canonical   123
alternativeSplicing 45
neoantigen  7
alternativeReadingFrame 32
unknown     83
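The summary file is easy to post-process. The sketch below turns the counts into fractions, assuming the header names Category and Count shown in the example; `category_fractions` is a hypothetical helper, and the embedded text reuses the example values above.

```python
import csv
import io

# Example content copied from the table above, with real tab separators.
EXAMPLE_TSV = (
    "Category\tCount\n"
    "canonical\t123\n"
    "alternativeSplicing\t45\n"
    "neoantigen\t7\n"
    "alternativeReadingFrame\t32\n"
    "unknown\t83\n"
)


def category_fractions(tsv_text):
    """Return {category: fraction of all peptides} from pieChart.tsv content."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = {row["Category"]: int(row["Count"]) for row in reader}
    total = sum(counts.values())
    return {name: (count / total if total else 0.0) for name, count in counts.items()}


if __name__ == "__main__":
    for name, fraction in category_fractions(EXAMPLE_TSV).items():
        print(f"{name}\t{fraction:.1%}")
```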

`pieChart.pdf`

A pie chart illustrating the fraction of peptides in each category is saved as pieChart.pdf.

Database reuse and performance tips

Reuse databases
Use --database-path to reuse a database directory containing the required FASTA files.
Persistent fast indices
DarkProfiler builds on‑disk indices (*.idx/) for fast peptide lookup with Hamming distance ≤ 2 using a pigeonhole (seed‑and‑verify) strategy. When an index directory exists, it is reused automatically.
Multi‑threading
Increase --num-threads to speed up peptide search / verification on multi‑core machines.
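The pigeonhole (seed-and-verify) idea mentioned above can be illustrated in memory. If a peptide is split into k + 1 contiguous seeds and at most k residues are substituted, at least one seed must still match exactly, so only positions where a seed occurs need full verification. The sketch uses `str.find` on a single sequence instead of an on-disk index, and `seed_and_verify` is a hypothetical name; it is not DarkProfiler's implementation.

```python
def seed_and_verify(peptide, protein, k):
    """Return sorted start offsets where peptide matches protein with <= k substitutions."""
    n = len(peptide)
    if n < k + 1:
        raise ValueError("peptide must have at least k + 1 residues")
    # Boundaries of k + 1 contiguous, non-empty seeds covering the peptide.
    bounds = [n * i // (k + 1) for i in range(k + 2)]
    hits = set()
    for lo, hi in zip(bounds, bounds[1:]):
        seed = peptide[lo:hi]
        pos = protein.find(seed)
        while pos != -1:
            start = pos - lo  # where the whole peptide would begin
            if 0 <= start <= len(protein) - n:
                window = protein[start:start + n]
                # Verify the full window, not just the seed.
                if sum(a != b for a, b in zip(peptide, window)) <= k:
                    hits.add(start)
            pos = protein.find(seed, pos + 1)
    return sorted(hits)
```

The verification step is what keeps the result exact: a seed hit only nominates a candidate position, and the full-length comparison decides whether it is reported.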

Troubleshooting

Unsupported reference

The reference must be one of hg19, hg38, mm10, mm39.

Missing genome files

Run darkprofiler download <reference> in the same environment.

Large runtime

Increase --num-threads.
Use -k/--hamming 0 (exact matching only) when mismatches do not need to be tolerated.
Reuse databases and indices between runs.

License

DarkProfiler is released under the MIT License.

Citation

If you use DarkProfiler in a scientific publication, please cite it as:

(Updated citation information will be provided once an associated preprint or manuscript is available.)
