Tools for graph-based and string-based mapping and remapping of genomic ↔ transcript ↔ amino acid sequences.

deeplotyper

Deeplotyper is a Python toolkit for genomic and transcriptomic sequence analysis that focuses on mapping coordinates between genomes and transcripts, applying variant haplotypes (sets of SNVs/indels) to reference sequences, and extracting open reading frames (ORFs) as either linear sequences or graph representations. It is designed as an academically rigorous, transparent alternative to traditional variant effect prediction tools. Deeplotyper’s core modules enable fine-grained control and interpretation of complex genetic variants without reliance on large external databases or opaque heuristics.

Installation

pip install deeplotyper

Requires:

  • Python ≥ 3.8
  • Biopython
  • pysam

Quickstart

from deeplotyper import (
    SequenceCoordinateMapper,
    HaplotypeRemapper,
    HaplotypeGroups,
    find_orfs, get_longest_orf,
    make_aligner, apply_alignment_gaps,
    build_linear_coords, build_raw_genome_coords, build_raw_transcript_coords,
    BaseCoordinateMapping, CodonCoordinateMapping,
    SequenceMappingResult, TranscriptMappingResult,
    HaplotypeEvent, NewTranscriptSequences, RawBase
)

# 1. Map a transcript to the genome
mapper = SequenceCoordinateMapper()
results = mapper.map_transcripts(
    genome_metadata={"seq_region_accession": "chr1", "start": 100, "strand": 1},
    full_genomic_sequence="ATGGGGTTTCCC...",
    exon_definitions_by_transcript={
        "tx1": [
            {"exon_number": 1, "start": 100, "end": 102, "sequence": "ATG"},
            
        ]
    },
    transcript_sequences={"tx1": "ATGCCC"},
    exon_orders={"tx1": [1]},
    min_block_length=5
)

# 2. Apply SNV/indel haplotypes
hap_map = {
    (
        HaplotypeEvent(pos0=2, ref_allele="A", alt_seq="G"),
    ): ()
}
remapper = HaplotypeRemapper("ATGAAA...", results)
mutated = remapper.apply_haplotypes(hap_map)

# 3. Group samples by haplotype from a VCF
groups = HaplotypeGroups.from_vcf(
    "variants.vcf.gz",
    ref_seq="ATGAAA...",
    contig="1",
    start=0
)
distinct = groups.materialize()

Sequence Coordinate Mapping (SequenceCoordinateMapper)

A foundational feature of Deeplotyper is mapping between genomic DNA coordinates and transcript (cDNA/mRNA) coordinates. The SequenceCoordinateMapper class constructs an internal mapping between a reference sequence (e.g. a genomic region) and one or more transcript definitions (exon/intron structure). This allows coordinates to be converted in both directions (genome → transcript and transcript → genome).

For example, given a gene’s reference DNA sequence and exon coordinates for multiple transcripts (splice variants), the mapper can:

  • Translate a genomic position to a position within a transcript (cDNA coordinate).
  • Identify which exon or intron a mutation falls into.
  • Account for strand orientation and splicing (including reverse-complement mappings).

By building a precise base-level map of exonic regions, SequenceCoordinateMapper provides the groundwork for consistent variant placement across transcripts and enables downstream analyses like coding sequence extraction.

Implementation detail: Internally, the mapper may produce a linear coordinate index for each transcript relative to the reference. For instance, if Transcript A has exons at 1–100 and 201–300 on the reference genome (1-based, inclusive), genomic position 250 is the 50th base of the second exon. The first exon contributes 100 bases, so it maps to position 150 of Transcript A’s cDNA.
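
The arithmetic in that example can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Deeplotyper's implementation: the function name genome_to_transcript is hypothetical, and it assumes forward-strand exons given as 1-based inclusive (start, end) pairs in transcript order.

```python
from typing import List, Optional, Tuple


def genome_to_transcript(pos: int, exons: List[Tuple[int, int]]) -> Optional[int]:
    """Map a 1-based genomic position to a 1-based cDNA position.

    Hypothetical helper: exons are (start, end), 1-based inclusive,
    forward strand, listed in transcript order.
    Returns None when the position is intronic or outside the transcript.
    """
    offset = 0  # transcript bases contributed by the exons already passed
    for start, end in exons:
        if start <= pos <= end:
            return offset + (pos - start + 1)
        offset += end - start + 1
    return None


transcript_a = [(1, 100), (201, 300)]
print(genome_to_transcript(250, transcript_a))  # 150
print(genome_to_transcript(150, transcript_a))  # None (intronic)
```

Returning None for intronic positions keeps "not in this transcript" distinct from a valid coordinate, which matters when the same variant is placed on several splice variants.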

Haplotype Remapping (HaplotypeRemapper)

Deeplotyper supports applying a set of genetic variants, collectively forming a haplotype, to reference sequences or transcripts. The HaplotypeRemapper class combines the coordinate mapping produced by SequenceCoordinateMapper with a haplotype map (a collection of variants such as SNVs, insertions, deletions, or complex multi-nucleotide changes) and remaps the reference sequence to produce the altered (haplotype) sequence.

  • Ensures variants are applied in the correct positions across multi-exon transcripts.
  • Handles insertions and deletions (indels), adjusting downstream coordinates.
  • Supports complex events like multi-base substitutions or combinations of proximal variants.
  • Can model gene fusions or structural rearrangements by mapping coordinates from two reference sequences into one combined transcript.

The output is typically a new sequence (e.g. the mutated cDNA), along with diffs or lists of changed positions for full transparency.
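
To make the coordinate bookkeeping concrete, here is a minimal sketch of applying SNVs and indels to a plain string. It is not the HaplotypeRemapper API: apply_variants and the (position, ref, alt) tuple layout are assumptions made for illustration, and the sketch assumes 0-based positions and non-overlapping variants.

```python
from typing import List, Tuple

# Hypothetical variant layout: (0-based position, ref allele, alt allele)
Variant = Tuple[int, str, str]


def apply_variants(reference: str, variants: List[Variant]) -> str:
    """Return `reference` with all variants applied (assumed non-overlapping)."""
    seq = reference
    # Work from right to left so an indel never shifts a position
    # that still has to be edited.
    for pos0, ref, alt in sorted(variants, key=lambda v: v[0], reverse=True):
        observed = reference[pos0:pos0 + len(ref)]
        if observed != ref:
            raise ValueError(f"ref allele mismatch at {pos0}: {observed!r} != {ref!r}")
        seq = seq[:pos0] + alt + seq[pos0 + len(ref):]
    return seq


reference = "ATGAAACCCGGGTAA"
variants = [
    (2, "G", "C"),    # SNV
    (6, "", "TT"),    # insertion before index 6
    (9, "GGG", ""),   # deletion
]
mutated = apply_variants(reference, variants)
print(mutated)  # ATCAAATTCCCTAA
```

Checking each ref allele against the untouched reference catches off-by-one errors in variant positions before they silently produce a wrong sequence.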

ORF Extraction (find_orfs and get_longest_orf)

To assess coding impacts, Deeplotyper can extract open reading frames (ORFs) from sequences:

  • find_orfs scans a nucleotide sequence to identify all ORFs bounded by start and stop codons in the correct reading frame.
  • get_longest_orf retrieves the longest ORF from a given sequence.

These functions help reveal variant-induced effects such as novel start codons, truncated proteins, or frameshifts. Graph representations of ORFs (nodes = exons/segments, edges = splice connections) are also supported for visualizing complex haplotypes.
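
As a rough illustration of what ORF scanning involves, the sketch below walks the three forward reading frames and reports every stretch from an ATG to the next in-frame stop codon. The names find_orfs_sketch and longest_orf_sketch are hypothetical; the library's find_orfs and get_longest_orf may use different rules (for example for reverse strands or missing start codons).

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs_sketch(seq: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, orf_sequence) with 0-based, end-exclusive coordinates."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the currently open ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orfs.append((start, i + 3, seq[start:i + 3]))
                start = None
    return orfs


def longest_orf_sketch(seq: str) -> str:
    """Return the longest ORF sequence, or an empty string if none is found."""
    orfs = find_orfs_sketch(seq)
    return max((orf for _, _, orf in orfs), key=len, default="")


example = "CCATGAAATAGGATGCCCGGGTGA"
print(find_orfs_sketch(example))    # [(12, 24, 'ATGCCCGGGTGA'), (2, 11, 'ATGAAATAG')]
print(longest_orf_sketch(example))  # ATGCCCGGGTGA
```

Running such a scan on a reference transcript and on its mutated counterpart makes a frameshift visible as a change in ORF boundaries.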

Sequence Alignment (make_aligner and apply_alignment_gaps)

For visualizing indels, Deeplotyper provides utilities for pairwise sequence alignment:

  • make_aligner returns a configured Biopython PairwiseAligner (global or local modes).
  • apply_alignment_gaps projects alignment gaps onto coordinate mappings or sequence strings, inserting dashes (-) to show indels.

Example alignment output:

Ref: ATGCCCACGT...
Alt: ATG--CACGT...

This aids in interpreting frameshifts or in-frame indels and their effects on codon numbering.
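
One way to picture projecting gaps onto coordinates is to walk the two gapped strings column by column and record which reference and alternate index each column corresponds to. The helper below is a hypothetical sketch of that idea and does not reproduce the signature or behaviour of apply_alignment_gaps; it assumes both strings come from the same alignment and use "-" for gaps.

```python
from typing import List, Optional, Tuple

Column = Tuple[Optional[int], Optional[int]]  # (ref index, alt index); None marks a gap


def gapped_columns(ref_aln: str, alt_aln: str) -> List[Column]:
    """Pair 0-based ref/alt indices for every alignment column."""
    if len(ref_aln) != len(alt_aln):
        raise ValueError("aligned strings must have equal length")
    columns: List[Column] = []
    ref_i = alt_i = 0
    for r, a in zip(ref_aln, alt_aln):
        columns.append((None if r == "-" else ref_i,
                        None if a == "-" else alt_i))
        # Advance only the sequence that actually has a base in this column.
        if r != "-":
            ref_i += 1
        if a != "-":
            alt_i += 1
    return columns


# Toy in-frame deletion of the codon AAA
cols = gapped_columns("ATGAAACCC", "ATG---CCC")
print(cols[3])  # (3, None): reference base with no counterpart in alt
print(cols[6])  # (6, 3): downstream bases shift by the deletion length
```

The resulting column list gives a direct ref → alt lookup, so codon numbering after an indel can be recomputed without re-aligning.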

Linear Coordinate Construction (build_linear_coords)

The build_linear_coords utility flattens a spliced transcript into a continuous cDNA or protein coordinate space and maps it back to genomic coordinates. This is useful for:

  • Creating lookup tables (e.g. transcript→genome).
  • Plotting gene models.
  • Adjusting coordinates after indels in haplotype transcripts.
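
A transcript → genome lookup table of the kind listed above can be sketched as follows. The function linear_coords_sketch and its arguments are assumptions for illustration and are not the signature of build_linear_coords; exons are taken as 1-based inclusive genomic intervals, and strand is 1 or -1 as in the Quickstart metadata.

```python
from typing import Dict, List, Tuple


def linear_coords_sketch(exons: List[Tuple[int, int]], strand: int = 1) -> Dict[int, int]:
    """Map 1-based transcript positions to 1-based genomic positions."""
    genomic: List[int] = []
    for start, end in sorted(exons):
        genomic.extend(range(start, end + 1))
    if strand == -1:
        # On the reverse strand the transcript reads the genome right to left.
        genomic.reverse()
    return {tx_pos: g_pos for tx_pos, g_pos in enumerate(genomic, start=1)}


exons = [(100, 102), (200, 202)]
print(linear_coords_sketch(exons))
# {1: 100, 2: 101, 3: 102, 4: 200, 5: 201, 6: 202}
print(linear_coords_sketch(exons, strand=-1))
# {1: 202, 2: 201, 3: 200, 4: 102, 5: 101, 6: 100}
```

Inverting the dictionary gives the genome → transcript direction for exonic positions.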

Example Use Case

from deeplotyper import SequenceCoordinateMapper, HaplotypeRemapper, find_orfs, get_longest_orf

# 1. Reference sequence (toy example)
gene_name = "GENE1"
chrom = "chr1"
strand = "+"

reference_seq = (
    "ATGGTcacct...TTAG"
)

# 2. Exon definitions for two transcripts
transcript1_exons = [(1, 300), (401, 600)]
transcript2_exons = [(1, 300), (501, 700)]

transcripts = {
    "Transcript1": {"exons": transcript1_exons, "strand": "+", "cds_start": 1, "cds_end": 600},
    "Transcript2": {"exons": transcript2_exons, "strand": "+", "cds_start": 1, "cds_end": 700}
}

mapper = SequenceCoordinateMapper(reference_seq, transcripts)

# 3. Define a haplotype (list of variant dicts)
haplotype = [
    {"pos": 50,  "ref": "G",   "alt": "T"},
    {"pos": 310, "ref": "",    "alt": "ACG"},
    {"pos": 450, "ref": "AGCT","alt": ""},
    {"pos": 480, "ref": "A",   "alt": "TT"},
]

# 4. Apply the haplotype and retrieve the mutated transcripts
remapper = HaplotypeRemapper(mapper, haplotype)

mut_seq_t1 = remapper.get_sequence("Transcript1")
mut_seq_t2 = remapper.get_sequence("Transcript2")

print(f"Transcript1 (mutated) length: {len(mut_seq_t1)}")
print(mut_seq_t1[40:60])

# 5. ORF extraction in mutated Transcript1
orfs = find_orfs(mut_seq_t1, assume_start_codon=True)
longest_orf = get_longest_orf(mut_seq_t1)
print(f"Number of ORFs: {len(orfs)}")
print(f"Longest ORF length: {len(longest_orf)}")

Addressing Limitations of VEP and Haplosaurus

Traditional VEP/Haplosaurus workflows have known limitations:

  • Complex variant support: VEP and Haplosaurus do not natively handle gene fusions, multi-exon deletions, or intronic/splice-site changes. Deeplotyper applies any user-specified set of variants.
  • Database dependency: They require multi-GB Ensembl caches and compiled APIs. Deeplotyper is pure-Python and works on user-provided sequences and coordinates.
  • Edge cases: They can fail on short transcripts or produce opaque “high impact” labels. Deeplotyper’s transparent implementation traces frameshifts and disrupted sequences.
  • Opacity: VEP relies on black-box predictors (SIFT/PolyPhen). Deeplotyper exposes explicit sequence changes, enabling direct inspection of altered codons or ORFs.

License

MIT

Contributing

We welcome contributions! Feel free to add requests in the issues section or directly contribute with a pull request.

Download files

Source Distribution

deeplotyper-2025.10.2a0.tar.gz (40.8 kB)

Uploaded Source

Built Distribution

deeplotyper-2025.10.2a0-py3-none-any.whl (29.0 kB)

Uploaded Python 3

File details

Details for the file deeplotyper-2025.10.2a0.tar.gz.

File metadata

  • Download URL: deeplotyper-2025.10.2a0.tar.gz
  • Size: 40.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for deeplotyper-2025.10.2a0.tar.gz
Algorithm Hash digest
SHA256 0ef4b2ed5794d822ba68fed52b118187dada195db102d84168da486c99ea6045
MD5 b3cbc80af5451bfe287662b1f6068f0b
BLAKE2b-256 f5c75de1238d87a9431aeb5049ced66d054d192d660df64889a80ded1b6022e8

File details

Details for the file deeplotyper-2025.10.2a0-py3-none-any.whl.

File hashes

Hashes for deeplotyper-2025.10.2a0-py3-none-any.whl
Algorithm Hash digest
SHA256 0d0a964d3a305a5778537b526ea39afc6aaa19ed3c78b31ceae943d512b8c787
MD5 64afa1e4845acb65fdbf56e621fa6078
BLAKE2b-256 8c52d56d75c0f5b9dfcd77dc9af4d191cc429d13cd2bdc96d2ab47108b943172
