Read, write, and manipulate SnapGene .dna files

SnapGene File Format Parser

SnapGene File Format Parser (SGFFP for short) is a reverse-engineered parser for SnapGene DNA, RNA, and protein file formats.

[!Important] I have tried to decode as many SnapGene block types as I can, but some are surely still missing. Please check your SnapGene file(s) with uv run sff check <your_snapgene_file> to see which blocks they contain. Unknown block types are reported with a [NEW] flag. If you find one, please open an issue and, if possible, either attach your file or dump the block output with the --examine/-e flag, e.g. uv run sff check <your_snapgene_file> -e 1> block.dump. Let's make parsing SnapGene files better together!

The parser reads SnapGene files into Python objects and exports to JSON, with a writer for creating new SnapGene files.

The project aims to be a minimalistic, fast, and useful tool for molecular biologists who need to parse large libraries of SnapGene files, or for developers building SnapGene-compatible applications.

Architecture

flowchart LR
    subgraph Input
        DNA[".dna file"]
        Bytes["bytes/stream"]
    end

    subgraph SGFFP
        Reader["SgffReader"]
        Object["SgffObject"]
        Writer["SgffWriter"]
    end

    subgraph Output
        JSON["JSON"]
        File[".dna file"]
    end

    DNA --> Reader
    Bytes --> Reader
    Reader --> Object
    Object --> Writer
    Object --> JSON
    Writer --> File

Installation

git clone https://github.com/merv1n34k/sgffp.git
cd sgffp
uv sync

Quick Start

from sgffp import SgffReader, SgffWriter

# Read a SnapGene file
sgff = SgffReader.from_file("plasmid.dna")

# Access data via typed properties
print(sgff.sequence.value)
print(sgff.features[0].name)

# Modify and write back
sgff.sequence.topology = "circular"
SgffWriter.to_file(sgff, "output.dna")

CLI Tool

uv run sff check plasmid.dna    # Inspect file blocks
uv run sff parse plasmid.dna    # Export to JSON
uv run sff info plasmid.dna     # Show file information

File Format

SnapGene uses a Type-Length-Value (TLV) binary format where each block contains:

Field   Size     Description
Type    1 byte   Block type identifier
Length  4 bytes  Payload size (big-endian)
Data    N bytes  Block payload
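
As a rough illustration of the layout in the table, the following sketch walks a byte string block by block using only the standard library. The names iter_blocks and make_block are hypothetical helpers written for this example, not part of the sgffp API, and the sample bytes are synthetic rather than taken from a real SnapGene file.

```python
import struct
from typing import Iterator, Tuple


def iter_blocks(data: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (block_type, payload) pairs from a TLV byte string."""
    offset = 0
    while offset < len(data):
        if offset + 5 > len(data):
            raise ValueError(f"truncated block header at offset {offset}")
        # 1-byte type followed by a 4-byte big-endian length
        block_type, length = struct.unpack_from(">BI", data, offset)
        offset += 5
        payload = data[offset:offset + length]
        if len(payload) != length:
            raise ValueError(f"truncated payload for block type {block_type}")
        yield block_type, payload
        offset += length


def make_block(block_type: int, payload: bytes) -> bytes:
    """Build one TLV block (hypothetical helper, used here for the demo)."""
    return struct.pack(">BI", block_type, len(payload)) + payload


# Synthetic input: two blocks back to back
sample = make_block(0, b"GATTACA") + make_block(6, b"<Notes/>")
blocks = list(iter_blocks(sample))
for block_type, payload in blocks:
    print(block_type, len(payload))
```

Because the length is stored up front, a reader can skip over block types it does not understand without decoding them.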

Data encoding varies by block type: UTF-8 for sequences, XML for annotations, 2-bit encoding for compressed DNA (GATC → 00/01/10/11), and LZMA compression for history blocks.
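
To make the 2-bit mapping concrete, here is a minimal sketch that packs and unpacks bases using G=00, A=01, T=10, C=11. The bit order (four bases per byte, first base in the highest bits) and the zero padding of the final byte are assumptions made for this illustration; the real Compressed DNA block may also carry header fields that are not modelled here. pack_bases and unpack_bases are hypothetical names, not sgffp functions.

```python
BASES = "GATC"  # position in this string is the 2-bit code: G=00, A=01, T=10, C=11


def pack_bases(seq: str) -> bytes:
    """Pack a GATC string into bytes, four bases per byte (assumed MSB-first)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            # str.index raises ValueError for anything other than G, A, T, C
            byte = (byte << 2) | BASES.index(base)
        # Left-align a short final chunk so the unused low bits are zero
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_bases(packed: bytes, length: int) -> str:
    """Unpack `length` bases from bytes produced by pack_bases."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    # The sequence length must be known separately to drop padding
    return "".join(bases[:length])


packed = pack_bases("GATTACA")
restored = unpack_bases(packed, 7)
print(packed.hex(), restored)
```

Seven bases fit in two bytes here instead of seven, which is the point of the compressed representation.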

Block Types

All known SnapGene block types and their encoding formats:

ID  Block Type            Format
0   DNA Sequence          UTF-8
1   Compressed DNA        2-bit GATC
5   Primers               XML
6   Notes                 XML
7   History Tree          LZMA + XML
8   Sequence Properties   XML
10  Features              XML
11  History Nodes         Binary + TLV
14  Custom Enzymes        XML
17  Alignable Sequences   XML
18  Sequence Trace        ZTR
21  Protein Sequence      UTF-8
28  Enzyme Visualization  XML
29  History Modifier      LZMA + XML
30  History Content       LZMA + TLV
32  RNA Sequence          UTF-8

Blocks not listed (2-4, 9, 12-13, 15-16, 19-20, 22-27, 31) are either unknown or internal SnapGene data.
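
One way to act on the table above is to dispatch on the block ID and decode the payload according to its listed format. The sketch below covers only the UTF-8, XML, and LZMA + XML rows and returns everything else as raw bytes. BLOCK_FORMATS and decode_payload are hypothetical names for this example, the XML element names in the usage lines are made up, and real payloads may contain extra bytes that this simplified decoder does not handle.

```python
import lzma
import xml.etree.ElementTree as ET

# Subset of the block table: block ID -> (name, format)
BLOCK_FORMATS = {
    0: ("DNA Sequence", "utf8"),
    5: ("Primers", "xml"),
    6: ("Notes", "xml"),
    7: ("History Tree", "lzma+xml"),
    8: ("Sequence Properties", "xml"),
    10: ("Features", "xml"),
    21: ("Protein Sequence", "utf8"),
    29: ("History Modifier", "lzma+xml"),
    32: ("RNA Sequence", "utf8"),
}


def decode_payload(block_type: int, payload: bytes):
    """Return (block name, decoded payload) for a single block."""
    name, fmt = BLOCK_FORMATS.get(block_type, ("Unknown", "raw"))
    if fmt == "utf8":
        return name, payload.decode("utf-8")
    if fmt == "xml":
        return name, ET.fromstring(payload)
    if fmt == "lzma+xml":
        # lzma.decompress auto-detects the .xz and legacy .lzma containers
        return name, ET.fromstring(lzma.decompress(payload))
    # Unlisted or unsupported block: hand back the bytes untouched
    return name, payload


# Synthetic payloads, not taken from a real file
seq_name, seq = decode_payload(0, b"GATTACA")
notes_name, notes = decode_payload(6, b"<Notes><Type>Synthetic</Type></Notes>")
hist_name, hist = decode_payload(7, lzma.compress(b"<HistoryTree/>"))
print(seq_name, seq, notes.tag, hist.tag)
```

Returning unknown blocks unchanged keeps the raw bytes available, so they can still be inspected or reported.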

Supported Block Types

The table below shows which block types can be read from and written to SnapGene files. Blocks marked with a Model have typed Python classes for convenient access (e.g., sgff.sequence, sgff.features, sgff.history).

ID  Block Type                  Read  Write  Model
0   DNA Sequence                +     +      +
1   Compressed DNA              +     +      +
5   Primers (XML)               +     +      +
6   Notes (XML)                 +     +      +
7   History Tree (XML)          +     +      +
8   Sequence Properties (XML)   +     +      +
10  Features (XML)              +     +      +
11  History Nodes               +     +      +
14  Custom Enzymes (XML)        +     +      -
17  Alignable Sequences (XML)   +     +      +
21  Protein Sequence            +     +      +
28  Enzyme Visualization (XML)  +     +      -
29  History Modifier (XML)      +     +      +
30  History Content (Nested)    +     +      +
32  RNA Sequence                +     +      +

Roadmap

  • Improve SGFF parsing, unify TLV strategy
  • Understand whole file structure
  • Correctly parse into readable format from all common blocks
  • Create writer for supported block types
  • Add comprehensive test suite (199 tests)
  • Parse XML into pure JSON format
  • Add write support for history blocks (LZMA compression)
  • Add typed model classes for easy data access
  • Documentation improvements

Acknowledgments

This project would not have been possible without previous work done by

License

Distributed under the MIT License; see LICENSE for details.
