Fast Python library for plasmid annotation and makeability scoring

PlasmidKit

PlasmidKit is a fast Python library and CLI for annotating plasmid sequences and estimating a synthesis/assembly "makeability" score. It focuses on engineered plasmids (2–10 kb typical), validates that backbone essentials are present, and produces an interpretable quality score.

Quick Start

# Install with uv
uv sync

# Run the CLI
uv run plasmidkit --help

# Or install with pip
pip install -e .

What It Evaluates

Backbone Components:

Origins of replication (e.g., ColE1/pMB1, p15A, pSC101, RSF)
Selectable markers (e.g., AmpR/blaTEM, KanR/nptII, CmR/cat)
Promoters (e.g., lac, T7, CMV) and terminators
ORFs via Prodigal to confirm coding potential exists

Synthesis & Assembly Hygiene:

Length optimization (2–6 kb ideal)
GC content (45–55% ideal)
Repeat sequences and palindromes
Homopolymer runs
Forbidden motifs (e.g., BsaI, BsmBI, NotI)
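As a rough illustration of the hygiene checks listed above, the following sketch computes GC fraction, the longest homopolymer run, and restriction-site counts on both strands. It is not PlasmidKit's implementation: the function names are hypothetical, the sequence is treated as linear (sites spanning the circular junction are not counted), and only the well-known recognition sequences for BsaI, BsmBI, and NotI are used.

```python
import re

# Recognition sequences for the enzymes named in the list above.
FORBIDDEN_SITES = {
    "BsaI": "GGTCTC",
    "BsmBI": "CGTCTC",
    "NotI": "GCGGCCGC",
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of a single repeated base."""
    runs = re.finditer(r"A+|C+|G+|T+", seq.upper())
    return max((len(m.group()) for m in runs), default=0)


def count_sites(seq: str) -> dict[str, int]:
    """Count each forbidden site on both strands of a linear sequence."""
    seq = seq.upper()
    counts = {}
    for name, site in FORBIDDEN_SITES.items():
        # A palindromic site (e.g. NotI) equals its reverse complement,
        # so the set keeps it from being counted twice.
        patterns = {site, reverse_complement(site)}
        # The lookahead lets overlapping occurrences be counted.
        counts[name] = sum(len(re.findall(f"(?={p})", seq)) for p in patterns)
    return counts


# Hypothetical toy sequence: one BsaI site (forward), one NotI site,
# and one BsmBI site on the reverse strand (GAGACG).
demo = "ATGGTCTCAAAAAAGCGGCCGCGAGACGTT"
print(gc_fraction(demo), longest_homopolymer(demo), count_sites(demo))
```

How these raw measurements map onto score penalties is defined by the library and is not reproduced here.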

Example Usage

Python API

import json
import plasmidkit as pk

# Load a plasmid sequence
record = pk.load_record("tests/data/pUC19.fasta")

# Annotate features
annotations = pk.annotate(record)

# Fast annotation (skip ORF prediction)
annotations_fast = pk.annotate(record, skip_prodigal=True)

# Annotate a raw sequence string (must specify is_sequence=True)
raw_seq = "ATCG..."  # placeholder; substitute a full DNA sequence
annotations_raw = pk.annotate(raw_seq, is_sequence=True)

# Calculate quality score
score_report = pk.score(record, annotations=annotations)

# Display results
print(f"Found {len(annotations)} features")
print(f"Overall score: {score_report['total']:.1f}/100")

# Show first few annotations
for ann in annotations[:3]:
    print(f"  {ann.type}: {ann.id} at {ann.start}-{ann.end}")

Real Output (pUC19)

Annotations (first 5):

[
  {
    "type": "rep_origin",
    "id": "ColE1",
    "start": 2314,
    "end": 2903,
    "strand": "+",
    "method": "motif_fuzzy",
    "confidence": 1.0
  },
  {
    "type": "rep_origin",
    "id": "pBR322_origin",
    "start": 2298,
    "end": 2917,
    "strand": "-",
    "method": "motif_fuzzy",
    "confidence": 1.0
  },
  {
    "type": "marker",
    "id": "TEM-116",
    "start": 1283,
    "end": 2144,
    "strand": "+",
    "method": "motif_fuzzy",
    "confidence": 1.0
  },
  {
    "type": "promoter",
    "id": "AmpR_promoter_4",
    "start": 1178,
    "end": 1283,
    "strand": "+",
    "method": "motif_fuzzy",
    "confidence": 1.0
  },
  {
    "type": "promoter",
    "id": "lac_promoter",
    "start": 540,
    "end": 569,
    "strand": "-",
    "method": "motif_fuzzy",
    "confidence": 1.0
  }
]
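Annotation records like these can be post-processed with plain Python. The sketch below groups records by type and checks that an origin and a marker are both present. It works on dictionaries shaped like the JSON above rather than on PlasmidKit objects, and the "required types" rule and the helper names are assumptions made for illustration, not the library's validation logic.

```python
from collections import defaultdict

# Subset of the pUC19 annotations shown above, reduced to the fields used here.
annotations = [
    {"type": "rep_origin", "id": "ColE1", "start": 2314, "end": 2903},
    {"type": "marker", "id": "TEM-116", "start": 1283, "end": 2144},
    {"type": "promoter", "id": "lac_promoter", "start": 540, "end": 569},
]

# Assumed rule: a usable backbone needs at least one origin and one marker.
REQUIRED_TYPES = ("rep_origin", "marker")


def group_by_type(records):
    """Map each feature type to the list of feature ids found for it."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["type"]].append(rec["id"])
    return dict(grouped)


def missing_essentials(records, required=REQUIRED_TYPES):
    """Return the required feature types that have no annotation."""
    present = {rec["type"] for rec in records}
    return [t for t in required if t not in present]


print(group_by_type(annotations))
print(missing_essentials(annotations))      # nothing missing for this subset
print(missing_essentials(annotations[2:]))  # promoter only: both essentials missing
```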

Score Report:

{
  "total": 22.46,
  "components": {
    "length": 15.0,
    "gc": 10.0,
    "repeats": -0.54,
    "palindromes": -2.0,
    "homopolymers": -0.0,
    "BsaI": -1.0,
    "BsmBI": -1.0,
    "forbidden_motifs": -2.0,
    "ori_recognition": 8.0,
    "marker_recognition": -8.0,
    "promoter_terminator": 4.0,
    "burden": 0.0
  }
}
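The reported total is the plain sum of the component values, which can be checked directly. The sketch below copies the components from the report above. The split into synthesis, backbone, and burden groups is an assumption inferred from the Scoring System description under How It Works; the library may group the keys differently.

```python
# Components copied verbatim from the pUC19 score report above.
components = {
    "length": 15.0,
    "gc": 10.0,
    "repeats": -0.54,
    "palindromes": -2.0,
    "homopolymers": -0.0,
    "BsaI": -1.0,
    "BsmBI": -1.0,
    "forbidden_motifs": -2.0,
    "ori_recognition": 8.0,
    "marker_recognition": -8.0,
    "promoter_terminator": 4.0,
    "burden": 0.0,
}

# Assumed grouping of keys; everything not listed here is treated as synthesis hygiene.
BACKBONE = ("ori_recognition", "marker_recognition", "promoter_terminator")
BURDEN = ("burden",)

backbone = sum(components[k] for k in BACKBONE)
burden = sum(components[k] for k in BURDEN)
synthesis = sum(v for k, v in components.items() if k not in BACKBONE + BURDEN)
total = synthesis + backbone + burden

print(f"synthesis={synthesis:.2f} backbone={backbone:.2f} burden={burden:.2f}")
print(f"total={total:.2f}")
```

Under this grouping the synthesis terms contribute 18.46 and the backbone terms 4.00, which together reproduce the reported total of 22.46.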

Command Line

# Annotate a plasmid
uv run plasmidkit annotate tests/data/pUC19.fasta

# Skip slow ORF prediction
uv run plasmidkit annotate tests/data/pUC19.fasta --skip-prodigal

# Get detailed output with scores
uv run plasmidkit annotate tests/data/pUC19.fasta --output report.json

# View all detected features
uv run plasmidkit annotate tests/data/pUC19.fasta --verbose

How It Works

1. DNA Motif Matching: Uses pyahocorasick for fast multi-pattern scanning of known sequences
   - Supports fuzzy matching with configurable mismatches
   - Handles circular plasmids correctly (wrap-around search)
2. ORF Prediction: Runs Prodigal (pyrodigal) to identify coding regions
   - Validates that the plasmid has protein-coding potential
   - No exhaustive protein identification needed
3. Sequence Analysis: Evaluates synthesis/assembly properties
   - GC content, length optimization
   - Detects repeats, palindromes, homopolymers
   - Flags forbidden restriction sites
4. Scoring System:
   ```
   Total Score = Synthesis Hygiene + Backbone Recognition + Assembly Burden
   ```
   - Synthesis: length (15), GC (10), minus penalties for repeats/palindromes/homopolymers/forbidden motifs
   - Backbone: ori (8), marker (6), promoter/terminator (4 to 7)
   - Burden: penalties for high-copy + strong promoter combinations
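The wrap-around, mismatch-tolerant search described under DNA Motif Matching can be illustrated with a small brute-force sketch. This deliberately does not use pyahocorasick: it compares every window by Hamming distance, which is much slower than a multi-pattern automaton but makes the circular handling easy to see. The function names and the `max_mismatches` parameter are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_circular(seq: str, motif: str, max_mismatches: int = 0) -> list[int]:
    """Return 0-based start positions of motif on a circular sequence.

    A match may begin near the end of seq and continue past the origin.
    """
    seq, motif = seq.upper(), motif.upper()
    n, m = len(seq), len(motif)
    if m == 0 or m > n:
        return []
    # Append the first m-1 bases so windows can run across the junction.
    extended = seq + seq[: m - 1]
    return [
        start
        for start in range(n)
        if hamming(extended[start : start + m], motif) <= max_mismatches
    ]


# Toy circular sequence in which GGTCTC spans the junction (starts at index 8).
plasmid = "TCTCAAAAGG"
print(plasmid.find("GGTCTC"))            # linear search misses it
print(find_circular(plasmid, "GGTCTC"))  # circular search finds it
print(find_circular(plasmid, "GGTCTA", max_mismatches=1))
```

Extending the sequence by `len(motif) - 1` bases is a common way to reuse a linear scanner on circular input; the same trick applies when the scanner is an Aho-Corasick automaton.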

Feature Detection

PlasmidKit identifies:

| Feature Type | Detection Method | Examples |
| --- | --- | --- |
| rep_origin | Motif matching | ColE1, pMB1, p15A, pSC101, RSF1030 |
| marker | Motif matching + aliases | AmpR (TEM-1), KanR (nptII), CmR (cat) |
| promoter | Motif matching | lac, T7, CMV, AmpR promoter |
| terminator | Motif matching | rrnB T1, T7 terminator |
| cds | Prodigal ORF prediction | Any plausible coding sequence |

Data Sources

We curate signatures from public sources, with per-entry citations in plasmidkit/data/engineered_core_signatures.json:

- PlasMapper features API (promoters/terminators/origins)
  - Portal: https://plasmapper.wishartlab.com/search
  - API: https://plasmapper.ca/api/features
- NCBI ori sequences via query:
  - origin_of_replication[All Fields] AND (bacteria[filter] AND plasmid[filter])
  - Citation: https://www.ncbi.nlm.nih.gov/nuccore/<ACCESSION>
- UniProt (Swiss-Prot) for reviewed markers:
  - Examples: blaTEM-1 (P62593), nptII (P00552)
  - API: https://rest.uniprot.org/uniprotkb/{accession}
- CARD (Comprehensive Antibiotic Resistance Database):
  - Protein homolog models for bacterial AMR determinants
  - PHM entries: beta-lactamases, aminoglycoside-modifying enzymes, etc.
- SnapGene Standard Features export:
  - Engineered backbone motifs: promoters, terminators, origins, markers
  - Short DNA motifs for fast exact/fuzzy matching
  - Citation: { "database": "SnapGene", "source": "Standard Features export" }
- pLannotate bundle indices:
  - SnapGene/FPbase/Swiss-Prot indices
  - Rfam models for RNA features

Testing

PlasmidKit includes comprehensive tests that run automatically via GitHub Actions on every push.

Run tests locally:

# All tests
uv run pytest tests/ -v

# With coverage
uv run pytest tests/ --cov=plasmidkit --cov-report=html

# Specific test
uv run pytest tests/test_api.py::test_annotate_and_score -v

Test coverage includes:

Annotation accuracy across multiple plasmids (pUC19, pSC101, etc.)
Score calculation and component breakdown
Feature detection for ori, markers, promoters, terminators
Edge cases and circular sequence handling

Development

# Clone the repository
git clone https://github.com/McClain-Thiel/plasmid-kit.git
cd plasmid-kit

# Install with development dependencies
uv sync

# Run tests
uv run pytest

# Format and lint code
uv run black plasmidkit/ tests/
uv run ruff check plasmidkit/ tests/

Data Caching

Large database files are not stored in git. They are cached under plasmidkit/data/_cache/, which is excluded via .gitignore.

The default cache directory can be overridden with PLASMIDKIT_CACHE.
To prefetch caches for offline use:

uv run python -m plasmidkit.cli bootstrap --cache-dir plasmidkit/data/_cache

This warms up the built-in engineered-core@1.0.0 database. Optional external indices (BLAST/Rfam/SnapGene/SwissProt) can be placed under the cache dir if available.
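Code that needs to locate the cache could resolve the directory along the following lines. This is a sketch under two assumptions: that PLASMIDKIT_CACHE is read as an environment variable, and that an unset or empty value falls back to the default path mentioned above. The `resolve_cache_dir` helper is hypothetical and is not part of PlasmidKit's API.

```python
import os
from pathlib import Path

# Default location taken from the text above.
DEFAULT_CACHE = Path("plasmidkit/data/_cache")


def resolve_cache_dir(env=None) -> Path:
    """Pick the cache directory, preferring PLASMIDKIT_CACHE when set.

    `env` defaults to os.environ; passing a dict makes the function easy to test.
    Treating PLASMIDKIT_CACHE as an environment variable is an assumption.
    """
    env = os.environ if env is None else env
    override = env.get("PLASMIDKIT_CACHE")
    return Path(override).expanduser() if override else DEFAULT_CACHE


print(resolve_cache_dir({}))
print(resolve_cache_dir({"PLASMIDKIT_CACHE": "/tmp/pk-cache"}))
```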

Design Philosophy

Focus on Backbone Recognition:

PlasmidKit targets engineered backbone essentials rather than exhaustive protein identification
Motif-based detection for known functional elements (fast, interpretable)
ORF prediction confirms coding potential exists (no protein ID needed)

CDS Detection Strategy:

Motif matches for common selectable markers (presence implies function)
ORF prediction (via Prodigal/pyrodigal) to confirm plausible coding regions

Why not exhaustive annotation?

Engineered plasmids have predictable backbones
Exact/fuzzy motif matching is fast and accurate for known parts
Full protein BLAST is slow and often unnecessary for quality assessment

License

See LICENSE file for details.

Contributing

Contributions welcome! Please:

Fork the repository
Create a feature branch
Add tests for new functionality
Ensure all tests pass
Submit a pull request

Citation

If you use PlasmidKit in your research, please cite:

[Citation information to be added]

Project details

Version 0.1.0 (Dec 16, 2025)
