A Python toolkit for extracting SCCmec sequences from Staphylococcus whole genome sequences

SCCmecExtractor

A Python toolkit for extracting and typing SCCmec and non-mec SCC elements from Staphylococcus and Mammaliicoccus whole genome sequences. The tool identifies attachment (att) sites, extracts complete SCC elements and performs gene-level typing of mec and ccr gene complexes.

Note: extraction requires both att sites to be on the same contig as rlmH. Assembly fragmentation is therefore the main source of extraction failure, particularly in non-aureus species.

Overview

SCCmecExtractor provides five CLI commands that work together to identify, extract and type SCCmec and non-mec SCC elements:

Command            Description
sccmec-pipeline    Master pipeline orchestrating all steps (recommended)
sccmec-locate-att  Locate attachment (att) sites in genomic sequences
sccmec-extract     Extract SCC elements bounded by att site pairs
sccmec-type        Type extracted elements or WGS by mec and ccr gene content
sccmec-report      Merge extraction and typing results into a unified report

Key Capabilities

  • FASTA-only mode — no GFF annotation required; rlmH detected via BLAST against a 70-species reference database
  • Non-mec SCC detection — extracts SCC elements that carry ccr genes but lack mec genes
  • Composite element detection — identifies tandem/nested SCC elements with multiple att site pairs
  • Gene-level typing — classifies mec complex (mecA, mecB, mecC, mecD allotypes) and ccr complex (ccrA/B, ccrC allotypes) via BLAST
  • Cross-genus support — validated on Staphylococcus (64 species) and Mammaliicoccus (6 species)

Installation

Using Conda/Mamba (Recommended)

# Create a new environment
conda create -n sccmecextractor python=3.11
conda activate sccmecextractor

# Install dependencies
conda install -c conda-forge -c bioconda biopython blast

# Install Bakta only if you need GFF-based workflow
conda install -c conda-forge -c bioconda bakta

# Install SCCmecExtractor
pip install sccmecextractor

# Test that commands are available
sccmec-pipeline --help
sccmec-locate-att --help
sccmec-extract --help
sccmec-type --help
sccmec-report --help

If commands do not run, make sure the environment's bin/ directory is in your PATH:

export PATH="$CONDA_PREFIX/bin:$PATH"
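
The same check can be scripted from Python by looking each command up on PATH. This is a minimal sketch using only the standard library; the helper name `missing_commands` is hypothetical, and listing `blastn` as the BLAST+ executable to look for is an assumption, since this README does not state which BLAST+ program the tool calls.

```python
import shutil

# The five CLI entry points listed in the Overview, plus blastn as a
# stand-in for BLAST+ (assumption: the README does not name the binary).
EXPECTED = [
    "sccmec-pipeline",
    "sccmec-locate-att",
    "sccmec-extract",
    "sccmec-type",
    "sccmec-report",
    "blastn",
]


def missing_commands(commands=EXPECTED):
    """Return the commands that cannot be found on the current PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


if __name__ == "__main__":
    missing = missing_commands()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All commands found.")
```

If anything is reported as missing, activating the conda environment or extending PATH as shown above is the first thing to try.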

Using pip

Note: installation with pip does not provide BLAST+ or Bakta. BLAST+ is required for ccr gene checks, typing and FASTA-only mode.

# Install SCCmecExtractor
pip install sccmecextractor

# Test that commands are available
sccmec-pipeline --help

Using Docker

Docker provides a containerised environment with all dependencies pre-installed, including BLAST+ and Bakta.

# Pull the pre-built image
docker pull alisonmacfadyen/sccmecextractor:latest

# Or build from source
git clone https://github.com/AlisonMacFadyen/SCCmecExtractor.git
cd SCCmecExtractor
docker build -t sccmecextractor:latest -f containers/Dockerfile .

Quick Start with Docker:

# FASTA-only mode (no Bakta database needed)
docker run --rm \
  -v $PWD:/work \
  alisonmacfadyen/sccmecextractor:latest \
  sccmec-pipeline --fna-dir genomes/ -o results/

# With GFF annotation
docker run --rm \
  -v $PWD:/work \
  alisonmacfadyen/sccmecextractor:latest \
  sccmec-pipeline -f genome.fna -g genome.gff3 -o results/

See CONTAINER_GUIDE.md for detailed Docker usage instructions.

Using Singularity

Singularity/Apptainer is ideal for HPC environments where Docker is not available.

# Pull from Docker Hub
singularity pull docker://alisonmacfadyen/sccmecextractor:latest

# Or build from definition file
singularity build sccmecextractor.sif containers/sccmecextractor.def

Quick Start with Singularity:

# FASTA-only mode
singularity exec \
  --bind $PWD:/work \
  sccmecextractor.sif \
  sccmec-pipeline --fna-dir genomes/ -o results/

# With GFF annotation
singularity exec \
  --bind $PWD:/work \
  sccmecextractor.sif \
  sccmec-pipeline -f genome.fna -g genome.gff3 -o results/

See CONTAINER_GUIDE.md for detailed Singularity usage instructions.

Requirements

Dependencies

  • Python 3.9+
  • Biopython
  • BLAST+ (bundled in containers)
  • Bakta (optional, only needed for the GFF-based workflow; bundled in containers)

Input Files

  • Genome sequence: .fasta or .fna file containing the assembled genome

When using GFF mode:

  • Gene annotations: .gff3 file with gene annotations (we recommend Bakta for annotation)

When using FASTA-only mode (--blast-rlmh):

  • No annotation file is needed; rlmH is detected via BLAST against a bundled 70-species reference database

Bakta Database

Only required if using Bakta for GFF annotation:

# Light database (faster, smaller)
bakta_db download --output bakta_db --type light

# Full database
bakta_db download --output bakta_db

Usage

FASTA-only Mode (Recommended)

The simplest way to run SCCmecExtractor. No GFF annotation or Bakta database is required; rlmH is detected via BLAST. For a novel species, we recommend supplying an rlmH reference with --rlmh-ref.

# Single genome
sccmec-pipeline -f genome.fna -o results/

# Multiple genomes
sccmec-pipeline -f genome1.fna genome2.fna genome3.fna -o results/

# Directory of genomes
sccmec-pipeline --fna-dir genomes/ -o results/

# With composite element detection and multithreading
sccmec-pipeline --fna-dir genomes/ --composite -o results/ -t 4

GFF Mode

If you have pre-computed GFF3 annotations (e.g. from Bakta), you can provide them. GFF files are matched to FASTA files by stem name.

# Single genome with GFF
sccmec-pipeline -f genome.fna -g genome.gff3 -o results/

# Directories of genomes and GFFs
sccmec-pipeline --fna-dir genomes/ --gff-dir annotations/ -o results/
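
To check in advance which genomes will find a matching annotation, the stem-name rule can be approximated in a few lines of Python. This is an illustrative sketch, not the tool's own code: `pair_by_stem` is a hypothetical helper, and it assumes "stem" means the file name with its final extension removed, as `pathlib` defines it.

```python
from pathlib import Path


def pair_by_stem(fasta_paths, gff_paths):
    """Map each FASTA path to the GFF path with the same stem, or None."""
    gff_by_stem = {Path(g).stem: Path(g) for g in gff_paths}
    return {Path(f): gff_by_stem.get(Path(f).stem) for f in fasta_paths}


if __name__ == "__main__":
    pairs = pair_by_stem(
        ["genomes/sampleA.fna", "genomes/sampleB.fna"],
        ["annotations/sampleA.gff3"],
    )
    for fasta, gff in pairs.items():
        print(fasta.name, "->", gff.name if gff else "no GFF found")
```

Under this definition, `sample.v2.fna` has the stem `sample.v2`, so its annotation would need to be named `sample.v2.gff3`.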

Command Reference

sccmec-pipeline

Master pipeline that runs all steps: att site location, extraction, typing and report generation.

sccmec-pipeline [-h] (-f FNA [FNA ...] | --fna-dir FNA_DIR)
                [-g GFF [GFF ...] | --gff-dir GFF_DIR] [--blast-rlmh]
                [--rlmh-ref RLMH_REF] [--composite] -o OUTDIR [-t THREADS]
Argument        Description
-f, --fna       One or more FASTA/FNA genome files
--fna-dir       Directory of FASTA/FNA genome files
-g, --gff       One or more GFF3 annotation files (matched by stem name)
--gff-dir       Directory of GFF3 files (matched by stem name)
--blast-rlmh    Use BLAST for rlmH detection (auto-enabled when no GFF provided)
--rlmh-ref      Custom rlmH reference FASTA for BLAST detection
--composite     Extract to outermost boundary for composite elements
-o, --outdir    Output directory for all results
-t, --threads   Number of parallel threads (default: 1)

sccmec-locate-att

Identifies attachment sites in genomic sequences.

sccmec-locate-att [-h] -f FNA [-g GFF] -o OUTFILE [--blast-rlmh] [--rlmh-ref RLMH_REF]
Argument        Description
-f, --fna       Input genome file (.fasta or .fna)
-g, --gff       Gene annotation file (.gff3 format, optional)
-o, --outfile   Output TSV file containing att site locations
--blast-rlmh    Use BLAST for rlmH detection (auto-enabled when no GFF provided)
--rlmh-ref      Custom rlmH reference FASTA

sccmec-extract

Extracts SCC elements bounded by att site pairs.

sccmec-extract [-h] -f FNA [-g GFF] -a ATT -s SCCMEC [--composite] [-r REPORT]
               [--blast-rlmh] [--rlmh-ref RLMH_REF]
Argument        Description
-f, --fna       Input genome file (.fasta or .fna)
-g, --gff       Gene annotation file (.gff3 format, optional)
-a, --att       TSV file from sccmec-locate-att with att site locations
-s, --sccmec    Output directory for extracted SCC sequences
--composite     Extract to outermost boundary for composite elements
-r, --report    Output TSV file for extraction report (appends for batch)
--blast-rlmh    Use BLAST for rlmH detection (auto-enabled when no GFF provided)
--rlmh-ref      Custom rlmH reference FASTA

sccmec-type

Types extracted SCC elements (or whole genomes) by mec and ccr gene content using BLAST.

sccmec-type [-h] -f FASTA [FASTA ...] -o OUTFILE [--mec-ref MEC_REF] [--ccr-ref CCR_REF]
Argument        Description
-f, --fasta     Input FASTA file(s) or directory of extracted SCC elements
-o, --outfile   Output TSV file for typing results
--mec-ref       Custom mec gene reference FASTA (default: bundled)
--ccr-ref       Custom ccr gene reference FASTA (default: bundled)

sccmec-report

Merges extraction and typing results into a unified report.

sccmec-report [-h] -e EXTRACTION_REPORT -t TYPING_RESULTS -o OUTFILE
Argument                 Description
-e, --extraction-report  TSV from sccmec-extract --report
-t, --typing-results     TSV from sccmec-type
-o, --outfile            Output unified report TSV

Complete Workflow Examples

FASTA-only Mode (Recommended)

# One-liner for a directory of genomes
sccmec-pipeline --fna-dir genomes/ -o results/ -t 4

GFF Mode with Bakta Annotation

# 1. Annotate your genome with Bakta
bakta --db bakta_db genome.fna --output bakta_output

# 2. Run the pipeline with GFF
sccmec-pipeline -f genome.fna -g bakta_output/genome.gff3 -o results/

Step-by-Step (Individual Commands)

# 1. Locate att sites
sccmec-locate-att -f genome.fna -o att_sites/att_sites.tsv

# 2. Extract SCC elements
sccmec-extract -f genome.fna -a att_sites/att_sites.tsv -s sccmec/ -r extraction_report.tsv

# 3. Type extracted elements and whole genome
sccmec-type -f sccmec/ genome.fna -o typing_results.tsv

# 4. Generate unified report
sccmec-report -e extraction_report.tsv -t typing_results.tsv -o unified_report.tsv
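
For scripted use, the four steps above can be driven from Python with `subprocess`. The sketch below only reuses the flags shown in this README; the function names and the choice to place every output under a single working directory are assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def build_commands(fna, workdir):
    """Return the four step-by-step commands as argument lists."""
    work = Path(workdir)
    att_tsv = work / "att_sites" / "att_sites.tsv"
    scc_dir = work / "sccmec"
    extraction = work / "extraction_report.tsv"
    typing = work / "typing_results.tsv"
    unified = work / "unified_report.tsv"
    return [
        ["sccmec-locate-att", "-f", str(fna), "-o", str(att_tsv)],
        ["sccmec-extract", "-f", str(fna), "-a", str(att_tsv),
         "-s", str(scc_dir), "-r", str(extraction)],
        ["sccmec-type", "-f", str(scc_dir), str(fna), "-o", str(typing)],
        ["sccmec-report", "-e", str(extraction), "-t", str(typing),
         "-o", str(unified)],
    ]


def run_workflow(fna, workdir):
    """Run each step in order, stopping at the first non-zero exit code."""
    # Creating the att_sites directory up front is a precaution; whether
    # the tool creates it itself is not stated here.
    (Path(workdir) / "att_sites").mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(fna, workdir):
        subprocess.run(cmd, check=True)
```

Calling `run_workflow("genome.fna", "results_manual")` would then execute the same sequence as the shell commands, raising `subprocess.CalledProcessError` if any step fails.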

Docker

# FASTA-only mode (simplest)
docker run --rm \
  -v $PWD:/work \
  alisonmacfadyen/sccmecextractor:latest \
  sccmec-pipeline --fna-dir genomes/ -o results/

# With Bakta annotation
docker run --rm \
  -v $PWD:/work \
  -v ~/bakta_db:/data/bakta_db \
  alisonmacfadyen/sccmecextractor:latest \
  bash -c "
    bakta --db /data/bakta_db genome.fna --output bakta_output --prefix genome && \
    sccmec-pipeline -f genome.fna -g bakta_output/genome.gff3 -o results/
  "

Singularity

# FASTA-only mode (simplest)
singularity exec \
  --bind $PWD:/work \
  sccmecextractor.sif \
  sccmec-pipeline --fna-dir genomes/ -o results/

# With Bakta annotation
singularity exec \
  --bind $PWD:/work \
  --bind ~/bakta_db:/data/bakta_db \
  sccmecextractor.sif \
  bash -c "
    bakta --db /data/bakta_db genome.fna --output bakta_output --prefix genome && \
    sccmec-pipeline -f genome.fna -g bakta_output/genome.gff3 -o results/
  "

How It Works

Attachment Site Detection

The tool searches for 24 DNA motif patterns (8 attR/cattR + 16 attL/cattL) representing the attachment sites that flank SCC elements. These patterns use regex with degeneracy to account for sequence variation across species. attR sites are anchored within the rlmH gene, which serves as the chromosomal integration site.

  • attR/cattR: Right attachment sites (within rlmH), including CcrC-associated attR2/cattR2 and attR3/cattR3 variants
  • attL/cattL: Left attachment sites, including multiple bridge-length variants (attL2-attL8) covering diverse ccr-mediated recombination products
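
To illustrate what "regex with degeneracy" means in practice, the sketch below expands IUPAC ambiguity codes into regular-expression character classes and searches both strands. It is not the tool's implementation, and the motif used (`GCRTA`) is a made-up toy pattern, not one of the 24 att patterns; all function names are hypothetical.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def motif_to_regex(motif):
    """Compile a degenerate DNA motif into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in motif.upper()))


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_motif(seq, motif):
    """Return sorted (start, end, strand) hits in forward-strand coordinates.

    Coordinates are 0-based and half-open; overlapping hits on the same
    strand are not reported because finditer is non-overlapping.
    """
    seq = seq.upper()
    pattern = motif_to_regex(motif)
    hits = [(m.start(), m.end(), "+") for m in pattern.finditer(seq)]
    n = len(seq)
    for m in pattern.finditer(reverse_complement(seq)):
        # Convert reverse-strand match positions back to forward coordinates.
        hits.append((n - m.end(), n - m.start(), "-"))
    return sorted(hits)


if __name__ == "__main__":
    print(find_motif("TTGCGTATT", "GCRTA"))
```

Searching the reverse complement and mapping coordinates back is one simple way to make a pattern search strand-aware without maintaining a second set of reversed patterns.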

Extraction Logic

  1. rlmH Detection: Locates the rlmH gene via GFF annotation or BLAST against a 70-species reference database
  2. Site Validation: attR sites must fall within or near rlmH (within 100 bp)
  3. Pair Selection: Identifies the closest attR-attL pair on the same contig
  4. ccr Validation: Verifies ccr genes are present between the att sites. Elements without ccr are classified as no_ccr_element (not extracted)
  5. Size Filtering: Rejects artefacts < 1,000 bp (pattern overlaps) and spurious matches > 200,000 bp
  6. Composite Detection: Identifies tandem/nested elements with multiple att site pairs (with --composite)
  7. Origin-Spanning Detection: Flags elements where chromosome linearisation splits the SCC across contig ends
  8. Fallback Extraction: When standard extraction fails but an attL site is identified, the tool falls back to using the location of rlmH as a proxy for attR
  9. Strand Awareness: Automatically handles reverse complement extraction when necessary
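
Steps 2, 3 and 5 can be pictured with a small sketch. It reuses the thresholds quoted above (100 bp, 1,000 bp, 200,000 bp), but everything else is an assumption made for illustration: the `Site` record, the 0-based half-open coordinates, and the simplification of applying the size filter while searching for the closest pair. ccr validation, composite handling and fallback extraction are left out.

```python
from dataclasses import dataclass

MIN_SIZE = 1_000      # smaller candidates are treated as pattern overlaps
MAX_SIZE = 200_000    # larger candidates are treated as spurious matches
RLMH_WINDOW = 100     # attR must lie within or near rlmH


@dataclass(frozen=True)
class Site:
    """A named interval on a contig (hypothetical record for this sketch)."""
    name: str
    contig: str
    start: int
    end: int


def near_rlmh(site, rlmh, window=RLMH_WINDOW):
    """True if the site is on rlmH's contig and within `window` bp of it."""
    if site.contig != rlmh.contig:
        return False
    return site.start <= rlmh.end + window and site.end >= rlmh.start - window


def select_pair(att_r_sites, att_l_sites, rlmh):
    """Return (attR, attL, size) for the closest valid same-contig pair, or None."""
    best = None
    for r in att_r_sites:
        if not near_rlmh(r, rlmh):
            continue
        for l in att_l_sites:
            if l.contig != r.contig:
                continue
            size = max(r.end, l.end) - min(r.start, l.start)
            if not MIN_SIZE <= size <= MAX_SIZE:
                continue
            if best is None or size < best[2]:
                best = (r, l, size)
    return best
```

A return value of `None` in this sketch corresponds to the situations covered under Troubleshooting, for example att sites that sit on different contigs.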

Gene-Level Typing

sccmec-type performs gene-level typing by BLAST-based detection of:

  • mec gene: mecA, mecB, mecC and mecD allotypes
  • ccr complex: ccrA/ccrB pairs and ccrC allotypes, covering all 22 ccr complex types
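
As a rough picture of BLAST-based gene detection, the sketch below parses standard BLAST+ tabular output (`-outfmt 6`, default 12 columns) and keeps the best-scoring hit per query. How sccmec-type actually invokes BLAST and filters hits is not described here, so the use of reference genes as queries, the 90% identity cut-off and the function name are all assumptions, and the example rows are invented.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(tabular_text, min_identity=90.0):
    """Return {query: best hit} keeping only hits at or above min_identity."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(FIELDS, row))
        hit["pident"] = float(hit["pident"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["pident"] < min_identity:
            continue
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


if __name__ == "__main__":
    # Invented rows for illustration only.
    example = (
        "mecA\tcontig_1\t99.80\t2007\t4\t0\t1\t2007\t5000\t7006\t0.0\t3680\n"
        "mecA\tcontig_3\t91.20\t500\t44\t0\t1\t500\t100\t599\t1e-150\t650\n"
        "ccrC1\tcontig_1\t85.00\t1600\t240\t0\t1\t1600\t9000\t10599\t0.0\t1500\n"
    )
    for gene, hit in best_hits(example).items():
        print(gene, hit["sseqid"], hit["pident"])
```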

Output Format

Pipeline Output Directory Structure

results/
├── ambiguous_att_sites.tsv # Genomes with limited extraction that may be of interest
├── att_sites/              # att site locations (one TSV per genome)
│   └── *.tsv
├── sccmec/                 # Extracted SCC element FASTAs
│   └── *_SCCmec.fasta
├── typing/                 # Typing results for extracted + whole genomes
│   └── *.tsv
├── extraction_report.tsv   # Per-genome extraction status and coordinates
├── typing_results.tsv      # Gene-level typing for all inputs
└── sccmec_unified_report.tsv  # Merged extraction + typing report

Unified Report

The unified report (sccmec_unified_report.tsv) includes per-genome columns for extraction status, att site coordinates, element size, mec/ccr gene content, allotype classifications and typing method (e.g. sccmec or wgs).
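
For a quick summary of a batch run, the report can be tallied by any one of its columns with the standard library. The exact column headers are not listed in this README, so the sketch below takes the column name as a parameter and raises an error if it is absent; `tally_column` is a hypothetical helper.

```python
import csv
from collections import Counter


def tally_column(tsv_path, column):
    """Count how often each value occurs in one column of a TSV file."""
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if column not in (reader.fieldnames or []):
            raise KeyError(f"{column!r} not in header: {reader.fieldnames}")
        return Counter(row[column] for row in reader)
```

Check the header line of your own report first, then pass the name of the column you want to summarise.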

Troubleshooting

Common Issues

No att sites found:

This may be a genuine result if no sequence in the genome matches the att site patterns. However:

  • Verify that the input FASTA file is properly formatted
  • If using GFF mode, ensure the GFF3 file corresponds to the same genome assembly

No rlmH gene found:

  • In FASTA-only mode, rlmH is detected via BLAST. If your species has a highly divergent rlmH, provide a custom reference with --rlmh-ref
  • In GFF mode, verify that gene annotation was performed correctly and rlmH is annotated as such
    • The tool expects either a gene annotated as rlmH or a product named "Ribosomal RNA large subunit methyltransferase H" (case-insensitive)
  • rlmH is a conserved housekeeping gene — its absence may indicate an assembly issue
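
To check whether a GFF3 file would satisfy the annotation rule above, the attributes column (column 9) of each feature can be tested as sketched below. This is an approximation rather than the tool's own logic: reading the gene name from the `gene` attribute (falling back to `Name`), requiring an exact product match, and the function names are all assumptions.

```python
from urllib.parse import unquote

RLMH_PRODUCT = "ribosomal rna large subunit methyltransferase h"


def parse_attributes(column9):
    """Parse a GFF3 attributes string into a dict, decoding %-escapes."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = unquote(value.strip())
    return attrs


def looks_like_rlmh(column9):
    """True if the feature is named rlmH or carries the rlmH product name."""
    attrs = parse_attributes(column9)
    gene = attrs.get("gene", attrs.get("Name", "")).lower()
    product = attrs.get("product", "").lower()
    return gene == "rlmh" or product == RLMH_PRODUCT


if __name__ == "__main__":
    line = "ID=X_1;Name=rlmH;gene=rlmH;product=Ribosomal RNA large subunit methyltransferase H"
    print(looks_like_rlmh(line))
```

Running this over the ninth column of every CDS line is a quick way to see whether the annotation contains a feature the GFF mode could recognise.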

Missing attR-attL pairs (cross_contig):

  • Caused by assembly fragmentation: the att sites are on different contigs
  • Check the att site TSV output to see which sites were detected and on which contigs
  • Long-read sequencing or hybrid assembly can improve extraction rates

Missing attR-attL pairs (right_only):

  • Most often caused by a missing attL site; without both att sites, the SCC element cannot be extracted
  • This is a current limitation of the tool

BLAST+ not found:

  • BLAST+ is required for FASTA-only mode (--blast-rlmh), typing and ccr location checks
  • Install via conda: conda install -c bioconda blast
  • BLAST+ is pre-installed in Docker and Singularity containers

Warning Messages

The tools provide informative warning messages to help diagnose issues:

  • Missing gene annotations
  • Incomplete att site pairs
  • File processing errors

Citation

If you use SCCmecExtractor in your research, please cite this repository:

MacFadyen, A.C. SCCmecExtractor: A toolkit for extracting and typing SCCmec elements
from Staphylococcus and Mammaliicoccus genomes.
GitHub repository: https://github.com/AlisonMacFadyen/SCCmecExtractor

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

Contact

Email: alison.macfadyen86@gmail.com

Acknowledgments

  • Bakta for bacterial genome annotation
  • The Biopython project for sequence manipulation tools
  • BLAST+ for sequence comparison/alignment
