A Python toolkit for extracting SCCmec sequences from Staphylococcus whole genome sequences

SCCmecExtractor

A Python toolkit for extracting and typing SCCmec and non-mec SCC elements from Staphylococcus and Mammaliicoccus whole genome sequences. The tool identifies attachment (att) sites, extracts complete SCC elements and performs gene-level typing of mec and ccr gene complexes.

Note: extraction requires both att sites to lie on the same contig as rlmH. Because of this same-contig requirement, assembly fragmentation is the main source of extraction failure, particularly in non-aureus species.

Overview

SCCmecExtractor provides five CLI commands that work together to identify, extract and type SCCmec and non-mec SCC elements:

Command	Description
`sccmec-pipeline`	Master pipeline orchestrating all steps (recommended)
`sccmec-locate-att`	Locate attachment (att) sites in genomic sequences
`sccmec-extract`	Extract SCC elements bounded by att site pairs
`sccmec-type`	Type extracted elements or WGS by mec and ccr gene content
`sccmec-report`	Merge extraction and typing results into a unified report

Key Capabilities

FASTA-only mode — no GFF annotation required; rlmH detected via BLAST against a 70-species reference database
Non-mec SCC detection — extracts SCC elements that carry ccr genes but lack mec genes
Composite element detection — identifies tandem/nested SCC elements with multiple att site pairs
Gene-level typing — classifies mec complex (mecA, mecB, mecC, mecD allotypes) and ccr complex (ccrA/B, ccrC allotypes) via BLAST
Cross-genus support — validated on Staphylococcus (64 species) and Mammaliicoccus (6 species)

Table of Contents

Installation
Requirements
Usage
Complete Workflow Examples
How It Works
Output Format
Troubleshooting
Citation
License
Contributing
Contact

Installation

Using Conda/Mamba (Recommended)

# Create a new environment
conda create -n sccmecextractor python=3.11
conda activate sccmecextractor

# Install dependencies
conda install -c conda-forge -c bioconda biopython blast

# Install Bakta only if you need GFF-based workflow
conda install -c conda-forge -c bioconda bakta

# Install SCCmecExtractor
pip install sccmecextractor

# Test that commands are available
sccmec-pipeline --help
sccmec-locate-att --help
sccmec-extract --help
sccmec-type --help
sccmec-report --help

If commands do not run, make sure the environment's bin/ directory is in your PATH:

export PATH="$CONDA_PREFIX/bin:$PATH"
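
As a quick check from Python, the sketch below reports which commands cannot be found on PATH. It uses only the standard library; the five names come from the Overview table, while `blastn` is an assumption standing in for whichever BLAST+ program your workflow needs.

```python
import shutil

# The five console commands listed in the Overview table.
SCCMEC_COMMANDS = [
    "sccmec-pipeline",
    "sccmec-locate-att",
    "sccmec-extract",
    "sccmec-type",
    "sccmec-report",
]


def missing_commands(names, path=None):
    """Return the names that cannot be found as executables on PATH."""
    return [name for name in names if shutil.which(name, path=path) is None]


if __name__ == "__main__":
    # "blastn" is an assumption, used here as a representative BLAST+ program.
    missing = missing_commands(SCCMEC_COMMANDS + ["blastn"])
    print("missing:", ", ".join(missing) if missing else "none")
```

An empty result means every command resolved; anything listed points to an environment that is not activated or a dependency that is not installed.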

Using pip

Note: installation with pip does not provide BLAST+ or Bakta. BLAST+ is required for ccr gene checks, typing and FASTA-only mode.

# Install SCCmecExtractor
pip install sccmecextractor

# Test that commands are available
sccmec-pipeline --help

Using Docker

Docker provides a containerised environment with all dependencies pre-installed, including BLAST+ and Bakta.

# Pull the pre-built image
docker pull alisonmacfadyen/sccmecextractor:latest

# Or build from source
git clone https://github.com/AlisonMacFadyen/SCCmecExtractor.git
cd SCCmecExtractor
docker build -t sccmecextractor:latest -f containers/Dockerfile .

Quick Start with Docker:

# FASTA-only mode (no Bakta database needed)
docker run --rm \
  -v $PWD:/work \
  sccmecextractor:latest \
  sccmec-pipeline --fna-dir genomes/ -o results/

# With GFF annotation
docker run --rm \
  -v $PWD:/work \
  sccmecextractor:latest \
  sccmec-pipeline -f genome.fna -g genome.gff3 -o results/

See CONTAINER_GUIDE.md for detailed Docker usage instructions.

Using Singularity

Singularity/Apptainer is ideal for HPC environments where Docker is not available.

# Pull from Docker Hub
singularity pull docker://alisonmacfadyen/sccmecextractor:latest

# Or build from definition file
singularity build sccmecextractor.sif containers/sccmecextractor.def

Quick Start with Singularity:

# FASTA-only mode
singularity exec \
  --bind $PWD:/work \
  sccmecextractor.sif \
  sccmec-pipeline --fna-dir genomes/ -o results/

# With GFF annotation
singularity exec \
  --bind $PWD:/work \
  sccmecextractor.sif \
  sccmec-pipeline -f genome.fna -g genome.gff3 -o results/

See CONTAINER_GUIDE.md for detailed Singularity usage instructions.

Requirements

Dependencies

Python 3.9+
Biopython
BLAST+ (bundled in containers)
Bakta (optional, only needed for the GFF-based workflow; bundled in containers)

Input Files

Genome sequence: .fasta or .fna file containing the assembled genome

When using GFF mode:

Gene annotations: .gff3 file with gene annotations (we recommend Bakta for annotation)

When using FASTA-only mode (--blast-rlmh): no annotation file is needed; rlmH is detected via BLAST against a bundled 70-species reference database

Bakta Database

Only required if using Bakta for GFF annotation:

# Light database (faster, smaller)
bakta_db download --output bakta_db --type light

# Full database
bakta_db download --output bakta_db

Usage

FASTA-only Mode (Recommended)

The simplest way to run SCCmecExtractor. No GFF annotation or Bakta database is required; rlmH is detected via BLAST. For a novel species, supplying your own rlmH reference with --rlmh-ref is recommended.

# Single genome
sccmec-pipeline -f genome.fna -o results/

# Multiple genomes
sccmec-pipeline -f genome1.fna genome2.fna genome3.fna -o results/

# Directory of genomes
sccmec-pipeline --fna-dir genomes/ -o results/

# With composite element detection and multithreading
sccmec-pipeline --fna-dir genomes/ --composite -o results/ -t 4

GFF Mode

If you have pre-computed GFF3 annotations (e.g. from Bakta), you can provide them. GFF files are matched to FASTA files by stem name.

# Single genome with GFF
sccmec-pipeline -f genome.fna -g genome.gff3 -o results/

# Directories of genomes and GFFs
sccmec-pipeline --fna-dir genomes/ --gff-dir annotations/ -o results/
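
Before a large run it can help to check which genomes will actually find a GFF partner. The sketch below pairs files by stem with `pathlib`. It is an illustration rather than the tool's own matching code, and it assumes that "stem" means the file name minus its final suffix (so `genome.fna` pairs with `genome.gff3`). The function name `pair_by_stem` is hypothetical.

```python
from pathlib import Path


def pair_by_stem(fasta_paths, gff_paths):
    """Pair each FASTA with the GFF3 file that shares its stem.

    Returns (pairs, unmatched): pairs maps stem -> (fasta, gff) and
    unmatched lists the FASTA files that have no GFF partner.
    """
    gff_by_stem = {Path(g).stem: Path(g) for g in gff_paths}
    pairs, unmatched = {}, []
    for fasta in map(Path, fasta_paths):
        gff = gff_by_stem.get(fasta.stem)
        if gff is None:
            unmatched.append(fasta)
        else:
            pairs[fasta.stem] = (fasta, gff)
    return pairs, unmatched


pairs, unmatched = pair_by_stem(
    ["genomes/a.fna", "genomes/b.fna"], ["annotations/a.gff3"]
)
print(sorted(pairs), [p.name for p in unmatched])
```

Note that `Path.stem` strips only the last suffix, so a compressed file such as `genome.fna.gz` would have the stem `genome.fna` under this assumption.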

Command Reference

`sccmec-pipeline`

Master pipeline that runs all steps: att site location, extraction, typing and report generation.

sccmec-pipeline [-h] (-f FNA [FNA ...] | --fna-dir FNA_DIR)
                [-g GFF [GFF ...] | --gff-dir GFF_DIR] [--blast-rlmh]
                [--rlmh-ref RLMH_REF] [--composite] -o OUTDIR [-t THREADS]

Argument	Description
`-f`, `--fna`	One or more FASTA/FNA genome files
`--fna-dir`	Directory of FASTA/FNA genome files
`-g`, `--gff`	One or more GFF3 annotation files (matched by stem name)
`--gff-dir`	Directory of GFF3 files (matched by stem name)
`--blast-rlmh`	Use BLAST for rlmH detection (auto-enabled when no GFF provided)
`--rlmh-ref`	Custom rlmH reference FASTA for BLAST detection
`--composite`	Extract to outermost boundary for composite elements
`-o`, `--outdir`	Output directory for all results
`-t`, `--threads`	Number of parallel threads (default: 1)
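
When driving the pipeline from a script, the argument rules above can be encoded once. The helper below is a hypothetical wrapper (`pipeline_command` is not part of the package); it only assembles an argument list from the flags documented in the table and enforces the two either/or groups shown in the usage line.

```python
def pipeline_command(outdir, fna=None, fna_dir=None, gff=None, gff_dir=None,
                     rlmh_ref=None, composite=False, threads=1):
    """Build an argv list for sccmec-pipeline from the documented arguments."""
    # Exactly one of -f / --fna-dir is required.
    if (fna is None) == (fna_dir is None):
        raise ValueError("give exactly one of fna or fna_dir")
    # -g and --gff-dir are optional but mutually exclusive.
    if gff is not None and gff_dir is not None:
        raise ValueError("give at most one of gff or gff_dir")

    cmd = ["sccmec-pipeline"]
    if fna is not None:
        cmd += ["-f", *[str(p) for p in fna]]
    else:
        cmd += ["--fna-dir", str(fna_dir)]
    if gff is not None:
        cmd += ["-g", *[str(p) for p in gff]]
    elif gff_dir is not None:
        cmd += ["--gff-dir", str(gff_dir)]
    if rlmh_ref is not None:
        cmd += ["--rlmh-ref", str(rlmh_ref)]
    if composite:
        cmd.append("--composite")
    cmd += ["-o", str(outdir), "-t", str(threads)]
    return cmd


print(pipeline_command("results/", fna_dir="genomes/", composite=True, threads=4))
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; building a list rather than a shell string avoids quoting problems with paths that contain spaces.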

`sccmec-locate-att`

Identifies attachment sites in genomic sequences.

sccmec-locate-att [-h] -f FNA [-g GFF] -o OUTFILE [--blast-rlmh] [--rlmh-ref RLMH_REF]

Argument	Description
`-f`, `--fna`	Input genome file (.fasta or .fna)
`-g`, `--gff`	Gene annotation file (.gff3 format, optional)
`-o`, `--outfile`	Output TSV file containing att site locations
`--blast-rlmh`	Use BLAST for rlmH detection (auto-enabled when no GFF provided)
`--rlmh-ref`	Custom rlmH reference FASTA

`sccmec-extract`

Extracts SCC elements bounded by att site pairs.

sccmec-extract [-h] -f FNA [-g GFF] -a ATT -s SCCMEC [--composite] [-r REPORT]
               [--blast-rlmh] [--rlmh-ref RLMH_REF]

Argument	Description
`-f`, `--fna`	Input genome file (.fasta or .fna)
`-g`, `--gff`	Gene annotation file (.gff3 format, optional)
`-a`, `--att`	TSV file from `sccmec-locate-att` with att site locations
`-s`, `--sccmec`	Output directory for extracted SCC sequences
`--composite`	Extract to outermost boundary for composite elements
`-r`, `--report`	Output TSV file for extraction report (appends for batch)
`--blast-rlmh`	Use BLAST for rlmH detection (auto-enabled when no GFF provided)
`--rlmh-ref`	Custom rlmH reference FASTA

`sccmec-type`

Types extracted SCC elements (or whole genomes) by mec and ccr gene content using BLAST.

sccmec-type [-h] -f FASTA [FASTA ...] -o OUTFILE [--mec-ref MEC_REF] [--ccr-ref CCR_REF]

Argument	Description
`-f`, `--fasta`	Input FASTA file(s) or directory of extracted SCC elements
`-o`, `--outfile`	Output TSV file for typing results
`--mec-ref`	Custom mec gene reference FASTA (default: bundled)
`--ccr-ref`	Custom ccr gene reference FASTA (default: bundled)

`sccmec-report`

Merges extraction and typing results into a unified report.

sccmec-report [-h] -e EXTRACTION_REPORT -t TYPING_RESULTS -o OUTFILE

Argument	Description
`-e`, `--extraction-report`	TSV from `sccmec-extract --report`
`-t`, `--typing-results`	TSV from `sccmec-type`
`-o`, `--outfile`	Output unified report TSV

Complete Workflow Examples

FASTA-only Mode (Recommended)

# One-liner for a directory of genomes
sccmec-pipeline --fna-dir genomes/ -o results/ -t 4

GFF Mode with Bakta Annotation

# 1. Annotate your genome with Bakta
bakta --db bakta_db genome.fna --output bakta_output

# 2. Run the pipeline with GFF
sccmec-pipeline -f genome.fna -g bakta_output/genome.gff3 -o results/

Step-by-Step (Individual Commands)

# 1. Locate att sites
sccmec-locate-att -f genome.fna -o att_sites/att_sites.tsv

# 2. Extract SCC elements
sccmec-extract -f genome.fna -a att_sites/att_sites.tsv -s sccmec/ -r extraction_report.tsv

# 3. Type extracted elements and whole genome
sccmec-type -f sccmec/ genome.fna -o typing_results.tsv

# 4. Generate unified report
sccmec-report -e extraction_report.tsv -t typing_results.tsv -o unified_report.tsv
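
For several genomes, the same four steps can be generated in a loop. The sketch below only builds the argument lists; it relies on the documented behaviour that `sccmec-extract --report` appends for batch use, and it assumes one att site TSV per genome named after the genome's stem. The function name `batch_commands` and the directory layout are illustrative choices, not something the package prescribes.

```python
from pathlib import Path


def batch_commands(genomes, workdir):
    """Return argv lists for steps 1-4 over several genomes.

    Every sccmec-extract call writes to the same report file, so one
    extraction report accumulates across the batch.
    """
    genomes = [Path(g) for g in genomes]
    work = Path(workdir)
    att_dir, scc_dir = work / "att_sites", work / "sccmec"
    report = work / "extraction_report.tsv"
    typing = work / "typing_results.tsv"

    commands = []
    for genome in genomes:
        att_tsv = att_dir / f"{genome.stem}.tsv"  # assumed naming: one TSV per genome
        commands.append(["sccmec-locate-att", "-f", str(genome), "-o", str(att_tsv)])
        commands.append(["sccmec-extract", "-f", str(genome), "-a", str(att_tsv),
                         "-s", str(scc_dir), "-r", str(report)])
    # Type the extracted elements and the whole genomes in one call.
    commands.append(["sccmec-type", "-f", str(scc_dir), *[str(g) for g in genomes],
                     "-o", str(typing)])
    commands.append(["sccmec-report", "-e", str(report), "-t", str(typing),
                     "-o", str(work / "unified_report.tsv")])
    return commands


for cmd in batch_commands(["genome1.fna", "genome2.fna"], "results"):
    print(" ".join(cmd))
```

Whether the individual commands create missing output directories is not stated here, so creating `att_sites/` and `sccmec/` beforehand is the safer choice.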

Docker

# FASTA-only mode (simplest)
docker run --rm \
  -v $PWD:/work \
  sccmecextractor:latest \
  sccmec-pipeline --fna-dir genomes/ -o results/

# With Bakta annotation
docker run --rm \
  -v $PWD:/work \
  -v ~/bakta_db:/data/bakta_db \
  sccmecextractor:latest \
  bash -c "
    bakta --db /data/bakta_db genome.fna --output bakta_output --prefix genome && \
    sccmec-pipeline -f genome.fna -g bakta_output/genome.gff3 -o results/
  "

Singularity

# FASTA-only mode (simplest)
singularity exec \
  --bind $PWD:/work \
  sccmecextractor.sif \
  sccmec-pipeline --fna-dir genomes/ -o results/

# With Bakta annotation
singularity exec \
  --bind $PWD:/work \
  --bind ~/bakta_db:/data/bakta_db \
  sccmecextractor.sif \
  bash -c "
    bakta --db /data/bakta_db genome.fna --output bakta_output --prefix genome && \
    sccmec-pipeline -f genome.fna -g bakta_output/genome.gff3 -o results/
  "

How It Works

Attachment Site Detection

The tool searches for 24 DNA motif patterns (8 attR/cattR + 16 attL/cattL) representing the attachment sites that flank SCC elements. These patterns use regex with degeneracy to account for sequence variation across species. attR sites are anchored within the rlmH gene, which serves as the chromosomal integration site.

attR/cattR: Right attachment sites (within rlmH), including CcrC-associated attR2/cattR2 and attR3/cattR3 variants
attL/cattL: Left attachment sites, including multiple bridge-length variants (attL2-attL8) covering diverse ccr-mediated recombination products
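
To illustrate what "regex with degeneracy" can look like, the sketch below expands a degenerate motif into a regular expression and scans both strands. It assumes degeneracy is written with IUPAC nucleotide codes, which is a guess about the encoding rather than a description of the tool's internals. The motif `GARTC` is a made-up toy and is not one of the 24 patterns.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse-complement a sequence or motif, IUPAC codes included."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def motif_to_regex(motif):
    """Expand each IUPAC code into a regex character class."""
    return re.compile("".join(IUPAC[base] for base in motif.upper()))


def find_sites(sequence, motif):
    """Return sorted (start, end, strand) hits in 0-based, half-open forward coordinates."""
    sequence = sequence.upper()
    hits = []
    # Searching the forward strand with the reverse-complemented motif
    # finds minus-strand sites without reversing the whole sequence.
    for strand, oriented in (("+", motif), ("-", reverse_complement(motif))):
        for match in motif_to_regex(oriented).finditer(sequence):
            hits.append((match.start(), match.end(), strand))
    return sorted(hits)


TOY_MOTIF = "GARTC"  # hypothetical, not one of the tool's patterns
print(find_sites("TTGAATCAAGACTCTT", TOY_MOTIF))
# [(2, 7, '+'), (9, 14, '-')]
```

`finditer` returns non-overlapping matches only, which is sufficient for a toy motif; overlapping sites would need a lookahead pattern.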

Extraction Logic

rlmH Detection: Locates the rlmH gene via GFF annotation or BLAST against a 70-species reference database
Site Validation: attR sites must fall within or near rlmH (within 100 bp)
Pair Selection: Identifies the closest attR-attL pair on the same contig
ccr Validation: Verifies ccr genes are present between the att sites. Elements without ccr are classified as no_ccr_element (not extracted)
Size Filtering: Rejects artefacts < 1,000 bp (pattern overlaps) and spurious matches > 200,000 bp
Composite Detection: Identifies tandem/nested elements with multiple att site pairs (with --composite)
Origin-Spanning Detection: Flags elements where chromosome linearisation splits the SCC across contig ends
Fallback Extraction: When standard extraction fails but attL is identified, the location of rlmH is used as a proxy for attR
Strand Awareness: Automatically handles reverse complement extraction when necessary
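
Three of the rules above (site validation, pair selection and size filtering) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the package's implementation: the `Site` record and function names are hypothetical, coordinates are taken as 0-based half-open, and element size is measured from the outer edges of the two att sites. The ccr check, composite handling and fallback extraction are left out.

```python
from dataclasses import dataclass

MIN_SIZE = 1_000      # smaller pairs are treated as pattern-overlap artefacts
MAX_SIZE = 200_000    # larger pairs are treated as spurious matches
RLMH_WINDOW = 100     # attR must lie within rlmH or at most this many bp away


@dataclass(frozen=True)
class Site:
    name: str     # e.g. "attR", "attL" or "rlmH"
    contig: str
    start: int    # 0-based, inclusive
    end: int      # 0-based, exclusive


def near_rlmh(site, rlmh):
    """True if the site is on rlmH's contig and within RLMH_WINDOW bp of it."""
    if site.contig != rlmh.contig:
        return False
    # Gap is 0 when the intervals overlap, otherwise the distance between them.
    gap = max(rlmh.start - site.end, site.start - rlmh.end, 0)
    return gap <= RLMH_WINDOW


def select_pair(att_rights, att_lefts, rlmh):
    """Return the closest valid (attR, attL, size) triple, or None."""
    best = None
    for att_r in att_rights:
        if not near_rlmh(att_r, rlmh):
            continue
        for att_l in att_lefts:
            if att_l.contig != att_r.contig:
                continue  # same-contig requirement
            size = max(att_r.end, att_l.end) - min(att_r.start, att_l.start)
            if not MIN_SIZE <= size <= MAX_SIZE:
                continue
            if best is None or size < best[2]:
                best = (att_r, att_l, size)
    return best


rlmh = Site("rlmH", "c1", 1000, 1480)
rights = [Site("attR", "c1", 5000, 5020), Site("attR", "c1", 1450, 1470)]
lefts = [Site("attL", "c1", 1900, 1920), Site("attL", "c1", 26450, 26470),
         Site("attL", "c1", 60000, 60020), Site("attL", "c2", 3000, 3020)]
print(select_pair(rights, lefts, rlmh))
```

In this toy input the first attR is too far from rlmH, the first attL gives an element under 1,000 bp, and the attL on `c2` fails the same-contig rule, so the 25,020 bp pairing is chosen.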

Gene-Level Typing

sccmec-type carries out gene-typing by BLAST-based detection of:

mec gene: mecA, mecB, mecC, mecD allotypes
ccr complex: ccrA/ccrB pairs and ccrC allotypes, incorporating all 22 ccr complex types.
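
As a rough picture of BLAST-based gene detection, the sketch below picks the best hit per reference gene from BLAST tabular output (the standard 12-column `-outfmt 6` layout). It is not the logic used by sccmec-type: the 90% identity cut-off, the choice of reference genes as queries, and the toy rows are all assumptions made for illustration.

```python
import csv
import io

# Column order of BLAST+ tabular output (-outfmt 6) with default fields.
OUTFMT6 = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(blast_text, min_identity=90.0):
    """Return {qseqid: row}, keeping the highest-bitscore hit that passes min_identity."""
    best = {}
    reader = csv.DictReader(io.StringIO(blast_text), fieldnames=OUTFMT6, delimiter="\t")
    for row in reader:
        if float(row["pident"]) < min_identity:
            continue
        current = best.get(row["qseqid"])
        if current is None or float(row["bitscore"]) > float(current["bitscore"]):
            best[row["qseqid"]] = row
    return best


# Made-up rows: reference genes as queries, one extracted element as subject.
TOY = (
    "mecA\telem1\t99.8\t2007\t4\t0\t1\t2007\t5000\t7006\t0.0\t3690\n"
    "ccrA2\telem1\t97.1\t1350\t39\t0\t1\t1350\t9000\t10349\t0.0\t2270\n"
    "ccrA1\telem1\t78.0\t1350\t297\t0\t1\t1350\t9000\t10349\t1e-150\t900\n"
    "ccrA2\telem1\t95.0\t600\t30\t0\t1\t600\t20000\t20599\t0.0\t940\n"
)
for gene, row in best_hits(TOY).items():
    print(gene, row["pident"], row["bitscore"])
```

A real classifier would also need a coverage threshold and a rule for resolving overlapping hits between closely related allotypes; both are omitted here.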

Output Format

Pipeline Output Directory Structure

results/
├── ambiguous_att_sites.tsv # Collates extraction-limited genomes that may be of interest
├── att_sites/              # att site locations (one TSV per genome)
│   └── *.tsv
├── sccmec/                 # Extracted SCC element FASTAs
│   └── *_SCCmec.fasta
├── typing/                 # Typing results for extracted + whole genomes
│   └── *.tsv
├── extraction_report.tsv   # Per-genome extraction status and coordinates
├── typing_results.tsv      # Gene-level typing for all inputs
└── sccmec_unified_report.tsv  # Merged extraction + typing report

Unified Report

The unified report (sccmec_unified_report.tsv) includes per-genome columns for extraction status, att site coordinates, element size, mec/ccr gene content, allotype classifications and typing method (e.g. sccmec or wgs).

Troubleshooting

Common Issues

No att sites found:

An empty result may be genuine if no sequence in the genome matches the att site patterns. Before concluding that, check the following:

Verify that the input FASTA file is properly formatted
If using GFF mode, ensure the GFF3 file corresponds to the same genome assembly

No rlmH gene found:

In FASTA-only mode, rlmH is detected via BLAST. If your species has a highly divergent rlmH, provide a custom reference with --rlmh-ref
In GFF mode, verify that gene annotation was performed correctly and rlmH is annotated as such
- The tool expects either a gene annotated as rlmH or a product named "Ribosomal RNA large subunit methyltransferase H" (case-insensitive)
rlmH is a conserved housekeeping gene — its absence may indicate an assembly issue

Missing attR-attL pairs (cross_contig):

Caused by assembly fragmentation: the att sites are on different contigs
Check the att site TSV output to see which sites were detected and on which contigs
Long-read sequencing or hybrid assembly can improve extraction rates
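
Checking the att site TSV by eye gets tedious across many genomes, so a small script can group sites by contig instead. The sketch below assumes the TSV has a header row; the column names `contig` and `site` are placeholders, since the actual headers are not listed here, and should be replaced with the ones in your file.

```python
import csv
import io
from collections import defaultdict


def sites_by_contig(tsv_text, contig_col="contig", site_col="site"):
    """Group att site names by contig from a tab-separated table with a header row."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        grouped[row[contig_col]].append(row[site_col])
    return dict(grouped)


def has_same_contig_pair(grouped):
    """True if any contig carries both a right (attR/cattR) and a left (attL/cattL) site."""
    for sites in grouped.values():
        has_right = any("attR" in name for name in sites)
        has_left = any("attL" in name for name in sites)
        if has_right and has_left:
            return True
    return False


# Toy table with hypothetical column names and values.
TOY_TSV = (
    "contig\tsite\tstart\tend\n"
    "contig_1\tattR\t100\t120\n"
    "contig_7\tcattL2\t50\t70\n"
)
grouped = sites_by_contig(TOY_TSV)
print(grouped, has_same_contig_pair(grouped))
```

For a file on disk, pass `Path("att_sites/genome.tsv").read_text()` as `tsv_text`. A result of `False` with both site types present matches the cross_contig situation described above.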

Missing attR-attL pairs (right_only):

This status most commonly means that attL was not found, so the SCC element cannot be extracted.
This is a current limitation of the tool

BLAST+ not found:

BLAST+ is required for FASTA-only mode (--blast-rlmh), typing and ccr location checks
Install via conda: conda install -c bioconda blast
BLAST+ is pre-installed in Docker and Singularity containers

Container-specific issues:

See CONTAINER_GUIDE.md for troubleshooting Docker and Singularity problems

Warning Messages

The tools provide informative warning messages to help diagnose issues:

Missing gene annotations
Incomplete att site pairs
File processing errors

Citation

If you use SCCmecExtractor in your research, please cite this repository:

MacFadyen, A.C. SCCmecExtractor: A toolkit for extracting and typing SCCmec elements
from Staphylococcus and Mammaliicoccus genomes.
GitHub repository: https://github.com/AlisonMacFadyen/SCCmecExtractor

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

Contact

Email: alison.macfadyen86@gmail.com

Acknowledgments

Bakta for bacterial genome annotation
The Biopython project for sequence manipulation tools
BLAST+ for sequence comparison/alignment
