A Python toolkit for extracting SCCmec sequences from Staphylococcus whole genome sequences

SCCmecExtractor

A Python toolkit for extracting SCCmec (Staphylococcal Cassette Chromosome mec) sequences from Staphylococcus whole genome sequences. This tool identifies attachment (att) sites and extracts the complete SCCmec element based on genomic context.

Note: the tool is deliberately stringent. It extracts the SCCmec sequence only when the att sites are located on the same contig as each other and as the rlmH gene.

Overview

SCCmecExtractor consists of two main scripts that work together to identify and extract SCCmec sequences:

locate_att_sites.py - Identifies attachment sites in genomic sequences
- Canonical attR and its complement, cattR
- Divergent, CcrC-associated attR2 and its complement, cattR2
- Canonical attL and its complement, cattL
- Divergent, CcrC-associated attL2 and its complement, cattL2
extract_SCCmec.py - Extracts the SCCmec sequence based on identified att sites and gene annotations

Table of Contents

Installation
Requirements
Usage
Complete Workflow
How It Works
Output Format
Troubleshooting
Citation
License
Contributing
Contact

Installation

Using Conda/Mamba

# Create a new environment
conda create -n sccmecextractor python=3.11
conda activate sccmecextractor

# Install dependencies
conda install -c conda-forge -c bioconda biopython bakta

# Install SCCmecExtractor
pip install sccmecextractor

# Test that scripts are available
sccmec-locate-att --help
sccmec-extract --help

If scripts do not run, make sure the environment’s bin/ directory is in your PATH:

export PATH="$CONDA_PREFIX/bin:$PATH"
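
The availability check can also be scripted, which is handy when setting up several environments. This is a minimal sketch using only the standard library; the list of command names is taken from the steps above, and `bakta` is included on the assumption that you installed it through Conda.

```python
import shutil

# Entry points named in the installation steps above.
EXPECTED_COMMANDS = ["sccmec-locate-att", "sccmec-extract", "bakta"]


def missing_commands(commands):
    """Return the commands that cannot be found on the current PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


if __name__ == "__main__":
    missing = missing_commands(EXPECTED_COMMANDS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All expected commands are available.")
```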

Using pip

Note: installing with pip does not install Bakta.

# Install SCCmecExtractor
pip install sccmecextractor

# Test that scripts are available
sccmec-locate-att --help
sccmec-extract --help

Using Docker

Docker provides a containerised environment with all dependencies pre-installed, including Bakta.

# Pull the pre-built image
docker pull alisonmacfadyen/sccmecextractor:latest

# Or build from source
git clone https://github.com/AlisonMacFadyen/SCCmecExtractor.git
cd SCCmecExtractor
docker build -t sccmecextractor:latest -f containers/Dockerfile .

Quick Start with Docker:

# Download Bakta Database (light in this example)

# Create a directory for the Bakta database
mkdir -p ~/bakta_db

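# Download using Docker
# (if you pulled the pre-built image, use alisonmacfadyen/sccmecextractor:latest as the image name)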
docker run --rm -v ~/bakta_db/:/data/bakta_db \
  sccmecextractor:latest \
  bakta_db download --output /data/bakta_db --type light

# Run the complete pipeline
docker run --rm \
  -v $PWD:/work \
  -v ~/bakta_db:/data/bakta_db \
  sccmecextractor:latest \
  bash -c "bakta --db /data/bakta_db genome.fna --output bakta_out --prefix genome && \
           sccmec-locate-att -f genome.fna -g bakta_out/genome.gff3 -o att_sites.tsv && \
           sccmec-extract -f genome.fna -g bakta_out/genome.gff3 -a att_sites.tsv -s output"

See CONTAINER_GUIDE.md for detailed Docker usage instructions.

Using Singularity

Singularity/Apptainer is ideal for HPC environments where Docker is not available.

# Build from definition file
singularity build sccmecextractor.sif containers/sccmecextractor.def

# Or pull from Docker Hub
singularity pull docker://alisonmacfadyen/sccmecextractor:latest

Quick Start with Singularity:

# Download Bakta Database (light in this example)
singularity exec \
  --bind $PWD:/work \
  sccmecextractor.sif \
  bakta_db download --output ~/bakta_db --type light

# Run the complete pipeline
singularity exec \
  --bind $PWD:/work \
  --bind ~/bakta_db:/data/bakta_db \
  sccmecextractor.sif \
  bash -c "bakta --db /data/bakta_db genome.fna --output bakta_out && \
           sccmec-locate-att -f genome.fna -g bakta_out/genome.gff3 -o att_sites.tsv && \
           sccmec-extract -f genome.fna -g bakta_out/genome.gff3 -a att_sites.tsv -s output"

See CONTAINER_GUIDE.md for detailed Singularity usage instructions.

Requirements

Dependencies

Python 3.9+
Biopython (pip install biopython)
Bakta (for genome annotation) - automatically included in containers

Input Files

Genome sequence: .fasta or .fna file containing the assembled genome. Bakta accepts both plain and gzip-compressed FASTA input; the examples in this guide pass the uncompressed file to the SCCmecExtractor commands.
Gene annotations: .gff3 file with gene annotations (we recommend using Bakta for annotation)

Bakta Database

If using Bakta for annotation, you'll need to download the Bakta database:

# Light database (faster, smaller)
bakta_db download --output bakta_db --type light

# Full database
bakta_db download --output bakta_db

Usage

Step 1: Locate Attachment Sites

First, identify att sites in your genome:

sccmec-locate-att -f genome.fna -g genome.gff3 -o att_sites.tsv

Or using the Python script directly:

python src/sccmecextractor/locate_att_sites.py -f genome.fna -g genome.gff3 -o att_sites.tsv

Parameters:

-f, --fna: Input genome file (.fasta or .fna)
-g, --gff: Gene annotation file (.gff3 format)
-o, --outfile: Output TSV file containing att site locations

Output: The script generates a TSV file with the following columns:

Input_File
Pattern (attR, cattR, attL, cattL, attR2, cattR2, attL2, cattL2)
Contig
Start position
End position
Matching_Sequence
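
If you want to inspect the TSV programmatically, something like the following may help. It is a sketch only: it assumes the file has a header row and that the columns appear in the order listed above, and it reads them by position because the exact header names are not confirmed here. The `AttSite` record and `read_att_sites` function are hypothetical helpers, not part of SCCmecExtractor.

```python
import csv
import io
from typing import NamedTuple


class AttSite(NamedTuple):
    input_file: str
    pattern: str
    contig: str
    start: int
    end: int
    sequence: str


def read_att_sites(handle, has_header=True):
    """Parse att site rows from an open TSV handle, reading columns by position."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row (assumed to be present)
    sites = []
    for row in reader:
        if len(row) < 6:
            continue  # skip blank or truncated lines
        sites.append(AttSite(row[0], row[1], row[2], int(row[3]), int(row[4]), row[5]))
    return sites


# Made-up rows for illustration only; real coordinates and sequences will differ.
example = io.StringIO(
    "Input_File\tPattern\tContig\tStart\tEnd\tMatching_Sequence\n"
    "genome\tattR\tcontig_1\t100\t118\tACGTACGTACGTACGTAC\n"
    "genome\tcattL\tcontig_1\t30500\t30518\tTTGCATGCATGCATGCAA\n"
)
for site in read_att_sites(example):
    print(site.pattern, site.contig, site.start, site.end)
```

For a real file, open it with `open("att_sites.tsv", newline="")` and pass the handle to `read_att_sites`.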

Step 2: Extract SCCmec Sequences

Extract the SCCmec sequence using the identified att sites:

sccmec-extract -f genome.fna -g genome.gff3 -a att_sites.tsv -s output_directory

Or using the Python script directly:

python src/sccmecextractor/extract_SCCmec.py -f genome.fna -g genome.gff3 -a att_sites.tsv -s output_directory

Parameters:

-f, --fna: Input genome file (.fasta or .fna)
-g, --gff: Gene annotation file (.gff3 format)
-a, --att: TSV file from step 1 containing att site locations
-s, --sccmec: Output directory for extracted SCCmec sequences

Output: The script creates a FASTA file named {genome}_SCCmec.fasta in the specified output directory containing the extracted SCCmec sequence.

Complete Workflow Example

Local Installation

# 1. Annotate your genome with bakta (recommended)
bakta --db bakta_db genome.fna --output bakta_output

# 2. Locate att sites
sccmec-locate-att -f genome.fna -g bakta_output/genome.gff3 -o att_sites.tsv

# 3. Extract SCCmec sequence
sccmec-extract -f genome.fna -g bakta_output/genome.gff3 -a att_sites.tsv -s sccmec_output
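
For more than a handful of genomes, the three steps can be driven from Python. The sketch below only uses the flags shown in this README; the function names and the per-genome output naming (`{prefix}_bakta`, `{prefix}_att_sites.tsv`) are assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def build_pipeline_commands(genome, bakta_db, workdir="."):
    """Return the three command lines of the workflow as argument lists."""
    genome = Path(genome)
    workdir = Path(workdir)
    prefix = genome.stem  # e.g. "genome" for "genome.fna"
    bakta_out = workdir / f"{prefix}_bakta"
    gff = bakta_out / f"{prefix}.gff3"
    att_tsv = workdir / f"{prefix}_att_sites.tsv"
    sccmec_out = workdir / "sccmec_output"
    return [
        ["bakta", "--db", str(bakta_db), str(genome),
         "--output", str(bakta_out), "--prefix", prefix],
        ["sccmec-locate-att", "-f", str(genome), "-g", str(gff), "-o", str(att_tsv)],
        ["sccmec-extract", "-f", str(genome), "-g", str(gff),
         "-a", str(att_tsv), "-s", str(sccmec_out)],
    ]


def run_pipeline(genome, bakta_db, workdir="."):
    """Run each step in order, stopping at the first failure."""
    for cmd in build_pipeline_commands(genome, bakta_db, workdir):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the commands rather than running them, so the sketch is safe to try.
    for cmd in build_pipeline_commands("genome.fna", "bakta_db"):
        print(" ".join(cmd))
```

Passing `--prefix` explicitly keeps the GFF3 file name predictable regardless of how the input file is named.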

Docker

# Complete pipeline in one command
docker run --rm \
  -v $PWD:/work \
  -v ~/bakta_db:/data/bakta_db \
  sccmecextractor:latest \
  bash -c "
    bakta --db /data/bakta_db genome.fna --output bakta_output --prefix genome && \
    sccmec-locate-att -f genome.fna -g bakta_output/genome.gff3 -o att_sites.tsv && \
    sccmec-extract -f genome.fna -g bakta_output/genome.gff3 -a att_sites.tsv -s sccmec_output
  "

Singularity

# Complete pipeline in one command
singularity exec \
  --bind $PWD:/work \
  --bind ~/bakta_db:/data/bakta_db \
  sccmecextractor.sif \
  bash -c "
    bakta --db /data/bakta_db genome.fna --output bakta_output --prefix genome && \
    sccmec-locate-att -f genome.fna -g bakta_output/genome.gff3 -o att_sites.tsv && \
    sccmec-extract -f genome.fna -g bakta_output/genome.gff3 -a att_sites.tsv -s sccmec_output
  "

How It Works

I hope to publish this tool someday; until then, here is an overview of how it works.

Attachment Site Detection

The tool searches for specific DNA motifs that represent attachment sites:

attR: Right attachment site patterns
attL: Left attachment site patterns
cattR: Complementary right attachment sites
cattL: Complementary left attachment sites
attR2/cattR2: Divergent, CcrC-associated right attachment site patterns
attL2/cattL2: Divergent, CcrC-associated left attachment site patterns

The script uses regex patterns with degeneracy to account for sequence variation in these sites.
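
To illustrate the general idea of degenerate regex matching, here is a small self-contained sketch. It is not the tool's code: the motif `GARGCNTAT` is a made-up placeholder rather than a real att site pattern, the function names are hypothetical, and it assumes the "c" variants correspond to the motif on the opposite strand (its reverse complement).

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement an uppercase sequence, including IUPAC ambiguity codes."""
    return seq.translate(COMPLEMENT)[::-1]


def motif_to_regex(motif):
    """Compile a degenerate motif into a case-insensitive regex."""
    return re.compile("".join(IUPAC[base] for base in motif.upper()), re.IGNORECASE)


def find_sites(sequence, motif, name):
    """Yield (name, start, end, match) for a motif and its reverse complement."""
    searches = ((name, motif), ("c" + name, reverse_complement(motif.upper())))
    for label, pattern in searches:
        for hit in motif_to_regex(pattern).finditer(sequence):
            yield label, hit.start() + 1, hit.end(), hit.group()  # 1-based, inclusive


# Hypothetical motif, NOT the real att site pattern used by the tool.
toy_motif = "GARGCNTAT"
toy_sequence = "TTTGAAGCGTATCCCATAAGCCTCAAA"
for site in find_sites(toy_sequence, toy_motif, "attX"):
    print(site)
```

Expanding each ambiguity code into a character class lets a single pattern tolerate the expected variation at specific positions while still requiring exact matches elsewhere.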

SCCmec Extraction Logic

Site Validation: Identifies the closest attR-attL pair on the same contig
Gene Context: Locates the rlmH gene, which is used as a reference point
Coordinate Determination: Calculates extraction coordinates based on rlmH position and att sites
Sequence Extraction: Extracts the region between att sites with appropriate padding
Orientation Handling: Automatically handles reverse complement extraction when necessary
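
The pairing and orientation steps can be sketched roughly as follows. This is an illustration under stated assumptions rather than the tool's implementation: sites are plain `(contig, start, end)` tuples with 1-based inclusive coordinates, "closest" is taken to mean the smallest total span, the extracted region is oriented so that it begins at the right-hand site, and the rlmH checks and padding are left out.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def closest_pair(right_sites, left_sites):
    """Return the (right, left) pair on the same contig with the smallest span.

    Each site is a (contig, start, end) tuple with 1-based inclusive coordinates.
    """
    best = None
    for right, left in product(right_sites, left_sites):
        if right[0] != left[0]:
            continue  # sites must share a contig
        span = max(right[2], left[2]) - min(right[1], left[1]) + 1
        if best is None or span < best[0]:
            best = (span, right, left)
    return None if best is None else (best[1], best[2])


def extract_region(contig_seq, right, left):
    """Cut out the region spanning both sites, oriented to start at the right site."""
    start = min(right[1], left[1])
    end = max(right[2], left[2])
    region = contig_seq[start - 1:end]
    if right[1] > left[1]:
        # The right site lies downstream, so treat the element as reverse-strand.
        region = region.translate(COMPLEMENT)[::-1]
    return start, end, region


# Toy contig and made-up coordinates for illustration.
contig = "TTTTACGTGGGGGGCATGTTTT"
rights = [("c1", 5, 8)]
lefts = [("c1", 15, 18), ("c2", 9, 12)]
pair = closest_pair(rights, lefts)
if pair is not None:
    print(extract_region(contig, *pair))
```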

Key Features

Intelligent Filtering: attR and attR2 sites are only considered if they fall within rlmH genes
Distance Optimisation: Selects the closest attR-attL pair to minimise extraction of non-SCCmec sequences
Strand Awareness: Automatically detects and handles SCCmec elements on reverse strands
Quality Control: Validates presence of required genes and att sites before extraction
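
As a rough illustration of the rlmH filtering rule, the sketch below keeps only right-hand sites that sit inside an rlmH annotation. It assumes "fall within" means fully contained on the same contig; the function names and coordinates are hypothetical.

```python
def within_gene(site, gene):
    """Return True if the site lies entirely inside the gene on the same contig.

    Both arguments are (contig, start, end) tuples with 1-based inclusive coordinates.
    """
    return site[0] == gene[0] and gene[1] <= site[1] and site[2] <= gene[2]


def filter_right_sites(right_sites, rlmh_genes):
    """Keep only the right-hand sites contained in at least one rlmH annotation."""
    return [s for s in right_sites if any(within_gene(s, g) for g in rlmh_genes)]


# Made-up coordinates for illustration.
rlmh = [("contig_1", 1200, 1679)]
candidates = [
    ("contig_1", 1655, 1673),
    ("contig_1", 40210, 40228),
    ("contig_2", 1655, 1673),
]
print(filter_right_sites(candidates, rlmh))
```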

Output Format

The extracted SCCmec sequence is saved as a FASTA file with:

ID: {input_file}_{contig}_{start}_{end}
Description: attR:{right_att_info}_attL:{left_att_info}
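
The header can be split back into its parts with the standard library alone. This is a sketch with hypothetical helper names; the example record is made up, and the description is a placeholder because the exact content of the att info fields is not specified here. Since file and contig names may themselves contain underscores, only the trailing start and end are split off, and the remainder is returned as one prefix.

```python
import io


def read_fasta(handle):
    """Yield (record_id, description, sequence) from an open FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield (*header, "".join(chunks))
            record_id, _, description = line[1:].partition(" ")
            header, chunks = (record_id, description), []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield (*header, "".join(chunks))


def parse_record_id(record_id):
    """Split an ID of the form {input_file}_{contig}_{start}_{end} from the right."""
    prefix, start, end = record_id.rsplit("_", 2)
    return prefix, int(start), int(end)


# Made-up record for illustration only.
example = io.StringIO(
    ">genome_contig_1_1500_1519 attR:placeholder_attL:placeholder\n"
    "ACGTACGTAC\n"
    "GTACGTACGT\n"
)
for record_id, description, sequence in read_fasta(example):
    print(parse_record_id(record_id), description, len(sequence))
```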

Troubleshooting

Common Issues

No att sites found:

Check that your genome contains SCCmec elements
Verify that the input FASTA file is properly formatted
Ensure the GFF3 file corresponds to the same genome assembly

No rlmH gene found:

This may indicate a problem with your input genome, as rlmH is conserved in Staphylococcus
Verify that gene annotation was performed correctly
Check that the GFF3 file contains gene features with proper naming - rlmH must be annotated as such
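
A quick way to check the annotation is to scan the GFF3 file for rlmH yourself. The sketch below follows the standard nine-column GFF3 layout and looks for the name in the `gene` or `Name` attribute; which attribute SCCmecExtractor itself inspects is an assumption here, and `find_gene` is a hypothetical helper.

```python
import io


def find_gene(gff_handle, gene_name="rlmH"):
    """Return (contig, start, end, strand) for features annotated with gene_name."""
    hits = []
    for line in gff_handle:
        if line.startswith("##FASTA"):
            break  # annotation section is over; embedded sequence follows
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a well-formed feature line
        attributes = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        if gene_name in (attributes.get("gene"), attributes.get("Name")):
            hits.append((fields[0], int(fields[3]), int(fields[4]), fields[6]))
    return hits


# Made-up annotation lines for illustration only.
example = io.StringIO(
    "##gff-version 3\n"
    "contig_1\texample\tCDS\t1200\t1679\t.\t+\t0\tID=ABC_00010;gene=rlmH\n"
    "contig_1\texample\tCDS\t2000\t2900\t.\t-\t0\tID=ABC_00015;Name=hypothetical protein\n"
)
print(find_gene(example))
```

An empty list suggests that rlmH is either absent from the assembly or annotated under a different name. If the file holds both gene and CDS features for the same locus, the same coordinates may be reported more than once.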

Missing attR-attL pairs:

Some genomes may have incomplete or atypical SCCmec elements
Check the att_sites.tsv output to see which sites were detected

Container-specific issues:

See CONTAINER_GUIDE.md for troubleshooting Docker and Singularity problems

Warning Messages

The tools provide informative warning messages to help diagnose issues:

Missing gene annotations
Incomplete att site pairs
File processing errors

Citation

If you use SCCmecExtractor in your research, please cite this repository:

MacFadyen, A.C. SCCmecExtractor: A toolkit for extracting SCCmec sequences from Staphylococcus genomes. 
GitHub repository: https://github.com/AlisonMacFadyen/SCCmecExtractor

Work in Progress

I aim to add Bakta annotation as part of the pipeline, and to report SCCmec gene carriage and typing information. In the meantime, for typing, I recommend checking out this tool: sccmec

If you have any additional ideas, please let me know.

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

Contact

Email: alison.macfadyen86@gmail.com

Acknowledgments

Bakta for bacterial genome annotation
The Biopython project for sequence manipulation tools
