A Python toolkit for extracting SCCmec sequences from Staphylococcus whole genome sequences
SCCmecExtractor
A Python toolkit for extracting SCCmec (Staphylococcal Cassette Chromosome mec) sequences from Staphylococcus whole genome sequences. This tool identifies attachment (att) sites and extracts the complete SCCmec element based on genomic context.
Note: the tool is quite stringent. To extract the DNA sequence of the SCCmec, it requires the att sites to be located on the same contig as each other and as the rlmH gene.
Overview
SCCmecExtractor consists of two main scripts that work together to identify and extract SCCmec sequences:
- locate_att_sites.py: Identifies attachment sites in genomic sequences
  - Canonical attR and its complement, cattR
  - Divergent CcrC-associated attR2 and its complement, cattR2
  - Canonical attL and its complement, cattL
  - Divergent CcrC-associated attL2 and its complement, cattL2
- extract_SCCmec.py: Extracts the SCCmec sequence based on identified att sites and gene annotations
Table of Contents
- Installation
- Requirements
- Usage
- Complete Workflow
- How It Works
- Output Format
- Troubleshooting
- Citation
- License
- Contributing
- Contact
Installation
Using Conda/Mamba
# Create a new environment
conda create -n sccmecextractor python=3.11
conda activate sccmecextractor
# Install dependencies
conda install -c conda-forge -c bioconda biopython bakta
# Install SCCmecExtractor
pip install sccmecextractor
# Test that scripts are available
sccmec-locate-att --help
sccmec-extract --help
If scripts do not run, make sure the environment’s bin/ directory is in your PATH:
export PATH="$CONDA_PREFIX/bin:$PATH"
Using pip
Note, installation with pip does not provide Bakta.
# Install SCCmecExtractor
pip install sccmecextractor
# Test that scripts are available
sccmec-locate-att --help
sccmec-extract --help
Using Docker
Docker provides a containerised environment with all dependencies pre-installed, including Bakta.
# Pull the pre-built image
docker pull alisonmacfadyen/sccmecextractor:latest
# Or build from source
git clone https://github.com/AlisonMacFadyen/SCCmecExtractor.git
cd SCCmecExtractor
docker build -t sccmecextractor:latest -f containers/Dockerfile .
Quick Start with Docker:
# Download Bakta Database (light in this example)
# Create a directory for the Bakta database
mkdir -p ~/bakta_db
# Download using Docker
docker run --rm -v ~/bakta_db/:/data/bakta_db \
sccmecextractor:latest \
bakta_db download --output /data/bakta_db --type light
# Run the complete pipeline
docker run --rm \
-v $PWD:/work \
-v ~/bakta_db:/data/bakta_db \
sccmecextractor:latest \
bash -c "bakta --db /data/bakta_db genome.fna.gz --output bakta_out && \
sccmec-locate-att -f genome.fna -g bakta_out/genome.gff3 -o att_sites.tsv && \
sccmec-extract -f genome.fna -g bakta_out/genome.gff3 -a att_sites.tsv -s output"
See CONTAINER_GUIDE.md for detailed Docker usage instructions.
Using Singularity
Singularity/Apptainer is ideal for HPC environments where Docker is not available.
# Build from definition file
singularity build sccmecextractor.sif containers/sccmecextractor.def
# Or pull from Docker Hub
singularity pull docker://alisonmacfadyen/sccmecextractor:latest
Quick Start with Singularity:
# Download Bakta Database (light in this example)
singularity exec \
--bind $PWD:/work \
sccmecextractor.sif \
bakta_db download --output ~/bakta_db --type light
# Run the complete pipeline
singularity exec \
--bind $PWD:/work \
--bind ~/bakta_db:/data/bakta_db \
sccmecextractor.sif \
bash -c "bakta --db /data/bakta_db genome.fna --output bakta_out && \
sccmec-locate-att -f genome.fna -g bakta_out/genome.gff3 -o att_sites.tsv && \
sccmec-extract -f genome.fna -g bakta_out/genome.gff3 -a att_sites.tsv -s output"
See CONTAINER_GUIDE.md for detailed Singularity usage instructions.
Requirements
Dependencies
- Python 3.9+
- Biopython (pip install biopython)
- Bakta (for genome annotation), automatically included in containers
Input Files
- Genome sequence: .fasta or .fna file containing the assembled genome. Note a compressed version is required to run Bakta.
- Gene annotations: .gff3 file with gene annotations (we recommend using Bakta for annotation)
Bakta Database
If using Bakta for annotation, you'll need to download the Bakta database:
# Light database (faster, smaller)
bakta_db download --output bakta_db --type light
# Full database
bakta_db download --output bakta_db
Usage
Step 1: Locate Attachment Sites
First, identify att sites in your genome:
sccmec-locate-att -f genome.fna -g genome.gff3 -o att_sites.tsv
Or using the Python script directly:
python src/sccmecextractor/locate_att_sites.py -f genome.fna -g genome.gff3 -o att_sites.tsv
Parameters:
- -f, --fna: Input genome file (.fasta or .fna)
- -g, --gff: Gene annotation file (.gff3 format)
- -o, --outfile: Output TSV file containing att site locations
Output: The script generates a TSV file with the following columns:
- Input_File
- Pattern (attR, attL, cattR, cattL, attR2, cattR2)
- Contig
- Start position
- End position
- Matching_Sequence
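For downstream scripting, the TSV can be loaded with the Python standard library. The sketch below is illustrative only: it reads columns by position in the order listed above, and it assumes the file starts with a single header row. The `AttSite` type, the `read_att_sites` function and the example rows are made up for this illustration and are not part of SCCmecExtractor.

```python
import csv
import io
from typing import NamedTuple


class AttSite(NamedTuple):
    """One row of the att site table (hypothetical helper type)."""
    input_file: str
    pattern: str
    contig: str
    start: int
    end: int
    sequence: str


def read_att_sites(handle, has_header=True):
    """Read att site rows from a tab-separated handle, by column position."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # assumption: the first row holds column names
    sites = []
    for row in reader:
        if len(row) < 6:
            continue  # skip blank or truncated lines
        sites.append(
            AttSite(row[0], row[1], row[2], int(row[3]), int(row[4]), row[5])
        )
    return sites


# Made-up rows for illustration, not real tool output.
example = (
    "Input_File\tPattern\tContig\tStart\tEnd\tMatching_Sequence\n"
    "genome\tattR\tcontig_1\t100\t118\tACGTACGTACGTACGTACG\n"
    "genome\tattL\tcontig_1\t25100\t25118\tTTGCATTGCATTGCATTGC\n"
)
sites = read_att_sites(io.StringIO(example))
print([(s.pattern, s.contig, s.start, s.end) for s in sites])
```

To read a real file, pass an open handle instead, for example `open("att_sites.tsv", newline="")`.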
Step 2: Extract SCCmec Sequences
Extract the SCCmec sequence using the identified att sites:
sccmec-extract -f genome.fna -g genome.gff3 -a att_sites.tsv -s output_directory
Or using the Python script directly:
python src/sccmecextractor/extract_SCCmec.py -f genome.fna -g genome.gff3 -a att_sites.tsv -s output_directory
Parameters:
- -f, --fna: Input genome file (.fasta or .fna)
- -g, --gff: Gene annotation file (.gff3 format)
- -a, --att: TSV file from step 1 containing att site locations
- -s, --sccmec: Output directory for extracted SCCmec sequences
Output:
The script creates a FASTA file named {genome}_SCCmec.fasta in the specified output directory containing the extracted SCCmec sequence.
Complete Workflow Example
Local Installation
# 1. Annotate your genome with bakta (recommended)
bakta --db bakta_db genome.fna --output bakta_output
# 2. Locate att sites
sccmec-locate-att -f genome.fna -g bakta_output/genome.gff3 -o att_sites.tsv
# 3. Extract SCCmec sequence
sccmec-extract -f genome.fna -g bakta_output/genome.gff3 -a att_sites.tsv -s sccmec_output
Docker
# Complete pipeline in one command
docker run --rm \
-v $PWD:/work \
-v ~/bakta_db:/data/bakta_db \
sccmecextractor:latest \
bash -c "
bakta --db /data/bakta_db genome.fna --output bakta_output --prefix genome && \
sccmec-locate-att -f genome.fna -g bakta_output/genome.gff3 -o att_sites.tsv && \
sccmec-extract -f genome.fna -g bakta_output/genome.gff3 -a att_sites.tsv -s sccmec_output
"
Singularity
# Complete pipeline in one command
singularity exec \
--bind $PWD:/work \
--bind ~/bakta_db:/data/bakta_db \
sccmecextractor.sif \
bash -c "
bakta --db /data/bakta_db genome.fna --output bakta_output --prefix genome && \
sccmec-locate-att -f genome.fna -g bakta_output/genome.gff3 -o att_sites.tsv && \
sccmec-extract -f genome.fna -g bakta_output/genome.gff3 -a att_sites.tsv -s sccmec_output
"
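When several genomes have already been annotated, the two SCCmecExtractor steps can be driven from Python. This is a minimal sketch, assuming the `sccmec-locate-att` and `sccmec-extract` commands are on the PATH and using only the flags documented above. The function names and the `{stem}_att_sites.tsv` naming scheme are choices made for this example, not conventions of the tool.

```python
import subprocess
from pathlib import Path


def build_commands(fasta, gff, out_dir):
    """Return the two command lines for one genome, in the order they must run."""
    fasta, gff, out_dir = Path(fasta), Path(gff), Path(out_dir)
    # Hypothetical naming scheme: one att site table per genome.
    att_tsv = out_dir / f"{fasta.stem}_att_sites.tsv"
    locate = [
        "sccmec-locate-att",
        "-f", str(fasta), "-g", str(gff), "-o", str(att_tsv),
    ]
    extract = [
        "sccmec-extract",
        "-f", str(fasta), "-g", str(gff), "-a", str(att_tsv), "-s", str(out_dir),
    ]
    return [locate, extract]


def run_genome(fasta, gff, out_dir):
    """Run both steps for one genome, stopping at the first failing command."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(fasta, gff, out_dir):
        subprocess.run(cmd, check=True)


# Print the commands without running them.
for cmd in build_commands("genome.fna", "bakta_output/genome.gff3", "sccmec_output"):
    print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with unusual file names, and `check=True` raises an error if the first step fails so the second step does not run on a missing table.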
How It Works
I hope to publish this tool someday, but until then here is an overview of how it works.
Attachment Site Detection
The tool searches for specific DNA motifs that represent attachment sites:
- attR: Right attachment site patterns
- attL: Left attachment site patterns
- cattR: Complementary right attachment sites
- cattL: Complementary left attachment sites
- attR2/cattR2: Alternative right attachment site patterns
- attL2/cattL2: Alternative left attachment site patterns
The script uses regex patterns with degeneracy to account for sequence variation in these sites.
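To illustrate the general idea of degenerate regex matching, the sketch below expands IUPAC ambiguity codes into character classes and searches for a motif and its reverse complement. It is not the tool's implementation: the motif `GARGCNTA`, the label `attX` and the example sequence are invented for this example, and the real att site patterns are defined in locate_att_sites.py.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement a sequence or a degenerate motif."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def motif_to_regex(motif):
    """Compile a degenerate motif into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in motif.upper()))


def find_sites(sequence, motif, name):
    """Return (label, start, end, match) tuples with 1-based inclusive coordinates.

    Hits for the reverse-complemented motif are labelled with a "c" prefix.
    """
    sequence = sequence.upper()
    hits = []
    for label, pattern in ((name, motif), ("c" + name, reverse_complement(motif))):
        for m in motif_to_regex(pattern).finditer(sequence):
            hits.append((label, m.start() + 1, m.end(), m.group()))
    return hits


# Hypothetical motif and sequence, for illustration only.
print(find_sites("TTGAAGCGTACCTACGCCTCAA", "GARGCNTA", "attX"))
# [('attX', 3, 10, 'GAAGCGTA'), ('cattX', 13, 20, 'TACGCCTC')]
```

Searching the forward sequence with the reverse-complemented motif reports complement hits in forward-strand coordinates, so both kinds of hit can be compared directly.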
SCCmec Extraction Logic
- Site Validation: Identifies the closest attR-attL pair on the same contig
- Gene Context: Locates the rlmH gene, which is used as a reference point
- Coordinate Determination: Calculates extraction coordinates based on rlmH position and att sites
- Sequence Extraction: Extracts the region between att sites with appropriate padding
- Orientation Handling: Automatically handles reverse complement extraction when necessary
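The pairing and extraction steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the logic in extract_SCCmec.py: it ignores the rlmH reference point and any padding, it represents each site as a hypothetical `(contig, start, end)` tuple with 1-based inclusive coordinates, and it assumes that an attR located downstream of attL means the element lies on the reverse strand.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def closest_pair(right_sites, left_sites):
    """Return the same-contig (right, left) pair spanning the shortest region."""
    best = None
    for right, left in product(right_sites, left_sites):
        if right[0] != left[0]:
            continue  # sites on different contigs cannot be paired
        span = max(right[2], left[2]) - min(right[1], left[1]) + 1
        if best is None or span < best[0]:
            best = (span, right, left)
    return None if best is None else (best[1], best[2])


def extract_region(contig_seq, right, left):
    """Cut the region covering both sites; reverse complement if attR is downstream."""
    start = min(right[1], left[1])
    end = max(right[2], left[2])
    region = contig_seq[start - 1:end]  # convert to 0-based slicing
    if right[1] > left[1]:
        # Assumption for this sketch: attR after attL implies reverse orientation.
        region = region.upper().translate(COMPLEMENT)[::-1]
    return start, end, region


# Toy data, for illustration only.
contig = "AAAACCCCGGGGTTTT"
rights = [("c1", 3, 5)]
lefts = [("c2", 6, 7), ("c1", 14, 16), ("c1", 10, 12)]
pair = closest_pair(rights, lefts)
print(pair)
print(extract_region(contig, *pair))
```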
Key Features
- Intelligent Filtering: attR and attR2 sites are only considered if they fall within rlmH genes
- Distance Optimisation: Selects the closest attR-attL pair to minimise extraction of non-SCCmec sequences
- Strand Awareness: Automatically detects and handles SCCmec elements on reverse strands
- Quality Control: Validates presence of required genes and att sites before extraction
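As an illustration of the filtering idea, the sketch below collects rlmH coordinates from GFF3 lines and keeps only the sites that lie inside them. It makes several assumptions that may not match the tool: that rlmH appears in the `gene` or `Name` attribute, that "within" means fully contained, and that sites are `(contig, start, end)` tuples. Both function names and the GFF3 lines are hypothetical.

```python
def rlmh_intervals(gff_lines):
    """Collect (contig, start, end) for features whose attributes name rlmH."""
    intervals = []
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # GFF3 files may embed the sequences after this marker
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        # Assumption: the gene symbol is stored under "gene" or "Name".
        if "rlmH" in (attrs.get("gene"), attrs.get("Name")):
            intervals.append((cols[0], int(cols[3]), int(cols[4])))
    return intervals


def sites_within_rlmh(sites, intervals):
    """Keep sites that lie entirely inside an rlmH interval on the same contig."""
    return [
        site for site in sites
        if any(
            contig == site[0] and lo <= site[1] and site[2] <= hi
            for contig, lo, hi in intervals
        )
    ]


# Made-up GFF3 lines and sites, for illustration only.
gff = [
    "##gff-version 3\n",
    "contig_1\tBakta\tCDS\t1000\t1480\t.\t+\t0\tID=X_1;gene=rlmH\n",
    "contig_1\tBakta\tCDS\t2000\t2900\t.\t+\t0\tID=X_2;gene=mecA\n",
]
sites = [("contig_1", 1462, 1480), ("contig_1", 2500, 2518), ("contig_2", 1462, 1480)]
print(sites_within_rlmh(sites, rlmh_intervals(gff)))
```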
Output Format
The extracted SCCmec sequence is saved as a FASTA file with:
- ID: {input_file}_{contig}_{start}_{end}
- Description: attR:{right_att_info}_attL:{left_att_info}
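The header can be read back with a few lines of standard-library Python. The sketch below assumes the ID and description are separated by whitespace, as in conventional FASTA headers, and recovers only the start and end coordinates, because input file and contig names may themselves contain underscores. The function names and the example record (including the `RIGHT` and `LEFT` placeholders) are invented for illustration.

```python
import io


def split_header(header):
    """Split a FASTA header (without '>') into (record_id, description)."""
    record_id, _, description = header.partition(" ")
    return record_id, description


def read_fasta(handle):
    """Yield (record_id, description, sequence) tuples from a FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield split_header(header) + ("".join(chunks),)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield split_header(header) + ("".join(chunks),)


def coordinates_from_id(record_id):
    """Return (prefix, start, end), taking the last two '_' fields as coordinates."""
    prefix, start, end = record_id.rsplit("_", 2)
    return prefix, int(start), int(end)


# Made-up record; RIGHT and LEFT stand in for the real att site details.
example = ">genome_contig_1_100_2500 attR:RIGHT_attL:LEFT\nACGTAC\nGTTT\n"
for record_id, description, sequence in read_fasta(io.StringIO(example)):
    print(coordinates_from_id(record_id), description, len(sequence))
```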
Troubleshooting
Common Issues
No att sites found:
- Check that your genome contains SCCmec elements
- Verify that the input FASTA file is properly formatted
- Ensure the GFF3 file corresponds to the same genome assembly
No rlmH gene found:
- This may indicate an issue with your input genome, as rlmH is a conserved gene in Staphylococcus
- Verify that gene annotation was performed correctly
- Check that the GFF3 file contains gene features with proper naming - rlmH must be annotated as such
Missing attR-attL pairs:
- Some genomes may have incomplete or atypical SCCmec elements
- Check the att_sites.tsv output to see which sites were detected
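When checking why no pair was found, it can help to tabulate per contig whether a right site, a left site and rlmH are all present, since extraction requires all three on one contig. The sketch below is a hypothetical diagnostic helper, not part of the tool. It takes `(pattern, contig)` tuples, for example from the Pattern and Contig columns of att_sites.tsv, plus a set of contigs known to carry rlmH.

```python
from collections import defaultdict

# Pattern names as listed in the Overview.
RIGHT = {"attR", "cattR", "attR2", "cattR2"}
LEFT = {"attL", "cattL", "attL2", "cattL2"}


def summarise_contigs(sites, rlmh_contigs):
    """Map each contig to flags for right site, left site and rlmH presence."""
    summary = defaultdict(lambda: {"right": False, "left": False, "rlmH": False})
    for pattern, contig in sites:
        if pattern in RIGHT:
            summary[contig]["right"] = True
        elif pattern in LEFT:
            summary[contig]["left"] = True
    for contig in rlmh_contigs:
        summary[contig]["rlmH"] = True
    return dict(summary)


def extractable_contigs(summary):
    """List contigs that carry a right site, a left site and rlmH together."""
    return sorted(c for c, flags in summary.items() if all(flags.values()))


# Made-up input, for illustration only.
sites = [("attR", "contig_1"), ("cattL", "contig_1"), ("attL", "contig_2")]
summary = summarise_contigs(sites, {"contig_1"})
for contig, flags in sorted(summary.items()):
    print(contig, flags)
print(extractable_contigs(summary))
```

A contig with a left site but no right site or rlmH, like `contig_2` here, often points to an SCCmec element split across contigs in a fragmented assembly.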
Container-specific issues:
- See CONTAINER_GUIDE.md for troubleshooting Docker and Singularity problems
Warning Messages
The tools provide informative warning messages to help diagnose issues:
- Missing gene annotations
- Incomplete att site pairs
- File processing errors
Citation
If you use SCCmecExtractor in your research, please cite this repository:
MacFadyen, A.C. SCCmecExtractor: A toolkit for extracting SCCmec sequences from Staphylococcus genomes.
GitHub repository: https://github.com/AlisonMacFadyen/SCCmecExtractor
Work in Progress
I aim to add Bakta annotation as part of the pipeline, and to include information on SCCmec gene carriage and typing. In the meantime, for typing, I recommend checking out this tool: sccmec
If you have any additional ideas, please let me know.
License
Contributing
Contributions are welcome! Please feel free to submit issues or pull requests.
Contact
Email: alison.macfadyen86@gmail.com
Acknowledgments
- Bakta for bacterial genome annotation
- The Biopython project for sequence manipulation tools