SCiMS: Sex Calling in Metagenomic Sequences

Sex Calling in Metagenomic Sequences

A tool for identifying the sex of a host organism based on the alignment of metagenomic sequences.

About The Project

Metagenomic sequencing data often contain a mix of host and non-host sequences. SCiMS salvages the mapping statistics of the reads that align to the host genome and uses them to identify the sex of the host organism. It applies statistical methods to these statistics to call the host's sex, providing host sex information for downstream analyses.
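
To make the idea concrete, here is a minimal sketch of how per-chromosome read counts can hint at host sex in an XY system. It is not the statistic SCiMS itself computes: the function name x_to_autosome_ratio and the toy numbers are assumptions made purely for illustration. The sketch normalises mapped reads by chromosome length and compares X coverage with pooled autosomal coverage, so a ratio near 1 would be expected for an XX host and a ratio near 0.5 for an XY host.

```python
def x_to_autosome_ratio(counts, lengths, x_id, autosome_ids):
    """Length-normalised X coverage divided by pooled autosomal coverage.

    Illustrative only: this is NOT the statistic SCiMS itself computes.
    """
    # Reads per base on the X chromosome
    x_density = counts[x_id] / lengths[x_id]
    # Reads per base pooled across the chosen autosomes
    auto_reads = sum(counts[a] for a in autosome_ids)
    auto_bases = sum(lengths[a] for a in autosome_ids)
    return x_density / (auto_reads / auto_bases)


# Toy numbers (hypothetical, not real data)
lengths = {"chr1": 1000, "chr2": 800, "chrX": 600}
counts_xx = {"chr1": 100, "chr2": 80, "chrX": 60}  # X covered like the autosomes
counts_xy = {"chr1": 100, "chr2": 80, "chrX": 30}  # X at half the autosomal coverage

ratio_xx = x_to_autosome_ratio(counts_xx, lengths, "chrX", ["chr1", "chr2"])
ratio_xy = x_to_autosome_ratio(counts_xy, lengths, "chrX", ["chr1", "chr2"])
print(ratio_xx, ratio_xy)
```

Real metagenomic samples usually carry few host reads, so a raw ratio like this is noisy; that is the situation SCiMS is designed to handle.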

Requirements

  • Python 3.9+
  • numpy, pandas, scipy, setuptools
  • (Optional) samtools for generating .idxstats files

Installation instructions

The simplest installation uses conda, which can maintain different versions of Python on the same machine.

# Create a new conda environment with Python 3.9
conda create -n scims python=3.9

# Activate the environment
conda activate scims

# Install SCiMS
pip install scims

To confirm that the installation was successful, run:

scims -h
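
The same check can be made from within Python. This is a minimal sketch that assumes the installed distribution is named scims, as in the pip command above; installed_version is a hypothetical helper and not part of SCiMS.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the package is installed in this environment, else None
print(installed_version("scims"))
```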

Usage

SCiMS can be used on any alignment data, regardless of the platform used for sequencing or the aligner that generated the alignment file.

scims --idxstats_file <sample.idxstats> \
        --scaffolds <scaffolds.txt> \
        --metadata <metadata_file.txt> \
        --system <XY or ZW> \
        --homogametic_id <chrom_id> \
        --heterogametic_id <chrom_id> \
        --id_column <sample-id> \
        --output <output_file.txt>

Option               Description
-h, --help           Show this help message and exit
--idxstats_file      Path to the .idxstats file for the sample
--scaffolds          Path to the scaffolds.txt file containing the scaffolds of interest
--heterogametic_id   The ID of the heterogametic sex chromosome
--homogametic_id     The ID of the homogametic sex chromosome
--system             The sex determination system (XY or ZW)
--output             Path to the output file
--threshold          [OPTIONAL] The threshold for the sex calling algorithm (default: 0.95)
--training_data      [OPTIONAL] If you have a training dataset, you can specify the path to the training data here
--multiple           [OPTIONAL] If you want to run SCiMS on multiple samples, you can specify this option [True or False, default: False]
--metadata_file      [OPTIONAL] If you have a metadata file and would like to add SCiMS predicted sex to the metadata file, you can specify the path to the metadata file here
--id_column          [OPTIONAL] The column name of the sample ID in the metadata file
--log                [OPTIONAL] Path to log file
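
When SCiMS is called from a Python pipeline, the command line can be assembled as a list of arguments. The following is a minimal sketch that uses only flags listed above; build_scims_command is a hypothetical helper, not part of SCiMS, and the file names are placeholders.

```python
import subprocess


def build_scims_command(idxstats_file, scaffolds, system, homogametic_id,
                        heterogametic_id, output, threshold=None):
    """Assemble a scims argument list from the options listed above."""
    cmd = [
        "scims",
        "--idxstats_file", idxstats_file,
        "--scaffolds", scaffolds,
        "--system", system,
        "--homogametic_id", homogametic_id,
        "--heterogametic_id", heterogametic_id,
        "--output", output,
    ]
    # Only pass the optional threshold when the caller sets it explicitly
    if threshold is not None:
        cmd += ["--threshold", str(threshold)]
    return cmd


def run_scims(cmd):
    """Run the assembled command; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(cmd, check=True)


cmd = build_scims_command(
    idxstats_file="sample.idxstats",   # placeholder path
    scaffolds="scaffolds.txt",         # placeholder path
    system="XY",
    homogametic_id="NC_000023.11",
    heterogametic_id="NC_000024.10",
    output="output_file.txt",          # placeholder path
)
print(" ".join(cmd))
```

Passing a list to subprocess.run avoids shell quoting problems with unusual file names; run_scims(cmd) would execute the command once scims is installed and the input files exist.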

Required input files

scaffolds.txt

Because most assemblies include scaffolds for DNA other than the nuclear genome (e.g., mitochondrial), you must define which scaffolds to use in the analysis. This is done with a scaffolds.txt file: a single-column text file in which each row is a scaffold ID. Here is an example:

NC_000001.11
NC_000002.12
NC_000003.12
NC_000004.12
NC_000005.10
NC_000006.12
NC_000007.14
NC_000008.11
...
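
A scaffolds.txt file can also be generated programmatically. The sketch below is one possible approach, assuming RefSeq-style IDs in which the chromosomes of interest share an NC_ prefix; select_scaffolds is a hypothetical helper, and the non-chromosomal IDs in the example list are illustrative. Which scaffolds belong in the file depends on your assembly.

```python
def select_scaffolds(reference_names, prefix="NC_", exclude=()):
    """Keep reference names with the given prefix that are not explicitly excluded."""
    excluded = set(exclude)
    return [
        name for name in reference_names
        if name.startswith(prefix) and name not in excluded
    ]


# Illustrative reference names: two chromosomes, one unplaced scaffold,
# and one mitochondrial record that we choose to exclude
reference_names = ["NC_000001.11", "NC_000002.12", "NT_187361.1", "NC_012920.1"]
scaffolds = select_scaffolds(reference_names, exclude=["NC_012920.1"])

# One scaffold ID per row, matching the single-column layout described above
scaffolds_text = "\n".join(scaffolds) + "\n"
print(scaffolds_text, end="")
```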

.idxstats files

A .idxstats file can easily be created with samtools. If you have a .bam file of interest, run the following commands to generate the .idxstats file:

samtools index <bam_file>
samtools idxstats <bam_file> > <prefix>.idxstats
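
The output of samtools idxstats is tab-delimited with four columns: reference name, sequence length, number of mapped read segments, and number of unmapped read segments. It ends with a `*` row for reads that have no reference. The sketch below shows one way to inspect such a file with the standard library; read_idxstats is a hypothetical helper, and the read counts in the example are made up.

```python
import csv
import io


def read_idxstats(handle):
    """Return {scaffold: (length, mapped_reads)} from samtools idxstats output."""
    stats = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and the trailing '*' row (reads without a reference)
        if not row or row[0] == "*":
            continue
        stats[row[0]] = (int(row[1]), int(row[2]))
    return stats


# Made-up read counts on three GRCh38 chromosomes
example = io.StringIO(
    "NC_000001.11\t248956422\t1200\t0\n"
    "NC_000023.11\t156040895\t700\t0\n"
    "NC_000024.10\t57227415\t3\t0\n"
    "*\t0\t0\t4500\n"
)
stats = read_idxstats(example)
for scaffold, (length, mapped) in stats.items():
    print(scaffold, length, mapped)
```

For a file on disk, pass an open file handle instead of the StringIO object.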

metadata_file.txt

A metadata file is required to run SCiMS. This file should contain at least one column, sample-id. The sample-id column should contain the sample IDs that are present in the .idxstats file.

Example:

sample-id	feature
sample1	A
sample2	B
sample3	C
sample4	D
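
Before running SCiMS, it can help to confirm that the metadata file is tab-delimited and contains the ID column you plan to pass to --id_column. This is a minimal sketch using the standard library; read_sample_ids is a hypothetical helper and not part of SCiMS.

```python
import csv
import io


def read_sample_ids(handle, id_column="sample-id"):
    """Return the sample IDs from a tab-delimited metadata file."""
    reader = csv.DictReader(handle, delimiter="\t")
    if id_column not in (reader.fieldnames or []):
        raise ValueError(f"metadata is missing the {id_column!r} column")
    return [row[id_column] for row in reader]


# In-memory stand-in for metadata_file.txt
metadata = io.StringIO("sample-id\tfeature\nsample1\tA\nsample2\tB\n")
sample_ids = read_sample_ids(metadata)
print(sample_ids)
```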

Example run

Example files can be found in the test_data folder.

Running SCiMS on a single sample

Change into the test_data folder and run the following command:

scims --idxstats_file ./idxstats_files/S79F300.idxstats \
      --scaffolds GRCh38_scaffolds.txt \
      --system XY \
      --homogametic_id NC_000023.11 \
      --heterogametic_id NC_000024.10 \
      --output test_output.txt

Output log:

2025-03-06 00:22:34,576 - INFO - Log file created at: out/scims.log
2025-03-06 00:22:34,576 - INFO -  
=================================================
2025-03-06 00:22:34,576 - INFO - 
    _|_|_|   _|_|_|  _|_|_|  _|      _|   _|_|_|  
    _|      _|         _|    _|_|  _|_|   _|        
    _|_|_|  _|         _|    _|  _|  _|   _|_|_|    
        _|  _|         _|    _|      _|       _|  
    _|_|_|   _|_|_|  _|_|_|  _|      _|   _|_|_|    
    =================================================
2025-03-06 00:22:34,576 - INFO - SCiMS: Sex Calling in Metagenomic Sequencing
2025-03-06 00:22:34,576 - INFO - Version: 1.0.0
2025-03-06 00:22:34,576 - INFO - =================================================
2025-03-06 00:22:34,591 - INFO - Results written to out/S79F300_results.txt

Output file:

$ cat out/S79F300_results.txt

Running SCiMS on multiple samples

scims --idxstats_folder idxstats_files/ \
      --scaffolds GRCh38_scaffolds.txt \
      --homogametic_id NC_000023.11 \
      --heterogametic_id NC_000024.10 \
      --output_dir out \
      --metadata metadata_file.txt \
      --id_column sample-id \
      --log log.txt

Output log:

2025-03-06 00:29:09,830 - INFO - Log file created at: out/scims.log
2025-03-06 00:29:09,830 - INFO -  
=================================================
2025-03-06 00:29:09,830 - INFO - 
    _|_|_|   _|_|_|  _|_|_|  _|      _|   _|_|_|  
    _|      _|         _|    _|_|  _|_|   _|        
    _|_|_|  _|         _|    _|  _|  _|   _|_|_|    
        _|  _|         _|    _|      _|       _|  
    _|_|_|   _|_|_|  _|_|_|  _|      _|   _|_|_|    
    =================================================
2025-03-06 00:29:09,830 - INFO - SCiMS: Sex Calling in Metagenomic Sequencing
2025-03-06 00:29:09,830 - INFO - Version: 1.0.0
2025-03-06 00:29:09,830 - INFO - =================================================
2025-03-06 00:29:09,845 - INFO - Results written to out/S28M1000000_results.txt
2025-03-06 00:29:09,846 - INFO - Updated metadata with classification results written to out/metadata_with_classification.txt
2025-03-06 00:29:09,848 - INFO - Results written to out/S56F150_results.txt
2025-03-06 00:29:09,849 - INFO - Updated metadata with classification results written to out/metadata_with_classification.txt
2025-03-06 00:29:09,851 - INFO - Results written to out/S79F300_results.txt
2025-03-06 00:29:09,852 - INFO - Updated metadata with classification results written to out/metadata_with_classification.txt
2025-03-06 00:29:09,854 - INFO - Results written to out/S90M250_results.txt
2025-03-06 00:29:09,855 - INFO - Updated metadata with classification results written to out/metadata_with_classification.txt
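
In the log above, each results file is named after its .idxstats file (for example, S79F300_results.txt). Assuming sample IDs are likewise taken from the .idxstats file names, which is an inference from this log rather than documented behaviour, the sketch below checks which metadata IDs have no matching file in the folder. sample_ids_from_folder is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def sample_ids_from_folder(folder):
    """List sample IDs as the stems of the *.idxstats files in a folder."""
    return sorted(p.stem for p in Path(folder).glob("*.idxstats"))


# Build a throwaway folder so the example runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    for name in ("S79F300", "S90M250"):
        (Path(tmp) / f"{name}.idxstats").touch()
    (Path(tmp) / "notes.txt").touch()  # ignored: not an .idxstats file
    found = sample_ids_from_folder(tmp)

# IDs as they might appear in the metadata file's sample-id column
metadata_ids = {"S79F300", "S90M250", "S56F150"}
missing = sorted(metadata_ids - set(found))
print(found, missing)
```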
