Python implementation of GetBaseCountsMultiSample (gbcms) for calculating base counts in BAM files
gbcms
Complete orientation-aware counting system for genomic variants
Features
- 🚀 High Performance: Rust-powered core engine with multi-threading
- 🧬 Complete Variant Support: SNP, MNP, insertion, deletion, and complex variants (DelIns, SNP+Indel)
- 🧪 WFA + PairHMM Phase 3: Pangenomic fast-path WFA alignment with PairHMM fallback for complex multi-allelic classification
- 📊 Orientation-Aware: Forward and reverse strand analysis with fragment counting
- 📏 mFSD (Mutant Fragment Size Distribution): Per-allele cfDNA fragment size profiling with KS test and log-likelihood ratio
- 🔬 Statistical Analysis: Fisher's exact test for strand bias (read-level and fragment-level)
- 📁 Flexible I/O: VCF and MAF input/output formats
- 🎯 Quality Filters: 8 configurable read and quality filtering options with heuristic BAQ
- 🧬 RNA Mode: Transcriptome-aware counting with strandedness, splice detection, and A-to-I editing
- 🔗 UMI Support: Molecule-level deduplication with UMI-aware fragment grouping
- 🔧 Normalize Command: Standalone variant normalization (left-align + REF validation) without counting
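To make the strand-bias feature above concrete, here is a minimal sketch of a two-sided Fisher's exact test on a 2x2 table of REF/ALT counts by strand. It is plain Python for illustration only, not the gbcms Rust implementation; the function name, the argument order, and the two-sided convention are assumptions.

```python
from math import comb


def fisher_strand_bias(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Two-sided Fisher's exact p-value for a 2x2 strand table.

    Rows are REF/ALT, columns are forward/reverse.
    (Hypothetical helper; not part of the gbcms API.)
    """
    n_ref = ref_fwd + ref_rev
    n_alt = alt_fwd + alt_rev
    n_fwd = ref_fwd + alt_fwd
    total = n_ref + n_alt
    if total == 0:
        return 1.0  # no reads, no evidence of bias

    denom = comb(total, n_fwd)

    def prob(k):
        # Hypergeometric probability that k of the forward reads are REF,
        # with all row and column totals held fixed.
        return comb(n_ref, k) * comb(n_alt, n_fwd - k) / denom

    k_min = max(0, n_fwd - n_alt)
    k_max = min(n_ref, n_fwd)
    p_observed = prob(ref_fwd)

    # Sum every table that is no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    p_value = sum(
        p for p in (prob(k) for k in range(k_min, k_max + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)
```

A small p-value suggests that ALT reads are distributed across strands differently from REF reads. The same function could be applied to fragment-level counts instead of read-level counts.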
Installation
Quick install:
pip install gbcms
From source (requires Rust):
git clone https://github.com/msk-access/gbcms.git
cd gbcms
pip install .
Docker:
docker pull ghcr.io/msk-access/gbcms:X.Y.Z # Replace X.Y.Z with latest from PyPI
📖 Full documentation: https://msk-access.github.io/gbcms/
Usage
gbcms can be used in two ways:
🔧 Option 1: Standalone CLI (1-10 samples)
Best for: Quick analysis, local processing, direct control
gbcms dna \
--variants variants.vcf \
--bam sample1.bam \
--fasta reference.fa \
--output-dir results/
Output: results/sample1.vcf
🔄 Option 2: Nextflow Workflow (10+ samples, HPC)
Best for: Many samples, HPC clusters (SLURM), reproducible pipelines
nextflow run nextflow/main.nf \
--input samplesheet.csv \
--variants variants.vcf \
--fasta reference.fa \
--mode dna \
-profile slurm
Features:
- ✅ Automatic parallelization across samples
- ✅ SLURM/HPC integration
- ✅ Container support (Docker/Singularity)
- ✅ Resume failed runs
Which Should I Use?
| Scenario | Recommendation |
|---|---|
| 1-10 samples, local machine | CLI |
| 10+ samples, HPC cluster | Nextflow |
| Quick ad-hoc analysis | CLI |
| Production pipeline | Nextflow |
| Need auto-parallelization | Nextflow |
| Full manual control | CLI |
Quick Examples
CLI: DNA Single Sample
gbcms dna \
--variants variants.vcf \
--bam tumor.bam \
--fasta hg19.fa \
--output-dir results/ \
--threads 4
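If you drive the CLI from a Python script, the command above can be assembled as an argument list. This is a sketch that uses only the flags shown in this README; `build_dna_command` is a hypothetical helper and not part of the gbcms package.

```python
import shlex


def build_dna_command(variants, bam, fasta, output_dir, threads=None):
    """Return the `gbcms dna` invocation as an argument list.

    Hypothetical helper: it only mirrors the flags shown in this README.
    """
    cmd = [
        "gbcms", "dna",
        "--variants", str(variants),
        "--bam", str(bam),
        "--fasta", str(fasta),
        "--output-dir", str(output_dir),
    ]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    return cmd


cmd = build_dna_command("variants.vcf", "tumor.bam", "hg19.fa", "results/", threads=4)
print(shlex.join(cmd))  # shell-quoted form, handy for logging

# To actually run it (requires gbcms on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.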
CLI: RNA-seq
gbcms rna \
--variants variants.vcf \
--bam rna_sample:aligned.bam \
--fasta hg38.fa \
--rna-editing-db TABLE1_hg38.txt.gz \
--output-dir results/
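For context on the A-to-I editing feature: sequencers read inosine as guanosine, so an editing event appears as A>G relative to the reference on plus-strand transcripts and as T>C on minus-strand transcripts. The sketch below shows that signature check plus a lookup in a set of known sites. The function names and the in-memory site set are illustrative assumptions; it does not reflect how gbcms parses `--rna-editing-db` or annotates sites.

```python
def has_a_to_i_signature(ref, alt, transcript_strand):
    """True if a single-base change looks like A-to-I editing.

    Alleles are given on the reference (plus) strand, as in a VCF.
    (Hypothetical helper; not part of the gbcms API.)
    """
    pair = (ref.upper(), alt.upper())
    if transcript_strand == "+":
        return pair == ("A", "G")
    if transcript_strand == "-":
        # A>G on the minus strand is reported as T>C on the plus strand.
        return pair == ("T", "C")
    raise ValueError("transcript_strand must be '+' or '-'")


def is_known_editing_site(chrom, pos, known_sites):
    """Look up a (chrom, 1-based pos) pair in a set of known editing sites."""
    return (chrom, pos) in known_sites


# Toy site set with made-up coordinates, for illustration only.
known_sites = {("chr1", 1000)}
candidate = has_a_to_i_signature("A", "G", "+") and is_known_editing_site(
    "chr1", 1000, known_sites
)
```

Checking both the substitution type and a site database helps separate likely editing events from genuine variants with the same base change.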
CLI: Normalize Variants
gbcms normalize \
--variants variants.vcf \
--fasta hg19.fa \
--output-dir results/
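As a rough picture of what "left-align + REF validation" means, the sketch below checks REF against the reference sequence, then repeatedly trims shared trailing bases and extends to the left, and finally trims redundant leading bases. This is the widely used trim-and-shift normalization procedure written in plain Python; whether gbcms follows exactly these steps is an assumption, and `normalize_variant` is a hypothetical name.

```python
def normalize_variant(ref_seq, pos, ref, alt):
    """Left-align and trim a variant against a contig sequence.

    ref_seq: contig sequence as a string; pos: 1-based position of REF.
    Returns (pos, ref, alt). Hypothetical helper, not the gbcms API.
    """
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("REF and ALT are identical")

    # REF validation: the stated REF must match the reference sequence.
    if ref_seq[pos - 1 : pos - 1 + len(ref)].upper() != ref:
        raise ValueError("REF allele does not match the reference")

    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, pull in one base from the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot left-extend at contig start")
            pos -= 1
            base = ref_seq[pos - 1].upper()
            ref, alt = base + ref, base + alt
            changed = True

    # Drop shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# A CA deletion reported in the middle of a CA repeat shifts to its left edge.
contig = "GGGCACACACAGGG"
shifted = normalize_variant(contig, 8, "CAC", "C")
```

Normalizing before counting means that the same indel written at different positions within a repeat is treated as one variant.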
CLI: Multiple Samples (Sequential)
gbcms dna \
--variants variants.vcf \
--bam-list samples.txt \
--fasta hg19.fa \
--output-dir results/
Nextflow: Many Samples (Parallel)
# samplesheet.csv:
# sample,bam,bai
# tumor1,/path/to/tumor1.bam,
# tumor2,/path/to/tumor2.bam,
nextflow run nextflow/main.nf \
--input samplesheet.csv \
--variants variants.vcf \
--fasta hg19.fa \
--mode dna \
--outdir results \
-profile slurm
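For larger cohorts, the samplesheet can be generated rather than typed by hand. The sketch below writes the `sample,bam,bai` layout shown above and leaves `bai` empty, as in that example. Deriving the sample name from the BAM file stem is an assumption, and `write_samplesheet` is a hypothetical helper.

```python
import csv
from pathlib import Path


def write_samplesheet(bam_paths, out_path):
    """Write a sample,bam,bai CSV with one row per BAM file.

    The sample name is taken from the file stem (an assumption);
    the bai column is left empty, matching the example in this README.
    """
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample", "bam", "bai"])
        for bam in sorted(str(p) for p in bam_paths):
            writer.writerow([Path(bam).stem, bam, ""])


# Example:
# write_samplesheet(["/path/to/tumor1.bam", "/path/to/tumor2.bam"], "samplesheet.csv")
```

Sorting the paths keeps the file stable between runs, which makes diffs of the samplesheet easier to review.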
Documentation
📚 Full Documentation: https://msk-access.github.io/gbcms/
Quick Links:
- Installation
- CLI Quick Start
- Nextflow Workflow
- CLI Reference — DNA
- CLI Reference — RNA
- CLI Reference — Normalize
- Input Formats
- Output Formats
- Architecture
Contributing
See CONTRIBUTING.md for development guidelines.
To contribute to documentation, see the gh-pages branch.
Citation
If you use gbcms in your research, please cite:
Shah, R. et al. (2026). gbcms: A high-performance orientation-aware genotype counting system for genomic variants. Available at: https://github.com/msk-access/gbcms
BibTeX:
@software{pygbcms,
author = {Shah, Ronak and contributors},
title = {gbcms: A high-performance orientation-aware genotype counting system for genomic variants},
year = {2026},
url = {https://github.com/msk-access/gbcms},
note = {GitHub repository}
}
License
AGPL-3.0 - see LICENSE for details.
Support
- 🐛 Issues: https://github.com/msk-access/gbcms/issues
- 💬 Discussions: https://github.com/msk-access/gbcms/discussions