Skip to main content

Bulk Segregant Analysis for QTL mapping from pooled whole-genome sequencing

Project description

bsaseq

PyPI version License: MIT

Bulk Segregant Analysis for QTL mapping from pooled whole-genome sequencing data.

bsaseq identifies genomic loci controlling traits by comparing allele frequencies between phenotypically distinct bulked DNA pools. It provides a complete workflow from VCF input to annotated candidate genes.

Features

  • Multi-sample bulk support (pool technical replicates)
  • Sliding window analysis with tricube smoothing
  • Z-score and G-statistic for candidate region detection
  • Publication-quality genome-wide and regional plots
  • Optional snpEff integration for variant annotation
  • Gene-level candidate ranking

Installation

From PyPI (recommended)

pip install bsaseq

From source

git clone https://github.com/username/bsaseq.git
cd bsaseq
pip install -e .

Dependencies

Required:

  • Python 3.9-3.13
  • cyvcf2, numpy, scipy, matplotlib, click, rich

Optional:

  • snpEff (for variant annotation)

Quick start

# Basic analysis
bsaseq run \
    --vcf joint_calls.vcf.gz \
    --high-bulk mutant_pool \
    --low-bulk wildtype_pool \
    --out results/my_analysis

# With multiple samples per bulk
bsaseq run \
    --vcf joint_calls.vcf.gz \
    --high-bulk "mut_rep1,mut_rep2" \
    --low-bulk "wt_rep1,wt_rep2" \
    --out results/my_analysis

# With annotation
bsaseq run \
    --vcf joint_calls.vcf.gz \
    --high-bulk mutant_pool \
    --low-bulk wildtype_pool \
    --out results/my_analysis \
    --annotate \
    --snpeff-db Sorghum_bicolor

Commands

Command Description
bsaseq run Complete BSA analysis pipeline
bsaseq samples List sample names in VCF
bsaseq plot Regenerate plots from existing output
bsaseq annotate Annotate candidates with snpEff
bsaseq check-snpeff Verify snpEff installation

Output files

File Description
*_variants.tsv Per-variant allele frequencies
*_windows.tsv Sliding window statistics
*_regions.tsv Candidate genomic regions
*_regions.bed Candidate regions in BED format
*_candidates.tsv Filtered candidate variants
*_annotated_candidates.tsv Candidates with snpEff annotation
*_candidate_genes.tsv Gene-level summary
*_summary.txt Analysis summary report
*_genome_wide.png/pdf Genome-wide Manhattan plot
*_region_*.png/pdf Zoomed candidate region plots
*_af_distribution.png Allele frequency diagnostics
*_depth_distribution.png Read depth diagnostics

Parameters

Filtering

Parameter Default Description
--min-dp 10 Minimum read depth per sample
--max-dp 200 Maximum read depth (excludes repeats)
--min-gq 20 Minimum genotype quality
--min-qual 30 Minimum variant QUAL score

Window analysis

Parameter Default Description
--window-size 1000000 Window width in bp
--step-size 250000 Step between windows in bp
--min-variants 5 Minimum variants per window

Candidate detection

Parameter Default Description
--z-threshold 3.0 Z-score cutoff for significance
--mode recessive Inheritance mode (recessive/dominant)

Methodology

Delta allele frequency (delta-AF)

For each SNP, bsaseq calculates the difference in alternate allele frequency between the high bulk (mutant pool) and low bulk (wild-type pool):

delta_AF = AF_high - AF_low

For a recessive causal mutation:

  • In the mutant bulk: individuals are homozygous mutant, so AF ~ 1.0
  • In the wild-type bulk: individuals are homozygous reference, so AF ~ 0.0
  • Expected delta_AF ~ 1.0 at the causal locus

Sliding window analysis

Variants are analyzed in overlapping genomic windows to reduce noise:

  1. Tricube-weighted smoothing: Variants near window center contribute more
  2. G-statistic: Tests for significant allele frequency differences
  3. Z-score normalization: Genome-wide standardization for peak calling

Candidate region detection

Regions are identified where consecutive windows exceed the Z-score threshold. Adjacent significant regions within 500 kb are merged.

Candidate variant filtering

Within candidate regions, variants are filtered based on inheritance mode:

Mode min_delta_AF min_AF_high max_AF_low
Recessive 0.8 0.9 0.1
Dominant 0.3 0.4 0.1

Input requirements

The VCF file must contain:

  • Biallelic SNPs (multiallelic and indels are skipped)
  • AD (allelic depth) FORMAT field for allele frequency calculation
  • GQ (genotype quality) FORMAT field for filtering (optional but recommended)

Recommended variant calling:

# GATK HaplotypeCaller (recommended)
gatk HaplotypeCaller -R ref.fa -I mutant_pool.bam -I wildtype_pool.bam -O calls.vcf.gz

# bcftools mpileup (alternative)
bcftools mpileup -f ref.fa mutant_pool.bam wildtype_pool.bam | bcftools call -mv -O z -o calls.vcf.gz

Examples

Example 1: Basic recessive mutation mapping

bsaseq run \
    --vcf pooled_calls.vcf.gz \
    --high-bulk mutant \
    --low-bulk wildtype \
    --out results/analysis \
    --mode recessive

Example 2: Dominant trait with relaxed thresholds

bsaseq run \
    --vcf pooled_calls.vcf.gz \
    --high-bulk affected \
    --low-bulk unaffected \
    --out results/dominant \
    --mode dominant \
    --z-threshold 2.5 \
    --min-dp 5

Example 3: High-coverage data with annotation

bsaseq run \
    --vcf deep_seq.vcf.gz \
    --high-bulk mut1,mut2 \
    --low-bulk wt1,wt2 \
    --out results/annotated \
    --min-dp 30 \
    --max-dp 500 \
    --annotate \
    --snpeff-db Arabidopsis_thaliana

Example 4: Regenerate plots with different format

bsaseq plot \
    --windows results/analysis_windows.tsv \
    --regions results/analysis_regions.tsv \
    --variants results/analysis_variants.tsv \
    --out new_plots \
    --format pdf

Development

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=bsaseq --cov-report=term-missing

# Type checking
mypy src/bsaseq

Citation

If you use bsaseq in your research, please cite:

bsaseq: A tool for Bulk Segregant Analysis of pooled sequencing data (in preparation)

License

MIT License - see LICENSE for details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bsaseq-1.0.0.tar.gz (69.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

bsaseq-1.0.0-py3-none-any.whl (60.7 kB view details)

Uploaded Python 3

File details

Details for the file bsaseq-1.0.0.tar.gz.

File metadata

  • Download URL: bsaseq-1.0.0.tar.gz
  • Upload date:
  • Size: 69.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for bsaseq-1.0.0.tar.gz
Algorithm Hash digest
SHA256 25ff5d4a4d2985bc2873c333807b8b71551a1c797ddc71f81239987d7b32f33e
MD5 a42d4068aaa6d7f5c202c6d5ceba5ede
BLAKE2b-256 dcccfef44b36206607501414981e8595f8cac0ae0b7cdb8e30e790fdaac0fe2c

See more details on using hashes here.

File details

Details for the file bsaseq-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: bsaseq-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 60.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for bsaseq-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 d7199dba6cd6f14d83d6214265f7ee9d3e9d92115b9594be36d74822870ad51f
MD5 ff3ce62e8bc1ce104e370bccb1799e36
BLAKE2b-256 b0302de2b855e1fcffa6c5b35fb245ef1e87157eb24cff71bb3f20179df3cc53

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page