Bulk Segregant Analysis for QTL mapping from pooled whole-genome sequencing
bsaseq
bsaseq identifies genomic loci controlling traits by comparing allele frequencies between phenotypically distinct bulked DNA pools. It provides a complete workflow from VCF input to annotated candidate genes.
Features
- Multi-sample bulk support (pool technical replicates)
- Sliding window analysis with tricube smoothing
- Z-score and G-statistic for candidate region detection
- Publication-quality genome-wide and regional plots
- Optional snpEff integration for variant annotation
- Gene-level candidate ranking
Installation
From PyPI (recommended)
pip install bsaseq
From source
git clone https://github.com/username/bsaseq.git
cd bsaseq
pip install -e .
Dependencies
Required:
- Python 3.9-3.13
- cyvcf2, numpy, scipy, matplotlib, click, rich
Optional:
- snpEff (for variant annotation)
Quick start
# Basic analysis
bsaseq run \
--vcf joint_calls.vcf.gz \
--high-bulk mutant_pool \
--low-bulk wildtype_pool \
--out results/my_analysis
# With multiple samples per bulk
bsaseq run \
--vcf joint_calls.vcf.gz \
--high-bulk "mut_rep1,mut_rep2" \
--low-bulk "wt_rep1,wt_rep2" \
--out results/my_analysis
# With annotation
bsaseq run \
--vcf joint_calls.vcf.gz \
--high-bulk mutant_pool \
--low-bulk wildtype_pool \
--out results/my_analysis \
--annotate \
--snpeff-db Sorghum_bicolor
Commands
| Command | Description |
|---|---|
| `bsaseq run` | Complete BSA analysis pipeline |
| `bsaseq samples` | List sample names in VCF |
| `bsaseq plot` | Regenerate plots from existing output |
| `bsaseq annotate` | Annotate candidates with snpEff |
| `bsaseq check-snpeff` | Verify snpEff installation |
Output files
| File | Description |
|---|---|
| `*_variants.tsv` | Per-variant allele frequencies |
| `*_windows.tsv` | Sliding window statistics |
| `*_regions.tsv` | Candidate genomic regions |
| `*_regions.bed` | Candidate regions in BED format |
| `*_candidates.tsv` | Filtered candidate variants |
| `*_annotated_candidates.tsv` | Candidates with snpEff annotation |
| `*_candidate_genes.tsv` | Gene-level summary |
| `*_summary.txt` | Analysis summary report |
| `*_genome_wide.png/pdf` | Genome-wide Manhattan plot |
| `*_region_*.png/pdf` | Zoomed candidate region plots |
| `*_af_distribution.png` | Allele frequency diagnostics |
| `*_depth_distribution.png` | Read depth diagnostics |
Parameters
Filtering
| Parameter | Default | Description |
|---|---|---|
| `--min-dp` | 10 | Minimum read depth per sample |
| `--max-dp` | 200 | Maximum read depth (excludes repeats) |
| `--min-gq` | 20 | Minimum genotype quality |
| `--min-qual` | 30 | Minimum variant QUAL score |
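As a rough illustration of how the four filtering defaults above might combine, here is a minimal Python sketch. It is not bsaseq's code: the function name `passes_filters`, the inclusive boundaries, and the choice to skip the GQ check when the field is absent are all assumptions made for this example.

```python
from typing import Optional


def passes_filters(
    qual: float,
    depth: int,
    gq: Optional[int],
    min_dp: int = 10,
    max_dp: int = 200,
    min_gq: int = 20,
    min_qual: float = 30,
) -> bool:
    """Return True if one sample's call at one site survives all filters.

    Hypothetical helper; the defaults mirror the table above.
    """
    if qual < min_qual:
        return False  # low-confidence variant call
    if not (min_dp <= depth <= max_dp):
        return False  # too shallow, or suspiciously deep (possible repeat)
    if gq is not None and gq < min_gq:
        return False  # assumption: the GQ check is skipped when GQ is absent
    return True


print(passes_filters(qual=50, depth=40, gq=99))   # well-supported call
print(passes_filters(qual=50, depth=350, gq=99))  # depth above --max-dp
```

Raising `--max-dp` together with `--min-dp` keeps the accepted depth band centred on the expected coverage when sequencing is deeper.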
Window analysis
| Parameter | Default | Description |
|---|---|---|
| `--window-size` | 1000000 | Window width in bp |
| `--step-size` | 250000 | Step between windows in bp |
| `--min-variants` | 5 | Minimum variants per window |
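To make the interaction between window width, step, and the per-window variant minimum concrete, the following Python sketch enumerates overlapping windows along one chromosome and keeps those with enough variants. It is only a sketch: the helper names, the 0-based half-open coordinates, and the clipping of the last windows at the chromosome end are assumptions, not a description of bsaseq internals.

```python
from bisect import bisect_left


def sliding_windows(chrom_length, window_size=1_000_000, step_size=250_000):
    """Yield (start, end) pairs, 0-based and half-open, clipped to the chromosome end."""
    start = 0
    while start < chrom_length:
        yield start, min(start + window_size, chrom_length)
        start += step_size


def windows_with_enough_variants(positions, chrom_length, min_variants=5, **kwargs):
    """Keep windows holding at least min_variants of the sorted variant positions."""
    kept = []
    for start, end in sliding_windows(chrom_length, **kwargs):
        # Count positions p with start <= p < end using binary search.
        n = bisect_left(positions, end) - bisect_left(positions, start)
        if n >= min_variants:
            kept.append((start, end, n))
    return kept


# Ten evenly spaced variants on a hypothetical 1.6 Mb chromosome.
positions = [100_000 * i for i in range(1, 11)]
for window in windows_with_enough_variants(positions, 1_600_000):
    print(window)
```

With the defaults, the step is a quarter of the window width, so most positions are covered by four overlapping windows.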
Candidate detection
| Parameter | Default | Description |
|---|---|---|
| `--z-threshold` | 3.0 | Z-score cutoff for significance |
| `--mode` | recessive | Inheritance mode (recessive/dominant) |
Methodology
Delta allele frequency (delta-AF)
For each SNP, bsaseq calculates the difference in alternate allele frequency between the high bulk (mutant pool) and low bulk (wild-type pool):
delta_AF = AF_high - AF_low
For a recessive causal mutation:
- In the mutant bulk: individuals are homozygous mutant, so AF ~ 1.0
- In the wild-type bulk: individuals are homozygous reference, so AF ~ 0.0
- Expected delta_AF ~ 1.0 at the causal locus
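The calculation above can be sketched in a few lines of Python. The example assumes allele frequencies are derived from the AD field as ALT reads over total reads, and that replicate samples within a bulk are pooled by summing their read counts. The pooling rule and all function names are assumptions made for illustration.

```python
def pooled_ad(sample_ads):
    """Sum (ref, alt) read counts across the samples of one bulk (assumed pooling rule)."""
    ref = sum(ad[0] for ad in sample_ads)
    alt = sum(ad[1] for ad in sample_ads)
    return ref, alt


def alt_allele_frequency(ref_depth, alt_depth):
    """Alternate allele frequency from allelic depths."""
    total = ref_depth + alt_depth
    if total == 0:
        raise ValueError("no reads cover this site")
    return alt_depth / total


def delta_af(high_ad, low_ad):
    """delta_AF = AF_high - AF_low for one biallelic SNP."""
    return alt_allele_frequency(*high_ad) - alt_allele_frequency(*low_ad)


# A site close to a recessive causal mutation: the mutant bulk is almost
# entirely ALT, the wild-type bulk almost entirely REF.
high = pooled_ad([(1, 24), (0, 25)])   # two mutant replicates
low = pooled_ad([(29, 1), (20, 0)])    # two wild-type replicates
print(round(delta_af(high, low), 3))   # 0.96
```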
Sliding window analysis
Variants are analyzed in overlapping genomic windows to reduce noise:
- Tricube-weighted smoothing: Variants near window center contribute more
- G-statistic: Tests for significant allele frequency differences
- Z-score normalization: Genome-wide standardization for peak calling
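For readers unfamiliar with the first two components, here is a hedged Python sketch of a tricube kernel and a G-test on the 2x2 table of read counts (bulk by allele). It shows the standard textbook forms. The function names are hypothetical, and bsaseq's exact formulation may differ, for example in how the half-width relates to `--window-size`.

```python
import math


def tricube_weight(position, center, half_width):
    """Tricube kernel: 1 at the window center, falling to 0 at the edges."""
    d = abs(position - center) / half_width
    return (1 - d**3) ** 3 if d < 1 else 0.0


def smoothed_delta_af(positions, delta_afs, center, half_width):
    """Tricube-weighted mean of delta-AF values around one window center."""
    weights = [tricube_weight(p, center, half_width) for p in positions]
    total = sum(weights)
    if total == 0:
        return None  # no variant falls strictly inside the window
    return sum(w * v for w, v in zip(weights, delta_afs)) / total


def g_statistic(high_ref, high_alt, low_ref, low_alt):
    """G-test of independence: G = 2 * sum(O * ln(O / E)) over the four cells."""
    observed = [[high_ref, high_alt], [low_ref, low_alt]]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    if n == 0:
        return 0.0
    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                # O / E, where E = row_total * col_total / n
                g += o * math.log(o * n / (row_totals[i] * col_totals[j]))
    return 2 * g


# Identical allele counts in both bulks give G = 0; divergent counts give a large G.
print(g_statistic(25, 25, 25, 25))
print(g_statistic(5, 45, 45, 5) > g_statistic(20, 30, 30, 20))
```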
Candidate region detection
Regions are identified where consecutive windows exceed the Z-score threshold. Adjacent significant regions within 500 kb are merged.
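The region-calling step described above can be sketched as follows. This is an illustrative Python version rather than the actual implementation: it assumes windows arrive sorted by chromosome and start, that the threshold comparison is inclusive, and that Z-scores use the population standard deviation across all windows. The function names are hypothetical.

```python
import statistics


def z_scores(values):
    """Standardize window statistics against their genome-wide mean and SD."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mean) / sd for v in values]


def call_regions(windows, z_threshold=3.0, merge_gap=500_000):
    """Merge significant windows into regions.

    windows: iterable of (chrom, start, end, z), sorted by chrom then start.
    Significant windows that overlap, or lie within merge_gap bp of the
    previous region on the same chromosome, are joined into one region.
    """
    regions = []
    for chrom, start, end, z in windows:
        if z < z_threshold:
            continue
        last = regions[-1] if regions else None
        if last and last[0] == chrom and start - last[2] <= merge_gap:
            last[2] = max(last[2], end)  # extend the open region
        else:
            regions.append([chrom, start, end])
    return [tuple(r) for r in regions]


windows = [
    ("chr1", 0, 1_000_000, 3.5),
    ("chr1", 250_000, 1_250_000, 4.0),
    ("chr1", 500_000, 1_500_000, 1.0),
    ("chr1", 1_750_000, 2_750_000, 3.2),
    ("chr1", 5_000_000, 6_000_000, 3.1),
    ("chr2", 0, 1_000_000, 3.3),
]
for region in call_regions(windows):
    print(region)
```

In this toy input, the first two windows overlap and the fourth starts exactly 500 kb after their merged end, so all three collapse into a single region on chr1.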
Candidate variant filtering
Within candidate regions, variants are filtered based on inheritance mode:
| Mode | min_delta_AF | min_AF_high | max_AF_low |
|---|---|---|---|
| Recessive | 0.8 | 0.9 | 0.1 |
| Dominant | 0.3 | 0.4 | 0.1 |
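The thresholds in the table translate into a simple per-variant test. The sketch below is one plausible reading in Python. It assumes all three conditions must hold together and that the boundaries are inclusive; `ModeThresholds` and `is_candidate` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeThresholds:
    min_delta_af: float
    min_af_high: float
    max_af_low: float


# Values copied from the table above.
THRESHOLDS = {
    "recessive": ModeThresholds(min_delta_af=0.8, min_af_high=0.9, max_af_low=0.1),
    "dominant": ModeThresholds(min_delta_af=0.3, min_af_high=0.4, max_af_low=0.1),
}


def is_candidate(af_high, af_low, mode="recessive"):
    """Return True if a variant inside a candidate region passes the mode's thresholds."""
    t = THRESHOLDS[mode]
    return (
        af_high - af_low >= t.min_delta_af
        and af_high >= t.min_af_high
        and af_low <= t.max_af_low
    )


print(is_candidate(0.95, 0.05))                  # True: near-fixed in the mutant bulk
print(is_candidate(0.50, 0.05))                  # False under the recessive model
print(is_candidate(0.50, 0.05, mode="dominant")) # True under the dominant model
```

The dominant thresholds are looser because affected individuals can be heterozygous, so the alternate allele need not approach fixation in the high bulk.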
Input requirements
The VCF file must contain:
- Biallelic SNPs (multiallelic sites and indels are skipped)
- AD (allelic depth) FORMAT field for allele frequency calculation
- GQ (genotype quality) FORMAT field for filtering (optional but recommended)
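Before running an analysis, it can be useful to confirm that the VCF header declares these FORMAT fields. The following standard-library Python sketch does that. It is a pre-flight check written for this README, not part of bsaseq, and the function names are hypothetical. A declared field is not guaranteed to be present on every record.

```python
import gzip


def declared_format_fields(lines):
    """Collect FORMAT IDs declared in the '##' meta-information lines of a VCF."""
    prefix = "##FORMAT=<ID="
    fields = set()
    for line in lines:
        if not line.startswith("##"):
            break  # meta-information ends at the #CHROM header line
        if line.startswith(prefix):
            fields.add(line[len(prefix):].split(",", 1)[0].rstrip(">\n"))
    return fields


def check_vcf(path):
    """Report whether AD (required) and GQ (recommended) are declared in a VCF file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        fields = declared_format_fields(handle)
    return {"AD": "AD" in fields, "GQ": "GQ" in fields}


# In-memory example; for a real file, call check_vcf("joint_calls.vcf.gz").
example_header = [
    "##fileformat=VCFv4.2\n",
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n',
    '##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">\n',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tmutant_pool\n",
]
print(sorted(declared_format_fields(example_header)))  # ['AD', 'GT']
```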
Recommended variant calling:
# GATK HaplotypeCaller (recommended)
gatk HaplotypeCaller -R ref.fa -I mutant_pool.bam -I wildtype_pool.bam -O calls.vcf.gz
# bcftools mpileup (alternative); -a AD,DP adds the per-sample AD field that bsaseq requires
bcftools mpileup -f ref.fa -a AD,DP mutant_pool.bam wildtype_pool.bam | bcftools call -mv -O z -o calls.vcf.gz
Examples
Example 1: Basic recessive mutation mapping
bsaseq run \
--vcf pooled_calls.vcf.gz \
--high-bulk mutant \
--low-bulk wildtype \
--out results/analysis \
--mode recessive
Example 2: Dominant trait with relaxed thresholds
bsaseq run \
--vcf pooled_calls.vcf.gz \
--high-bulk affected \
--low-bulk unaffected \
--out results/dominant \
--mode dominant \
--z-threshold 2.5 \
--min-dp 5
Example 3: High-coverage data with annotation
bsaseq run \
--vcf deep_seq.vcf.gz \
--high-bulk mut1,mut2 \
--low-bulk wt1,wt2 \
--out results/annotated \
--min-dp 30 \
--max-dp 500 \
--annotate \
--snpeff-db Arabidopsis_thaliana
Example 4: Regenerate plots with different format
bsaseq plot \
--windows results/analysis_windows.tsv \
--regions results/analysis_regions.tsv \
--variants results/analysis_variants.tsv \
--out new_plots \
--format pdf
Development
# Install with dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
# Run with coverage
pytest tests/ --cov=bsaseq --cov-report=term-missing
# Type checking
mypy src/bsaseq
Citation
If you use bsaseq in your research, please cite:
bsaseq: A tool for Bulk Segregant Analysis of pooled sequencing data (in preparation)
License
MIT License - see LICENSE for details.