Allele-specific analysis of next-generation sequencing data with high-performance multi-format variant support (VCF/cyvcf2/PGEN)
Project description
WASP2: Allele-specific pipeline for unbiased read mapping and allelic-imbalance analysis
Requirements
- Python >= 3.10
- numpy
- pandas
- polars
- scipy
- pysam
- pybedtools
- typer
- anndata
- Optional: cyvcf2 (for high-performance VCF parsing - ~7x faster)
- Optional: Pgenlib (for PLINK2 PGEN format support - ~25x faster variant I/O)
- Rust extension (PyO3) built locally; the Python CLI now routes counting, mapping, and analysis through Rust. Build it after creating the conda env:
conda activate WASP2 export LIBCLANG_PATH=$CONDA_PREFIX/lib export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH export BINDGEN_EXTRA_CLANG_ARGS="-I/usr/include" (cd rust && maturin develop --release)
Installation
Quick Start: GitHub Codespaces ☁️
Get started in seconds with a fully configured cloud development environment:
- Click "Code" → "Codespaces" → "Create codespace"
- Wait 2-3 minutes for setup
- Start using WASP2 - all dependencies pre-installed!
See .devcontainer/README.md for details.
Local Installation
Option 1: pip (Easiest - Pre-built wheels)
pip install wasp2
Pre-built wheels are available for Linux (x86_64, aarch64) and macOS (Intel, Apple Silicon).
Option 2: Conda (Recommended for bioinformatics tools)
# Create environment with bioinformatics tools (samtools, bcftools, bedtools, etc.)
conda env create -f environment.yml
conda activate WASP2
# Build Rust extension
export LIBCLANG_PATH=$CONDA_PREFIX/lib
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
export BINDGEN_EXTRA_CLANG_ARGS="-I/usr/include"
maturin develop --release -m rust/Cargo.toml
Option 3: Development Install (From source)
# Clone the repository
git clone https://github.com/Jaureguy760/WASP2-final.git
cd WASP2-final
# Install system dependencies (Ubuntu/Debian)
sudo apt-get install samtools bcftools bedtools bwa libclang-dev
# Install Rust if not already installed
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Install Python package with dev dependencies
pip install -e ".[dev]"
# Build Rust extension
maturin develop --release -m rust/Cargo.toml
Option 4: Docker (Recommended for reproducibility)
# Pull from GitHub Container Registry
docker pull ghcr.io/jaureguy760/wasp2-final:latest
# Run WASP2 commands
docker run --rm -v /path/to/data:/data ghcr.io/jaureguy760/wasp2-final:latest \
wasp2-count count-variants /data/reads.bam /data/variants.vcf.gz -o /data/counts.tsv
# Interactive shell
docker run --rm -it -v /path/to/data:/data ghcr.io/jaureguy760/wasp2-final:latest bash
Option 5: Singularity/Apptainer (HPC environments)
# Pull from GHCR and rename
singularity pull wasp2.sif docker://ghcr.io/jaureguy760/wasp2-final:latest
# Run WASP2 commands
singularity exec wasp2.sif wasp2-count count-variants reads.bam variants.vcf.gz
# Bind mount data directories for HPC
singularity exec --bind /scratch:/scratch wasp2.sif \
wasp2-analyze find-imbalance /scratch/counts.tsv -o /scratch/results.tsv
Nextflow Integration
# Use container profile for reproducible pipelines
nextflow run main.nf -profile docker # Uses Docker
nextflow run main.nf -profile singularity # Uses Singularity
Verify Installation
wasp2-count --help
wasp2-map --help
wasp2-analyze --help
python -c "import wasp2_rust; print('Rust extension OK')"
Supported Variant File Formats
WASP2 supports multiple variant file formats through a unified interface:
| Format | Extensions | Notes | Performance |
|---|---|---|---|
| VCF (pysam) | .vcf, .vcf.gz, .vcf.bgz |
Standard text format | Baseline |
| VCF (cyvcf2) | .vcf, .vcf.gz, .vcf.bgz |
High-performance parser | ~7x faster |
| BCF | .bcf, .bcf.gz |
Binary VCF format | 5-8x faster |
| PGEN | .pgen |
PLINK2 binary format | ~25x faster |
Performance Recommendations
Choose your variant format based on your workflow:
- PGEN (Fastest - ~25x) - Best for large variant datasets, genotype-only workflows
- cyvcf2 (Fast - ~7x) - High-performance VCF parsing, drop-in replacement for pysam
- BCF (Fast - 5-8x) - Good balance of speed and compatibility, preserves all VCF fields
- VCF.gz (pysam) - Most compatible, use when sharing data or using other tools
cyvcf2 Support (High-Performance VCF Parsing)
For faster VCF parsing without changing file formats:
# Install cyvcf2 (optional dependency)
pip install wasp2[cyvcf2]
# or: pip install cyvcf2
# Use same VCF files - automatically uses cyvcf2 if installed
wasp2-count count-variants reads.bam variants.vcf.gz -s sample1 -r regions.bed
# Explicit usage in Python
from wasp2.io.cyvcf2_source import CyVCF2Source
with CyVCF2Source("variants.vcf.gz") as source:
for variant in source.iter_variants(het_only=True):
... # ~7x faster than pysam!
Benchmark results show ~7x speedup for VCF parsing with cyvcf2 vs pysam on large files.
📖 See docs/VCF_PERFORMANCE.md for detailed benchmarks and usage guide.
BCF Support (Recommended for Standard Workflows)
BCF (binary VCF) provides 5-8x faster parsing than compressed VCF with no loss of information:
# Convert VCF to BCF using bcftools
bcftools view variants.vcf.gz -Ob -o variants.bcf
bcftools index variants.bcf
# Use BCF directly in WASP2
wasp2-count count-variants reads.bam variants.bcf -s sample1 -r regions.bed
PGEN Support (Recommended for Large Datasets)
For optimal performance with large variant datasets, use PLINK2's PGEN format:
# Install pgenlib (optional dependency)
pip install wasp2[plink]
# or: pip install Pgenlib
# Convert VCF to PGEN using plink2
plink2 --vcf variants.vcf.gz --make-pgen --out variants
# Use PGEN directly in WASP2
wasp2-count count-variants reads.bam variants.pgen -s sample1 -r regions.bed
Benchmark results show ~25x speedup for variant I/O operations with PGEN vs VCF.
Quick CLI usage
- Count:
wasp2-count count-variants BAM VARIANTS --regions BED --out_file counts.tsv - Map filter:
wasp2-map filter-remapped ORIGINAL.bam REMAPPED.bam OUT.bam - Analyze:
wasp2-analyze find-imbalance counts.tsv --out_file ai_results.tsv
Minimal API (Rust-backed)
from counting.run_counting import run_count_variants
from mapping.filter_remap_reads import filt_remapped_reads
from analysis.run_analysis import run_ai_analysis
# Supports VCF, BCF, or PGEN variant files
run_count_variants(bam_file="sample.bam", variant_file="variants.pgen", region_file="regions.bed")
filt_remapped_reads("orig.bam", "remap.bam", "keep.bam", threads=4)
run_ai_analysis("counts.tsv", out_file="ai_results.tsv")
Validation
Baselines are generated on-demand. With the extension built:
export PYTHONPATH=$PWD
python validation/generate_baselines.py
python validation/compare_to_baseline.py
This runs counting, mapping, and analysis on the small chr10 test bundle and checks parity.
Publish to private index (example)
maturin build --release -m rust/Cargo.toml
pip install twine
twine upload --repository-url https://<private-index>/simple dist/*.whl
# Install:
pip install --index-url https://<private-index>/simple wasp2
API Documentation
Build Sphinx docs locally:
pip install sphinx sphinx-rtd-theme sphinx-autodoc-typehints
cd docs && make html
# Open docs/build/html/index.html in a browser
Allelic Imbalance Analysis
Analysis pipeline currently consists of two tools (Count and Analysis)
Count Tool
Process allele specific read counts per SNP.
Sample names can be provided in order to filter out non-heterozygous SNPs.
Genes and ATAC-seq peaks can also be provided to include SNPs that overlap regions of interest.
Providing samples and regions is highly recommended for allelic-imbalance analysis
Usage
wasp2-count count-variants [BAM] [VARIANTS] {OPTIONS}
Required Arguments
- BAM file containing aligned reads.
- VARIANTS file containing SNP info (VCF, BCF, or PGEN format)
Optional Arguments
- -s/--samples: Filter SNPs whose genotypes are heterozygous in one or more samples. Accepts comma delimited string, or file with one sample per line.
- -r/--region: Filter SNPs that overlap peaks/regions of interest. Accepts files in narrowPeak, BED, gtf and gff3 format.
- -o/--out_file: Output file for counts. Defaults to counts.tsv
- -t/--temp_loc: Write intermediary files to a directory instead of deleting. Useful for debugging issues.
- --use_region_names: If regions are provided use region names as identifiers instead of coordinates. Names are denoted in fourth column of BED. Ignored if no name column in BED file.
RNA-Seq Specific Arguments
- --gene_feature: Feature type in gtf/gff3 for counting intersecting SNPs. Defaults to 'exon' for snp counting.
- --gene_attribute: Attribute name from gtf/gff3 attribute column to use as ID. Defaults to '_id' in gtf and 'ID' in gff3.
- --gene_parent: Parent attribute in gtf/gff3 for feature used in counting. Defaults to 'transcript_id' in gtf and 'Parent' in gff3.
Analysis Tool
Analyzes Allelic Imbalance per ATAC peak given allelic count data
Usage
wasp2-analyze find-imbalance [COUNTS] {OPTIONS}
Required Arguments
- COUNTS: Output data from count tool
Optional Arguments
- -o/--out_file: Output file to write analysis results to. (Default. ai_results.tsv)
- --min: Minimum allele count needed for analysis. (Default. 10)
- -p/--pseudocount: Pseudocount added when measuring allelic imbalance. (Default. 1)
- --phased: Calculate allelic imbalance using phased haplotype model. By default, calculates AI assuming unphased/equal likelihood for each haplotype.
- --region_col: Name of region column for current data. Use 'region' for ATAC-seq. Plans for 'genes' for RNA-seq and 'SNP' for per SNP. Recommended to leave blank. (Default: Auto-parses if none provided)
- --groupby: Report allelic imbalance by parent group instead of feature level in RNA-seq counts. Name of parent column. Not valid if no parent column or if using ATAC-seq peaks. (Default: Report by feature level instead of parent level)
Unbiased Allele-Specific Read Mapping
Mappability filtering pipeline for correcting allelic mapping biases.
First, reads are mapped normally using a mapper chosen by the user (output as BAM). Then mapped reads that overlap single nucleotide polymorphisms (SNPs) are identified. For each read that overlaps a SNP, its genotype is swapped with that of the other allele and the read is re-mapped. Re-mapped reads that fail to map to exactly the same location in the genome are discarded.
Step 1: Create Reads for Remapping
This step identifies reads that overlap snps and creates reads with swapped alleles.
Usage
wasp2-map make-reads [BAM] [VARIANTS] {OPTIONS}
Required Arguments
- BAM file containing aligned reads.
- VARIANTS file containing SNP info (VCF, BCF, or PGEN format)
Optional Arguments
- --threads: Threads to allocate.
- -s/--samples: Filter Polymorphic SNPs in one or more samples. Accepts comma delimited string, or file with one sample per line.
- -o/--out_dir: Output directory for data to be remapped
- -t/--temp_loc: Write intermediary files to a directory instead of deleting. Useful for debugging issues.
- -j/--out_json: Output json containing wasp file info to this file instead of default. Defaults to [BAM_PREFIX]_wasp_data_files.json
Step 2: Remap Reads
Remap fastq reads using mapping software of choice
Example
bwa mem -M "BWAIndex/genome.fa" "${prefix}_swapped_alleles_r1.fq" "${prefix}_swapped_alleles_r2.fq" | samtools view -S -b -h -F 4 - > "${prefix}_remapped.bam"
samtools sort -o "${prefix}_remapped.bam" "${prefix}_remapped.bam"
samtools index "${prefix}_remapped.bam"
Step 3: Filter Reads that Fail to Remap
Identify and remove reads that failed to remap to the same position. Creates allelic-unbiased bam file
Usage
wasp2-map filter-remapped "${prefix}_remapped.bam" --json "${prefix}_wasp_data_files.json"
OR
wasp2-map filter-remapped "${prefix}_remapped.bam" "${prefix}_to_remap.bam" "${prefix}_keep.bam"
Required Arguments
- Remapped BAM File
- Either: json or to_remap_bam + keep.bam
- -j/--json: json containing wasp file info. Default output from make-reads: [BAM_PREFIX]_wasp_data_files.json
- to_remap_bam: to_remap_bam used to generate swapped alleles. Default: [BAM_PREFIX]_to_remap.bam
- keep_bam: BAM containing reads that were not remapped. Default: [BAM_PREFIX]_keep.bam
Optional Arguments
- --threads: Threads to allocate.
- -o/--out_bam: File to write filtered bam. Defaults to [BAM_PREFIX]_wasp_filt.bam.
- --remap_keep_bam: Output bam file with kept reads to this file if provided.
- --remap_keep_file: Output txt file with kept reads names to this file if provided.
Single-Cell Allelic Counts
Process allele specific read counts for single-cell datasets.
Output counts as anndata containing cell x SNP count matrix.
Usage
wasp2-count count-variants-sc [BAM] [VARIANTS] [BARCODES] {OPTIONS}
Required Arguments
- BAM file containing aligned reads.
- VARIANTS file containing SNP info (VCF, BCF, or PGEN format)
- BARCODE file used as index, contains one cell barcode per line
Optional Arguments
- -s/--samples: Filter SNPs whose genotypes are heterozygous in one or more samples. Accepts comma delimited string, or file with one sample per line. RECOMMENDED TO USE ONE SAMPLE AT A TIME.
- -f/--feature: Features used in single-cell experiment. Filter SNPs that overlap regions/features of interest. Accepts BED formatted files.
- -o/--out_file: Output file for counts. Defaults to allele_counts.h5ad
- -t/--temp_loc: Write intermediary files to a directory instead of deleting. Useful for debugging issues.
Single-Cell Allelic Imbalance
Estimate allele-specific chromatin acccessibility using single-cell allelic counts.
Allelic-Imbalance is estimated on a per-celltype basis.
Usage
wasp2-analyze find-imbalance-sc [COUNTS] [BARCODE_MAP] {OPTIONS}
Required Arguments
- COUNTS file (.h5ad) containing matrix of single-cell allelic counts.
- BARCODE MAP: Two column TSV file mapping specific cell barcodes to some group/celltype.
Each line following format ... [BARCODE] \t [CELLTYPE]
Optional Arguments
- -o/--out_file: Output file to write analysis results to. (Default. ai_results_[GROUP].tsv)
- --min: Minimum allele count needed for analysis. (Default. 10)
- -p/--pseudocount: Pseudocount added when measuring allelic imbalance. (Default. 1)
- -s/--sample: Use het genotypes for this sample in count matrix. Automatically parse if data contains 0 or 1 sample. REQUIRED IF MULTIPLE SAMPLES IN DATA.
- --phased: Calculate allelic imbalance using phased haplotype model. By default, calculates AI assuming unphased/equal likelihood for each haplotype.
- --unphased: Explicitly use unphased model.
- -z/--z_cutoff: Remove SNPS and associated regions whose counts exceed Z-score cutoff. Extra layer of QC for single-cell allelic counts
Single-Cell Comparative Imbalance
Compare differential allelic-imbalance between celltypes/groups.
Usage
wasp2-analyze compare-imbalance [COUNTS] [BARCODE_MAP] {OPTIONS}
Required Arguments
- COUNTS file (.h5ad) containing matrix of single-cell allelic counts.
- BARCODE MAP: Two column TSV file mapping specific cell barcodes to some group/celltype.
Each line following format ... [BARCODE] \t [CELLTYPE]
Optional Arguments
- -o/--out_file: Output file to write analysis results to. (Default. ai_results_[GROUP1]_[GROUP2].tsv)
- --groups/--celltypes: Specific groups in barcode map to compare differential allelic imbalance. If providing input requires 2 groups minimum, otherwise compare all group combinations.
- --min: Minimum allele count needed for analysis. (Default. 10)
- -p/--pseudocount: Pseudocount added when measuring allelic imbalance. (Default. 1)
- -s/--sample: Use het genotypes for this sample in count matrix. Automatically parse if data contains 0 or 1 sample. REQUIRED IF MULTIPLE SAMPLES IN DATA.
- --phased: Calculate allelic imbalance using phased haplotype model. By default, calculates AI assuming unphased/equal likelihood for each haplotype.
- --unphased: Explicitly use unphased model.
- -z/--z_cutoff: Remove SNPS and associated regions whose counts exceed Z-score cutoff. Extra layer of QC for single-cell allelic counts
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distributions
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file wasp2-1.3.0.tar.gz.
File metadata
- Download URL: wasp2-1.3.0.tar.gz
- Upload date:
- Size: 172.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
e0b90e013cf86d1d1b651830c12677e4845be5e51e08fde39024a7e883f9c036
|
|
| MD5 |
7797ee651c0ac55000b774ec8accd34e
|
|
| BLAKE2b-256 |
297ef622ebe5fd588080da01a68d27442328e5550194b152a6164de07dbb427c
|
File details
Details for the file wasp2-1.3.0-cp312-cp312-macosx_11_0_arm64.whl.
File metadata
- Download URL: wasp2-1.3.0-cp312-cp312-macosx_11_0_arm64.whl
- Upload date:
- Size: 1.2 MB
- Tags: CPython 3.12, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b7066719d7d59075c849908c58be2f0a5820a33ca003e7139f2808de9fe2f3a4
|
|
| MD5 |
ba8f91a7067c062c6b9fb6b84557578f
|
|
| BLAKE2b-256 |
35cf1120411ce214f209dab20e929f87712983964d1eaf13e06d97cfb5a4f7fe
|
File details
Details for the file wasp2-1.3.0-cp312-cp312-macosx_10_12_x86_64.whl.
File metadata
- Download URL: wasp2-1.3.0-cp312-cp312-macosx_10_12_x86_64.whl
- Upload date:
- Size: 1.2 MB
- Tags: CPython 3.12, macOS 10.12+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2920ab0b32d0ecf561c953e0ee6e93648d33291e486187e95c31d7ae62c33196
|
|
| MD5 |
8ad27dbe407e51314c01c8eb32c94c4e
|
|
| BLAKE2b-256 |
10aec7107b34ba2cd0bc5b9f1cac88bf0a9036ff8af2a3f47f3a01592df345ce
|
File details
Details for the file wasp2-1.3.0-cp311-cp311-macosx_11_0_arm64.whl.
File metadata
- Download URL: wasp2-1.3.0-cp311-cp311-macosx_11_0_arm64.whl
- Upload date:
- Size: 1.2 MB
- Tags: CPython 3.11, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
09c42482316d7d212cf19d8d8e5a9da859f2139570f1d2625f6e3014579e80ed
|
|
| MD5 |
b2b7d74760322d25fcb62083156c5b2b
|
|
| BLAKE2b-256 |
29fbffa71d68447d60fc6461cc31e37aefe9a087c54826d063e5bfd4ea9871c5
|
File details
Details for the file wasp2-1.3.0-cp311-cp311-macosx_10_12_x86_64.whl.
File metadata
- Download URL: wasp2-1.3.0-cp311-cp311-macosx_10_12_x86_64.whl
- Upload date:
- Size: 1.2 MB
- Tags: CPython 3.11, macOS 10.12+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
4d8489975fdb5e40a35ff6848dc76b2303371d0e38154d224dada8a40f18a4eb
|
|
| MD5 |
5363a99ddaa2c28e8014cdf64d56ee9b
|
|
| BLAKE2b-256 |
82351dc5b3f6998a54f2d92872496d1085b6e98dc15ab7769b8ed58a3c37c3bf
|
File details
Details for the file wasp2-1.3.0-cp310-cp310-macosx_11_0_arm64.whl.
File metadata
- Download URL: wasp2-1.3.0-cp310-cp310-macosx_11_0_arm64.whl
- Upload date:
- Size: 1.2 MB
- Tags: CPython 3.10, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b6052c25a21db93a0daa435d939a212fe80ae4dfd7755041c88d67c047e62aff
|
|
| MD5 |
b1f379216c6ad14b2e8179f602ecad16
|
|
| BLAKE2b-256 |
2aaae056d3443a7f9a1b3cde24037e2816fc1e4048ed5d7b48b1c311214fc657
|
File details
Details for the file wasp2-1.3.0-cp310-cp310-macosx_10_12_x86_64.whl.
File metadata
- Download URL: wasp2-1.3.0-cp310-cp310-macosx_10_12_x86_64.whl
- Upload date:
- Size: 1.2 MB
- Tags: CPython 3.10, macOS 10.12+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3dc911bcc302d4f03dcfac03776a6b8fe62b295c75113f04dd52602680a703e9
|
|
| MD5 |
f52f0c939a9f2690602b9fe792dafe8d
|
|
| BLAKE2b-256 |
3e088dba6cd2a90e74c755ee0a8c1bc3a988e063fb42c4bcfe108092802c7f1c
|
File details
Details for the file wasp2-1.3.0-cp39-cp39-macosx_11_0_arm64.whl.
File metadata
- Download URL: wasp2-1.3.0-cp39-cp39-macosx_11_0_arm64.whl
- Upload date:
- Size: 1.2 MB
- Tags: CPython 3.9, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
25905de8c2e10ac3c496225569f5695431503c043600b131a9b3a5f38fc68889
|
|
| MD5 |
78e288f746e7a0d094d1d817e11fa8d2
|
|
| BLAKE2b-256 |
0657f0c70f4be8cef891c8e22e642bfd29804108eae744740061f385ba5df7f7
|
File details
Details for the file wasp2-1.3.0-cp39-cp39-macosx_10_12_x86_64.whl.
File metadata
- Download URL: wasp2-1.3.0-cp39-cp39-macosx_10_12_x86_64.whl
- Upload date:
- Size: 1.2 MB
- Tags: CPython 3.9, macOS 10.12+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
5b4c3ee010bc1a702c0f6733e83a97fed07fc2532ea5d57e48cd02afa0f8d093
|
|
| MD5 |
621bab5b9faf577009028cd94038c60f
|
|
| BLAKE2b-256 |
84a6da6c8ea5d1e85dfabdf6e1556a1ce48534c2ca08dc43033d51709a4882ee
|