A CLI tool for processing and filtering bcftools tabulated TSV files with pedigree support
Project description
PyWombat 🦘
A high-performance CLI tool for processing and filtering bcftools tabulated TSV files with advanced filtering capabilities, pedigree support, and de novo mutation detection.
Features
✨ Fast Processing: Uses Polars for efficient data handling
🔬 Quality Filtering: Configurable depth, quality, and VAF thresholds
👨👩👧 Pedigree Support: Trio and family analysis with parent genotypes
🧬 De Novo Detection: Sex-chromosome-aware DNM identification
📊 Flexible Output: TSV, compressed TSV, or Parquet formats
🎯 Expression Filters: Complex filtering with logical expressions
🏷️ Boolean Flag Support: INFO field flags (PASS, DB, etc.) extracted as True/False columns
⚡ Streaming Mode: Memory-efficient processing of large files
Quick Start
One-Time Usage (Recommended for Most Users)
Use uvx to run PyWombat without installation:
# Basic formatting
uvx pywombat input.tsv -o output
# With filtering
uvx pywombat input.tsv -F examples/rare_variants_high_impact.yml -o output
# De novo mutation detection
uvx pywombat input.tsv --pedigree pedigree.tsv \
-F examples/de_novo_mutations.yml -o denovo
Installation for Development/Repeated Use
# Clone the repository
git clone https://github.com/bourgeron-lab/pywombat.git
cd pywombat
# Install with uv
uv sync
# Run with uv run
uv run wombat input.tsv -o output
What Does PyWombat Do?
PyWombat transforms bcftools tabulated TSV files into analysis-ready formats by:
- Expanding the
(null)INFO column: Extracts allNAME=valuefields (e.g.,DP=30;AF=0.5;AC=2) and boolean flags (e.g.,PASS,DB) into separate columns - Melting sample columns: Converts wide-format sample data into long format with one row per variant-sample combination
- Extracting genotype data: Parses
GT:DP:GQ:ADformat into separate columns with calculated VAF - Adding parent data: Joins father/mother genotypes when pedigree is provided
- Applying filters: Quality filters, expression-based filters, or de novo detection
Example Transformation
Input (Wide Format):
#CHROM POS REF ALT (null) Sample1:GT:DP:GQ:AD Sample2:GT:DP:GQ:AD
chr1 100 A T DP=30;AF=0.5;PASS;AC=2 0/1:15:99:5,10 1/1:18:99:0,18
Output (Long Format):
#CHROM POS REF ALT AC AF DP PASS sample sample_gt sample_dp sample_gq sample_ad sample_vaf
chr1 100 A T 2 0.5 30 true Sample1 0/1 15 99 10 0.6667
chr1 100 A T 2 0.5 30 true Sample2 1/1 18 99 18 1.0
Generated Columns:
- INFO fields with
=: Extracted as separate columns (e.g.,DP,AF,AC) - INFO boolean flags: Extracted as True/False columns (e.g.,
PASS,DB,SOMATIC) sample: Sample identifiersample_gt: Genotype (e.g., 0/1, 1/1)sample_dp: Read depth (total coverage)sample_gq: Genotype quality scoresample_ad: Alternate allele depth (from second value in AD field)sample_vaf: Variant allele frequency (sample_ad / sample_dp)
Basic Usage
Format Without Filtering
# Output to file
uvx pywombat input.tsv -o output
# Output to stdout (useful for piping)
uvx pywombat input.tsv
# Compressed output
uvx pywombat input.tsv -o output -f tsv.gz
# Parquet format (fastest for large files)
uvx pywombat input.tsv -o output -f parquet
# With verbose output
uvx pywombat input.tsv -o output --verbose
With Pedigree (Trio/Family Analysis)
Add parent genotype information for inheritance analysis:
uvx pywombat input.tsv --pedigree pedigree.tsv -o output
Pedigree File Format (tab-separated):
FID sample_id FatherBarcode MotherBarcode Sex Pheno
FAM1 Child1 Father1 Mother1 1 2
FAM1 Father1 0 0 1 1
FAM1 Mother1 0 0 2 1
FID: Family IDsample_id: Sample name (must match VCF)FatherBarcode: Father's sample name (0 = unknown)MotherBarcode: Mother's sample name (0 = unknown)Sex: 1=male, 2=female (or M/F)Pheno: 1=unaffected, 2=affected
Output with pedigree includes additional columns:
father_gt,father_dp,father_gq,father_ad,father_vafmother_gt,mother_dp,mother_gq,mother_ad,mother_vaf
Advanced Filtering
PyWombat supports two types of filtering:
- Expression-based filtering: For rare variants, impact filtering, frequency filtering
- De novo mutation detection: Specialized logic for identifying DNMs in trios/families
1. Rare Variant Filtering
Filter for ultra-rare, high-impact variants:
uvx pywombat input.tsv \
-F examples/rare_variants_high_impact.yml \
-o rare_variants
Configuration (rare_variants_high_impact.yml):
quality:
sample_dp_min: 10 # Minimum read depth
sample_gq_min: 19 # Minimum genotype quality
sample_vaf_het_min: 0.25 # Het VAF range: 25-75%
sample_vaf_het_max: 0.75
sample_vaf_homalt_min: 0.85 # Hom alt VAF ≥ 85%
apply_to_parents: true # Also filter parent genotypes
filter_no_alt_allele: true # Exclude 0/0 genotypes
expression: "VEP_CANONICAL = YES & VEP_IMPACT = HIGH & VEP_LoF = HC & VEP_LoF_flags = . & ( fafmax_faf95_max_genomes = null | fafmax_faf95_max_genomes <= 0.001 )"
What this does:
- Filters for canonical transcripts with HIGH impact
- Loss-of-function (LoF) with high confidence, no flags
- Ultra-rare: ≤0.1% frequency in gnomAD genomes
- Stringent quality: DP≥10, GQ≥19, appropriate VAF
2. De Novo Mutation Detection
Identify de novo mutations in trio data:
uvx pywombat input.tsv \
--pedigree pedigree.tsv \
-F examples/de_novo_mutations.yml \
-o denovo
Configuration (de_novo_mutations.yml):
quality:
sample_dp_min: 10
sample_gq_min: 18
sample_vaf_min: 0.20
dnm:
enabled: true
parent_dp_min: 10 # Parent quality thresholds
parent_gq_min: 18
parent_vaf_max: 0.02 # Max 2% VAF in parents
sample_vaf_hemizygous_min: 0.85 # For X male, Y, and hom variants
fafmax_faf95_max_genomes_max: 0.001 # Max 0.1% in gnomAD
genomes_filters_pass_only: true # Only PASS variants
par_regions: # Pseudo-autosomal regions (GRCh38)
grch38:
PAR1:
chrom: X
start: 10000
end: 2781479
PAR2:
chrom: X
start: 155701383
end: 156030895
Sex-Chromosome Aware Logic:
- Autosomes & PAR: Both parents must be 0/0 with VAF<2%
- X in males (non-PAR): Mother must be 0/0, father not informative
- Y in males: Father must be 0/0, mother N/A
- Hemizygous variants: Require VAF ≥ 85%
Expression-Based Filtering
Create custom filters using logical expressions:
Available Operators
- Comparison:
=,!=,<,>,<=,>= - Logical:
&(AND),|(OR) - Grouping:
(,) - Null checks:
= null,!= null
Example Expressions
# High or moderate impact with low frequency
expression: "(VEP_IMPACT = HIGH | VEP_IMPACT = MODERATE) & gnomad_AF < 0.001"
# Canonical transcripts with CADD score
expression: "VEP_CANONICAL = YES & CADD_PHRED >= 20"
# Specific consequence types
expression: "VEP_Consequence = frameshift_variant | VEP_Consequence = stop_gained"
# Multiple criteria
expression: "VEP_IMPACT = HIGH & VEP_CANONICAL = YES & gnomad_AF < 0.01 & CADD_PHRED >= 25"
Debug Mode
Inspect specific variants for troubleshooting:
uvx pywombat input.tsv \
-F config.yml \
--debug chr11:70486013
Shows:
- All rows matching the position
- VEP_SYMBOL if available
- All columns mentioned in filter expression
- Useful for understanding why variants pass/fail filters
Output Formats
TSV (Default)
uvx pywombat input.tsv -o output # Creates output.tsv
uvx pywombat input.tsv -o output -f tsv # Same as above
Compressed TSV
uvx pywombat input.tsv -o output -f tsv.gz # Creates output.tsv.gz
Parquet (Fastest for Large Files)
uvx pywombat input.tsv -o output -f parquet # Creates output.parquet
When to use Parquet:
- Large cohorts (>100 samples)
- Downstream analysis with Polars/Pandas
- Need for fast repeated access
- Storage efficiency
Example Workflows
1. Rare Disease Gene Discovery
# Step 1: Filter for rare, high-impact variants
uvx pywombat cohort.tsv \
-F examples/rare_variants_high_impact.yml \
-o rare_variants
# Step 2: Further filter with gene list (in downstream analysis)
# Use the output TSV with your favorite tools (R, Python, etc.)
2. Autism Trio Analysis
# Identify de novo mutations in autism cohort
uvx pywombat autism_trios.tsv \
--pedigree autism_pedigree.tsv \
-F examples/de_novo_mutations.yml \
-o autism_denovo \
--verbose
# Review output for genes in autism risk lists
3. Multi-Family Rare Variant Analysis
# Process multiple families together
uvx pywombat families.tsv \
--pedigree families_pedigree.tsv \
-F examples/rare_variants_high_impact.yml \
-o families_rare_variants \
-f parquet # Parquet for fast downstream analysis
4. Custom Expression Filter
Create custom_filter.yml:
quality:
sample_dp_min: 15
sample_gq_min: 20
sample_vaf_het_min: 0.30
sample_vaf_het_max: 0.70
expression: "VEP_IMPACT = HIGH & (gnomad_AF < 0.0001 | gnomad_AF = null)"
Apply:
uvx pywombat input.tsv -F custom_filter.yml -o output
Input Requirements
bcftools Tabulated Format
PyWombat expects TSV files created with bcftools:
# From VCF to tabulated TSV
bcftools query -HH -f '%CHROM\t%POS\t%REF\t%ALT\t%FILTER\t%INFO[\t%GT:%DP:%GQ:%AD]\n' \
input.vcf.gz > input.tsv
# With specific INFO fields
bcftools query -HH \
-f '%CHROM\t%POS\t%REF\t%ALT\t%FILTER\t%INFO/DP;%INFO/AF;%INFO/AC[\t%GT:%DP:%GQ:%AD]\n' \
input.vcf.gz > input.tsv
Expected format:
- Tab-separated values
- Header row with column names (use
-HHfor proper headers) (null)column containing INFO fields- Sample columns in
GT:DP:GQ:ADformat (or similar) - Optional FILTER column for quality control
VEP Annotations
For expression-based filtering on VEP annotations, a two-step process is required:
Step 1: Split VEP CSQ Field
IMPORTANT: PyWombat requires VEP annotations to be split into individual fields with the VEP_ prefix:
# Split VEP CSQ field - creates one row per transcript/consequence
# This prefixes all CSQ columns with "VEP_" (e.g., VEP_SYMBOL, VEP_IMPACT)
bcftools +split-vep -c - -p VEP_ -O b -o annotated.split.bcf input.vcf.gz
Step 2: Convert to Tabulated Format
# Extract to TSV with genotype information
bcftools query -HH \
-f '%CHROM\t%POS\t%REF\t%ALT\t%FILTER\t%INFO[\t%GT:%DP:%GQ:%AD]\n' \
annotated.split.bcf > annotated.tsv
Complete VEP Workflow
# 1. Annotate with VEP
vep -i input.vcf.gz \
--cache --offline \
--format vcf \
--vcf \
--everything \
--canonical \
--plugin LoF \
-o annotated.vcf.gz
# 2. Split VEP CSQ field (REQUIRED for PyWombat)
bcftools +split-vep -c - -p VEP_ -O b -o annotated.split.bcf annotated.vcf.gz
# 3. Convert to tabulated format
bcftools query -HH \
-f '%CHROM\t%POS\t%REF\t%ALT\t%FILTER\t%INFO[\t%GT:%DP:%GQ:%AD]\n' \
annotated.split.bcf > annotated.tsv
# 4. Process with PyWombat
uvx pywombat annotated.tsv -F examples/rare_variants_high_impact.yml -o output
Why split-vep is required:
- Creates one row per transcript/consequence (instead of all in one CSQ field)
- Prefixes VEP fields with
VEP_making them accessible in expressions - Enables filtering on
VEP_CANONICAL,VEP_IMPACT,VEP_LoF, etc.
Pipeline Optimization
For production workflows, these commands can be piped together:
# Efficient pipeline (single pass through data)
bcftools +split-vep -c - -p VEP_ input.vcf.gz | \
bcftools query -HH -f '%CHROM\t%POS\t%REF\t%ALT\t%FILTER\t%INFO[\t%GT:%DP:%GQ:%AD]\n' | \
uvx pywombat - -F config.yml -o output
Note: For multiple filter configurations, it's more efficient to save the intermediate TSV file rather than regenerating it each time.
gnomAD Annotations
For frequency filtering, annotate with gnomAD:
# Using bcftools annotate
bcftools annotate -a gnomad.genomes.vcf.gz \
-c INFO/AF,INFO/fafmax_faf95_max \
input.vcf.gz -o annotated.vcf.gz
Configuration Examples
See the examples/ directory for complete configuration files:
rare_variants_high_impact.yml: Ultra-rare, high-impact variantsde_novo_mutations.yml: De novo mutation detection with sex-chromosome handling
Each configuration file is fully documented with:
- Parameter descriptions
- Use case recommendations
- Customization tips
- Example command lines
Performance Tips
- Use streaming mode (default): Efficient for most workflows
- Parquet output: Faster for large files and repeated analysis
- Pre-filter with bcftools: Filter by region/gene before PyWombat
- Compressed input: PyWombat handles
.gzfiles natively - Filter early: Apply quality filters before complex expression filters
Development
Setup
# Clone repository
git clone https://github.com/bourgeron-lab/pywombat.git
cd pywombat
# Install dependencies
uv sync
# Run tests (if available)
uv run pytest
Project Structure
pywombat/
├── src/pywombat/
│ ├── __init__.py
│ └── cli.py # Main CLI implementation
├── examples/ # Configuration examples
│ ├── README.md
│ ├── rare_variants_high_impact.yml
│ └── de_novo_mutations.yml
├── tests/ # Test files and data
├── pyproject.toml # Project metadata
└── README.md
Technology Stack
- Polars: Fast DataFrame operations
- Click: CLI interface
- PyYAML: Configuration parsing
- uv: Package management
Troubleshooting
Common Issues
Issue: Column '(null)' not found
- Solution: Ensure input is bcftools tabulated format with INFO column
Issue: No parent genotypes found
- Solution: Check pedigree file format and sample name matching
Issue: DNM filter requires pedigree
- Solution: Add
--pedigreeoption when using de novo config
Issue: Variants missing from output
- Solution: Use
--debug chr:posto see why variants are filtered
Issue: Memory errors on large files
- Solution: Files are processed in streaming mode by default; if issues persist, pre-filter with bcftools
Getting Help
- Check
--helpfor command options:uvx pywombat --help - Review example configurations in
examples/ - Use
--debugmode to inspect specific variants - Use
--verboseto see filtering steps
Citation
If you use PyWombat in your research, please cite:
PyWombat: A tool for processing and filtering bcftools tabulated files
Bourgeron Lab, Institut Pasteur
https://github.com/bourgeron-lab/pywombat
License
MIT License - see LICENSE file for details
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
Authors
- Freddy Cliquet - Bourgeron Lab, Institut Pasteur
Acknowledgments
- Bourgeron Lab for project support
- Institut Pasteur for infrastructure
- Polars team for the excellent DataFrame library
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file pywombat-1.0.2.tar.gz.
File metadata
- Download URL: pywombat-1.0.2.tar.gz
- Upload date:
- Size: 40.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a2701f1db6fe69c680f11ee46f1ab2b7ac065d50846c78912090e14b7fe701ee
|
|
| MD5 |
55759eb394da8ddcd1d5db0921a20825
|
|
| BLAKE2b-256 |
127ae4e95f234ebcf0d018d6ccb55bcfc147a4c42690149e882692e674026b4d
|
Provenance
The following attestation bundles were made for pywombat-1.0.2.tar.gz:
Publisher:
publish.yml on bourgeron-lab/pywombat
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
pywombat-1.0.2.tar.gz -
Subject digest:
a2701f1db6fe69c680f11ee46f1ab2b7ac065d50846c78912090e14b7fe701ee - Sigstore transparency entry: 910992498
- Sigstore integration time:
-
Permalink:
bourgeron-lab/pywombat@905121864fd626ac8fcd57b15d223812ce31cd0e -
Branch / Tag:
refs/tags/v1.0.2 - Owner: https://github.com/bourgeron-lab
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@905121864fd626ac8fcd57b15d223812ce31cd0e -
Trigger Event:
release
-
Statement type:
File details
Details for the file pywombat-1.0.2-py3-none-any.whl.
File metadata
- Download URL: pywombat-1.0.2-py3-none-any.whl
- Upload date:
- Size: 23.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ad1779acfe5a536728ef8b81fe5ea817d852aa6e1f2483def85fdf103afc389d
|
|
| MD5 |
a1ceecd828ed1e9f01cc69afc8b7b947
|
|
| BLAKE2b-256 |
f2fff2c85bb805dc09d4f5c42349c3b51e181202d9452f28a64d6decbb195d61
|
Provenance
The following attestation bundles were made for pywombat-1.0.2-py3-none-any.whl:
Publisher:
publish.yml on bourgeron-lab/pywombat
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
pywombat-1.0.2-py3-none-any.whl -
Subject digest:
ad1779acfe5a536728ef8b81fe5ea817d852aa6e1f2483def85fdf103afc389d - Sigstore transparency entry: 910992535
- Sigstore integration time:
-
Permalink:
bourgeron-lab/pywombat@905121864fd626ac8fcd57b15d223812ce31cd0e -
Branch / Tag:
refs/tags/v1.0.2 - Owner: https://github.com/bourgeron-lab
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@905121864fd626ac8fcd57b15d223812ce31cd0e -
Trigger Event:
release
-
Statement type: