A CLI tool for processing and filtering bcftools tabulated TSV files with pedigree support

PyWombat

A CLI tool for processing bcftools tabulated TSV files.

Installation

This is a uv-managed Python package. To install it from a checkout of the repository, run:

uv sync

Usage

The wombat command processes bcftools tabulated TSV files:

# Format a bcftools TSV file and print to stdout
wombat input.tsv

# Format and save to output file (creates output.tsv by default)
wombat input.tsv -o output

# Format and save as parquet
wombat input.tsv -o output -f parquet
wombat input.tsv -o output --format parquet

# Format with pedigree information to add parent genotypes
wombat input.tsv --pedigree pedigree.tsv -o output

What does wombat do?

The wombat command processes bcftools tabulated TSV files by:

  1. Expanding the (null) column: This column contains multiple fields in the format NAME=value separated by semicolons (e.g., DP=30;AF=0.5;AC=2). Each field is extracted into its own column.

  2. Preserving the CSQ column: The CSQ (Consequence) column is preserved as-is and not melted, allowing VEP annotations to remain intact.

  3. Melting and splitting sample columns: After the (null) column, there are typically sample columns with values in GT:DP:GQ:AD format. The tool:

    • Extracts the sample name (the part before the first : character)
    • Transforms the wide format into long format
    • Creates a sample column with the sample names
    • Splits the sample values into separate columns:
      • sample_gt: Genotype (e.g., 0/1, 1/1)
      • sample_dp: Read depth
      • sample_gq: Genotype quality
      • sample_ad: Allele depth (takes the second value from comma-separated list)
      • sample_vaf: Variant allele frequency (calculated as sample_ad / sample_dp)
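Step 1 above can be sketched in plain Python. This is an illustration of the NAME=value expansion only, not PyWombat's own Polars-based code; the function name expand_info and the handling of flag-style entries without an = sign are assumptions.

```python
def expand_info(info: str) -> dict[str, str]:
    """Split a string such as 'DP=30;AF=0.5;AC=2' into one entry per field."""
    fields: dict[str, str] = {}
    for item in info.split(";"):
        if not item or item == ".":
            # Skip empty entries and the VCF missing-value marker.
            continue
        name, sep, value = item.partition("=")
        # Assumption: an entry without '=' (a flag) is kept with an empty value.
        fields[name] = value if sep else ""
    return fields


print(expand_info("DP=30;AF=0.5;AC=2"))
# {'DP': '30', 'AF': '0.5', 'AC': '2'}
```

Each key of the returned dictionary corresponds to one output column.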

Example

Input:

CHROM POS REF ALT (null) Sample1:GT:Sample1:DP:Sample1:GQ:Sample1:AD Sample2:GT:Sample2:DP:Sample2:GQ:Sample2:AD
chr1 100 A T DP=30;AF=0.5;AC=2 0/1:15:99:5,10 1/1:18:99:0,18

Output:

CHROM POS REF ALT AC AF DP sample sample_gt sample_dp sample_gq sample_ad sample_vaf
chr1 100 A T 2 0.5 30 Sample1 0/1 15 99 10 0.6667
chr1 100 A T 2 0.5 30 Sample2 1/1 18 99 18 1.0
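The sample columns of the example can be reproduced with a short plain-Python sketch of the wide-to-long step. It is not PyWombat's implementation: the function name melt_samples is hypothetical, the (null) column is left unexpanded for brevity, every sample value is assumed to be complete (no missing "." fields), and the rounding to four decimals simply mirrors the table above.

```python
header = [
    "CHROM", "POS", "REF", "ALT", "(null)",
    "Sample1:GT:Sample1:DP:Sample1:GQ:Sample1:AD",
    "Sample2:GT:Sample2:DP:Sample2:GQ:Sample2:AD",
]
row = ["chr1", "100", "A", "T", "DP=30;AF=0.5;AC=2",
       "0/1:15:99:5,10", "1/1:18:99:0,18"]


def melt_samples(header, row):
    """Turn one wide row into one long row per sample column."""
    info_idx = header.index("(null)")
    fixed = dict(zip(header[:info_idx], row[:info_idx]))
    long_rows = []
    for col, value in zip(header[info_idx + 1:], row[info_idx + 1:]):
        sample = col.split(":", 1)[0]      # part before the first ':'
        gt, dp, gq, ad = value.split(":")  # GT:DP:GQ:AD
        depth = int(dp)
        alt_depth = int(ad.split(",")[1])  # second value of the AD list
        vaf = round(alt_depth / depth, 4) if depth else None
        long_rows.append({
            **fixed,
            "sample": sample,
            "sample_gt": gt,
            "sample_dp": depth,
            "sample_gq": int(gq),
            "sample_ad": alt_depth,
            "sample_vaf": vaf,
        })
    return long_rows


for r in melt_samples(header, row):
    print(r["sample"], r["sample_gt"], r["sample_ad"], r["sample_vaf"])
# Sample1 0/1 10 0.6667
# Sample2 1/1 18 1.0
```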

Notes:

  • The sample_ad column contains the second value from the AD field (e.g., from 5,10 it extracts 10)
  • The sample_vaf column is the variant allele frequency calculated as sample_ad / sample_dp
  • By default, output is in TSV format. Use -f parquet (or --format parquet) to write Parquet instead
  • The -o option specifies an output prefix (e.g., -o output creates output.tsv or output.parquet)

Pedigree Support

You can provide a pedigree file with the --pedigree option to add parent genotype information to the output. This enables trio analysis by including the father's and mother's genotypes for each sample.

Pedigree File Format:

The pedigree file should be a tab-separated file with the following columns:

  • FID: Family ID
  • sample_id: Sample identifier (matches the sample names in the header of the input TSV, which come from the original VCF)
  • FatherBarcode: Father's sample identifier (use 0 or -9 if unknown)
  • MotherBarcode: Mother's sample identifier (use 0 or -9 if unknown)
  • Sex: Sex of the sample (optional)
  • Pheno: Phenotype information (optional)

Example pedigree file:

FID sample_id FatherBarcode MotherBarcode Sex Pheno
FAM1 Child1 Father1 Mother1 1 2
FAM1 Father1 0 0 1 1
FAM1 Mother1 0 0 2 1
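A file in this layout can be read with the standard library alone. The sketch below is a minimal illustration, not PyWombat's parser: the function name read_pedigree is hypothetical, and it maps the 0 and -9 placeholders described above to None.

```python
import csv
import io

UNKNOWN = {"0", "-9"}  # placeholders for an unknown parent


def read_pedigree(handle):
    """Map each sample_id to a (father, mother) pair; None means unknown."""
    parents = {}
    for rec in csv.DictReader(handle, delimiter="\t"):
        father = rec["FatherBarcode"]
        mother = rec["MotherBarcode"]
        parents[rec["sample_id"]] = (
            None if father in UNKNOWN else father,
            None if mother in UNKNOWN else mother,
        )
    return parents


example = (
    "FID\tsample_id\tFatherBarcode\tMotherBarcode\tSex\tPheno\n"
    "FAM1\tChild1\tFather1\tMother1\t1\t2\n"
    "FAM1\tFather1\t0\t0\t1\t1\n"
    "FAM1\tMother1\t0\t0\t2\t1\n"
)
parents = read_pedigree(io.StringIO(example))
print(parents["Child1"])   # ('Father1', 'Mother1')
print(parents["Father1"])  # (None, None)
```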

Output with Pedigree:

When using --pedigree, the output will include additional columns for each parent:

  • father_gt, father_dp, father_gq, father_ad, father_vaf: Father's genotype information
  • mother_gt, mother_dp, mother_gq, mother_ad, mother_vaf: Mother's genotype information

These columns will contain the parent's genotype data for the same variant, allowing you to analyze inheritance patterns.
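One way to picture how the parent columns are filled is a lookup by variant and sample name. The following is a plain-Python sketch under stated assumptions, not PyWombat's actual join: add_parent_columns is a hypothetical name, the genotype values are made up for illustration, and a missing parent is assumed to yield empty (None) columns.

```python
def variant_key(r):
    """Identify a variant by its coordinates and alleles."""
    return (r["CHROM"], r["POS"], r["REF"], r["ALT"])


def add_parent_columns(rows, parents):
    """Return copies of `rows` with father_* and mother_* columns attached."""
    by_key = {(variant_key(r), r["sample"]): r for r in rows}
    out = []
    for r in rows:
        father, mother = parents.get(r["sample"], (None, None))
        new = dict(r)
        for role, parent in (("father", father), ("mother", mother)):
            # Look up the parent's row for the same variant, if any.
            parent_row = by_key.get((variant_key(r), parent))
            for f in ("gt", "dp", "gq", "ad", "vaf"):
                new[f"{role}_{f}"] = parent_row[f"sample_{f}"] if parent_row else None
        out.append(new)
    return out


def make_row(sample, gt, dp, gq, ad, vaf):
    # Illustrative long-format row for a single variant.
    return {"CHROM": "chr1", "POS": "100", "REF": "A", "ALT": "T",
            "sample": sample, "sample_gt": gt, "sample_dp": dp,
            "sample_gq": gq, "sample_ad": ad, "sample_vaf": vaf}


rows = [
    make_row("Child1", "0/1", 15, 99, 10, 0.6667),
    make_row("Father1", "0/0", 20, 99, 0, 0.0),
    make_row("Mother1", "0/1", 18, 99, 9, 0.5),
]
parents = {"Child1": ("Father1", "Mother1"),
           "Father1": (None, None), "Mother1": (None, None)}

child = add_parent_columns(rows, parents)[0]
print(child["sample_gt"], child["father_gt"], child["mother_gt"])
# 0/1 0/0 0/1
```

With all three genotypes on one row, a check such as "heterozygous in the child, reference in both parents" becomes a simple row-wise filter.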

Development

This project uses:

  • UV for package management
  • Polars for fast data processing
  • Click for the command-line interface

Testing

Test files are available in the tests/ directory:

  • test.tabulated.tsv - Real bcftools output
  • test_small.tsv - Small example for quick testing
