
A CLI tool for processing and filtering bcftools tabulated TSV files with pedigree support


PyWombat

A CLI tool for processing bcftools tabulated TSV files.

Installation

This is a UV-managed Python package. To install:

uv sync

Usage

The wombat command processes bcftools tabulated TSV files:

# Format a bcftools TSV file and print to stdout
wombat input.tsv

# Format and save to output file (creates output.tsv by default)
wombat input.tsv -o output

# Format and save as parquet
wombat input.tsv -o output -f parquet
wombat input.tsv -o output --format parquet

# Format with pedigree information to add parent genotypes
wombat input.tsv --pedigree pedigree.tsv -o output

What does wombat do?

The wombat command processes bcftools tabulated TSV files by:

  1. Expanding the (null) column: This column contains multiple fields in the format NAME=value separated by semicolons (e.g., DP=30;AF=0.5;AC=2). Each field is extracted into its own column.

  2. Preserving the CSQ column: The CSQ (Consequence) column is preserved as-is and not melted, allowing VEP annotations to remain intact.

  3. Melting and splitting sample columns: After the (null) column, there are typically sample columns with values in GT:DP:GQ:AD format. The tool:

    • Extracts the sample name (the part before the first : character)
    • Transforms the wide format into long format
    • Creates a sample column with the sample names
    • Splits the sample values into separate columns:
      • sample_gt: Genotype (e.g., 0/1, 1/1)
      • sample_dp: Read depth
      • sample_gq: Genotype quality
      • sample_ad: Allele depth (the second value of the comma-separated AD list, i.e. the read count for the first ALT allele)
      • sample_vaf: Variant allele frequency (calculated as sample_ad / sample_dp)
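
The expansion and melting steps above can be sketched in plain Python. This is only an illustration of the transformation, not the tool's actual code (PyWombat uses Polars); the helper names expand_info and melt_row and the intermediate sample_value key are hypothetical.

```python
def expand_info(field: str) -> dict:
    """Split 'NAME=value;NAME=value' into a dict (bare flags map to '')."""
    out = {}
    for item in field.split(";"):
        if not item:
            continue
        name, _, value = item.partition("=")
        out[name] = value
    return out


def melt_row(header: list, row: list) -> list:
    """Turn one wide row into one long row per sample column."""
    info_idx = header.index("(null)")
    # Columns before (null) are carried over unchanged.
    base = dict(zip(header[:info_idx], row[:info_idx]))
    base.update(expand_info(row[info_idx]))
    long_rows = []
    # Assumption: every column after (null) is a sample column.
    for col, value in zip(header[info_idx + 1:], row[info_idx + 1:]):
        sample = col.split(":", 1)[0]  # part before the first ':'
        long_rows.append({**base, "sample": sample, "sample_value": value})
    return long_rows


header = ["CHROM", "POS", "REF", "ALT", "(null)",
          "Sample1:GT:Sample1:DP:Sample1:GQ:Sample1:AD",
          "Sample2:GT:Sample2:DP:Sample2:GQ:Sample2:AD"]
row = ["chr1", "100", "A", "T", "DP=30;AF=0.5;AC=2",
       "0/1:15:99:5,10", "1/1:18:99:0,18"]
rows = melt_row(header, row)
```

Each element of rows is one variant/sample pair; the raw sample_value string still has to be split into the sample_* columns.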

Example

Input:

CHROM POS REF ALT (null) Sample1:GT:Sample1:DP:Sample1:GQ:Sample1:AD Sample2:GT:Sample2:DP:Sample2:GQ:Sample2:AD
chr1 100 A T DP=30;AF=0.5;AC=2 0/1:15:99:5,10 1/1:18:99:0,18

Output:

CHROM POS REF ALT AC AF DP sample sample_gt sample_dp sample_gq sample_ad sample_vaf
chr1 100 A T 2 0.5 30 Sample1 0/1 15 99 10 0.6667
chr1 100 A T 2 0.5 30 Sample2 1/1 18 99 18 1.0

Notes:

  • The sample_ad column contains the second value from the AD field (e.g., from 5,10 it extracts 10)
  • The sample_vaf column is the variant allele frequency calculated as sample_ad / sample_dp
  • By default, output is in TSV format. Use -f parquet to output as Parquet files
  • The -o option specifies an output prefix (e.g., -o output creates output.tsv or output.parquet)
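
As a rough illustration of how sample_ad and sample_vaf relate to a raw GT:DP:GQ:AD value, here is a minimal sketch. The function name split_sample is hypothetical, and returning None for missing values (".") or zero depth is an assumption, not documented behaviour of wombat.

```python
def split_sample(value: str) -> dict:
    """Split a 'GT:DP:GQ:AD' string into the sample_* columns."""
    gt, dp, gq, ad = value.split(":")
    ad_parts = ad.split(",")
    # Second AD value = read count for the first ALT allele.
    alt_depth = int(ad_parts[1]) if len(ad_parts) > 1 and ad_parts[1].isdigit() else None
    depth = int(dp) if dp.isdigit() else None
    # Assumption: VAF is left undefined when depth is missing or zero.
    vaf = alt_depth / depth if (alt_depth is not None and depth) else None
    return {
        "sample_gt": gt,
        "sample_dp": depth,
        "sample_gq": int(gq) if gq.isdigit() else None,
        "sample_ad": alt_depth,
        "sample_vaf": vaf,
    }


sample1 = split_sample("0/1:15:99:5,10")   # sample_ad = 10, sample_vaf = 10 / 15
sample2 = split_sample("1/1:18:99:0,18")   # sample_ad = 18, sample_vaf = 18 / 18
```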

Pedigree Support

You can provide a pedigree file with the --pedigree option to add parent genotype information to the output. This enables trio analysis by including the father's and mother's genotypes for each sample.

Pedigree File Format:

The pedigree file should be a tab-separated file with the following columns:

  • FID: Family ID
  • sample_id: Sample identifier (matches the sample names in the input file)
  • FatherBarcode: Father's sample identifier (use 0 or -9 if unknown)
  • MotherBarcode: Mother's sample identifier (use 0 or -9 if unknown)
  • Sex: Sex of the sample (optional)
  • Pheno: Phenotype information (optional)

Example pedigree file:

FID sample_id FatherBarcode MotherBarcode Sex Pheno
FAM1 Child1 Father1 Mother1 1 2
FAM1 Father1 0 0 1 1
FAM1 Mother1 0 0 2 1
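
A pedigree file in this layout can be read with the standard library alone. The sketch below is illustrative rather than the tool's own parser; read_pedigree is a hypothetical helper, and it maps the documented unknown markers (0 and -9) to None.

```python
import csv
import io

UNKNOWN = {"0", "-9"}  # markers for an unknown parent, as listed above


def read_pedigree(handle) -> dict:
    """Map each sample_id to its father and mother IDs (None if unknown)."""
    parents = {}
    for rec in csv.DictReader(handle, delimiter="\t"):
        father = rec["FatherBarcode"]
        mother = rec["MotherBarcode"]
        parents[rec["sample_id"]] = {
            "father": None if father in UNKNOWN else father,
            "mother": None if mother in UNKNOWN else mother,
        }
    return parents


example = (
    "FID\tsample_id\tFatherBarcode\tMotherBarcode\tSex\tPheno\n"
    "FAM1\tChild1\tFather1\tMother1\t1\t2\n"
    "FAM1\tFather1\t0\t0\t1\t1\n"
    "FAM1\tMother1\t0\t0\t2\t1\n"
)
pedigree = read_pedigree(io.StringIO(example))
```

With a real file, pass an open handle instead of the StringIO object, e.g. open("pedigree.tsv", newline="").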

Output with Pedigree:

When using --pedigree, the output will include additional columns for each parent:

  • father_gt, father_dp, father_gq, father_ad, father_vaf: Father's genotype information
  • mother_gt, mother_dp, mother_gq, mother_ad, mother_vaf: Mother's genotype information

These columns will contain the parent's genotype data for the same variant, allowing you to analyze inheritance patterns.
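
Conceptually, the parent columns come from looking up the father's and mother's rows for the same variant. The following plain-Python sketch shows that lookup under stated assumptions: add_parent_columns and make_row are hypothetical names, the genotype values are made up for illustration, and the real tool (built on Polars) may differ, for example in whether parent rows remain in the output.

```python
FIELDS = ("gt", "dp", "gq", "ad", "vaf")


def add_parent_columns(rows, parents):
    """Return copies of long-format rows with father_* and mother_* columns."""
    def key(r, sample):
        return (r["CHROM"], r["POS"], r["REF"], r["ALT"], sample)

    # Index rows by (variant, sample) so parents can be found per variant.
    by_variant_sample = {key(r, r["sample"]): r for r in rows}
    out = []
    for r in rows:
        new = dict(r)
        for role in ("father", "mother"):
            parent_id = parents.get(r["sample"], {}).get(role)
            parent_row = by_variant_sample.get(key(r, parent_id)) if parent_id else None
            for f in FIELDS:
                new[f"{role}_{f}"] = parent_row[f"sample_{f}"] if parent_row else None
        out.append(new)
    return out


def make_row(sample, gt, dp, gq, ad):
    # Illustrative values only; one variant shared by all three samples.
    return {"CHROM": "chr1", "POS": 100, "REF": "A", "ALT": "T", "sample": sample,
            "sample_gt": gt, "sample_dp": dp, "sample_gq": gq,
            "sample_ad": ad, "sample_vaf": ad / dp}


rows = [make_row("Child1", "0/1", 20, 99, 10),
        make_row("Father1", "0/1", 30, 99, 15),
        make_row("Mother1", "0/0", 25, 99, 0)]
parents = {"Child1": {"father": "Father1", "mother": "Mother1"},
           "Father1": {"father": None, "mother": None},
           "Mother1": {"father": None, "mother": None}}
trio = add_parent_columns(rows, parents)
```

In this sketch the child's row carries both parents' genotypes, while rows for samples without known parents get None in every father_* and mother_* column.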

Development

This project uses:

  • UV for package management
  • Polars for fast data processing
  • Click for the command-line interface

Testing

Test files are available in the tests/ directory:

  • test.tabulated.tsv - Real bcftools output
  • test_small.tsv - Small example for quick testing
