
A CLI tool for processing and filtering bcftools tabulated TSV files with pedigree support

PyWombat

A CLI tool for processing bcftools tabulated TSV files.

Installation

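This is a UV-managed Python package. To install it and its dependencies into the project's virtual environment, run the following from the repository root: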

uv sync

Usage

The wombat command processes bcftools tabulated TSV files:

# Format a bcftools TSV file and print to stdout
wombat input.tsv

# Format and save to output file (creates output.tsv by default)
wombat input.tsv -o output

# Format and save as parquet
wombat input.tsv -o output -f parquet
wombat input.tsv -o output --format parquet

# Format with pedigree information to add parent genotypes
wombat input.tsv --pedigree pedigree.tsv -o output

What does wombat do?

The wombat command processes bcftools tabulated TSV files by:

  1. Expanding the (null) column: This column contains multiple fields in the format NAME=value separated by semicolons (e.g., DP=30;AF=0.5;AC=2). Each field is extracted into its own column.

  2. Preserving the CSQ column: The CSQ (Consequence) column is preserved as-is and not melted, allowing VEP annotations to remain intact.

  3. Melting and splitting sample columns: After the (null) column, there are typically sample columns with values in GT:DP:GQ:AD format. The tool:

    • Extracts the sample name from each sample column header (the part before the first : character)
    • Transforms the wide format into long format
    • Creates a sample column with the sample names
    • Splits the sample values into separate columns:
      • sample_gt: Genotype (e.g., 0/1, 1/1)
      • sample_dp: Read depth
      • sample_gq: Genotype quality
      • sample_ad: Allele depth (the second value of the comma-separated AD list, i.e. the depth of the first alternate allele)
      • sample_vaf: Variant allele frequency (calculated as sample_ad / sample_dp)
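The three steps above can be sketched in plain Python. This is a minimal illustration that uses only the standard library, not PyWombat's actual Polars implementation. The names expand_info, melt_row and melt_tsv are hypothetical, the alphabetical ordering of the expanded fields is an assumption inferred from the example output, and missing values (such as ".") are not handled.

```python
import csv
import io


def expand_info(info):
    """Split 'DP=30;AF=0.5;AC=2' into {'DP': '30', 'AF': '0.5', 'AC': '2'}."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        # Assumption: flag-style entries without '=' get an empty value.
        name, _, value = item.partition("=")
        fields[name] = value
    return fields


def melt_row(row):
    """Turn one wide row into one long row per sample column."""
    # Columns without ':' in the header (CHROM, POS, ..., CSQ) are kept as-is.
    fixed = {k: v for k, v in row.items() if k != "(null)" and ":" not in k}
    info = dict(sorted(expand_info(row.get("(null)", "")).items()))
    long_rows = []
    for header, value in row.items():
        if ":" not in header:
            continue
        sample = header.split(":", 1)[0]  # part before the first ':'
        # Assumes complete GT:DP:GQ:AD values with at least two AD entries.
        gt, dp, gq, ad = value.split(":")
        depth = int(dp)
        alt_depth = int(ad.split(",")[1])  # second AD value
        long_rows.append({
            **fixed,
            **info,
            "sample": sample,
            "sample_gt": gt,
            "sample_dp": depth,
            "sample_gq": int(gq),
            "sample_ad": alt_depth,
            "sample_vaf": alt_depth / depth if depth else None,
        })
    return long_rows


def melt_tsv(text):
    """Apply melt_row to every data row of a tab-separated table."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [long_row for row in reader for long_row in melt_row(row)]


# The row used in the Example section, with real tab separators.
EXAMPLE = (
    "CHROM\tPOS\tREF\tALT\t(null)\t"
    "Sample1:GT:Sample1:DP:Sample1:GQ:Sample1:AD\t"
    "Sample2:GT:Sample2:DP:Sample2:GQ:Sample2:AD\n"
    "chr1\t100\tA\tT\tDP=30;AF=0.5;AC=2\t0/1:15:99:5,10\t1/1:18:99:0,18\n"
)

rows = melt_tsv(EXAMPLE)
for r in rows:
    print("\t".join(str(v) for v in r.values()))
```

Because CSQ has no : in its header, this sketch carries it through untouched along with CHROM, POS, REF and ALT.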

Example

Input:

CHROM  POS  REF  ALT  (null)             Sample1:GT:Sample1:DP:Sample1:GQ:Sample1:AD  Sample2:GT:Sample2:DP:Sample2:GQ:Sample2:AD
chr1   100  A    T    DP=30;AF=0.5;AC=2  0/1:15:99:5,10                               1/1:18:99:0,18

Output:

CHROM  POS  REF  ALT  AC  AF   DP  sample   sample_gt  sample_dp  sample_gq  sample_ad  sample_vaf
chr1   100  A    T    2   0.5  30  Sample1  0/1        15         99         10         0.6667
chr1   100  A    T    2   0.5  30  Sample2  1/1        18         99         18         1.0

Notes:

  • The sample_ad column contains the second value from the AD field (e.g., from 5,10 it extracts 10)
  • The sample_vaf column is the variant allele frequency calculated as sample_ad / sample_dp
  • By default, output is in TSV format. Use -f parquet to output as Parquet files
  • The -o option specifies an output prefix (e.g., -o output creates output.tsv or output.parquet)
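The prefix behaviour of -o can be pictured with a small helper. This is only a sketch: output_path and the EXTENSIONS table are hypothetical names, and appending the extension to the prefix (rather than replacing an existing suffix) is an assumption about how dotted prefixes are treated.

```python
from pathlib import Path

# Hypothetical mapping from the -f/--format value to a file extension.
EXTENSIONS = {"tsv": ".tsv", "parquet": ".parquet"}


def output_path(prefix, fmt="tsv"):
    """Build the output file path from a prefix and a format name."""
    try:
        extension = EXTENSIONS[fmt]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt!r}") from None
    # Append instead of Path.with_suffix so a prefix like "run.v2" keeps its dot.
    return Path(prefix + extension)


print(output_path("output"))
print(output_path("output", "parquet"))
```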

Pedigree Support

You can provide a pedigree file with the --pedigree option to add parent genotype information to the output. This enables trio analysis by including the father's and mother's genotypes for each sample.

Pedigree File Format:

The pedigree file should be a tab-separated file with the following columns:

  • FID: Family ID
  • sample_id: Sample identifier (must match a sample name in the input TSV, which bcftools takes from the VCF)
  • FatherBarcode: Father's sample identifier (use 0 or -9 if unknown)
  • MotherBarcode: Mother's sample identifier (use 0 or -9 if unknown)
  • Sex: Sex of the sample (optional)
  • Pheno: Phenotype information (optional)

Example pedigree file:

FID   sample_id  FatherBarcode  MotherBarcode  Sex  Pheno
FAM1  Child1     Father1        Mother1        1    2
FAM1  Father1    0              0              1    1
FAM1  Mother1    0              0              2    1
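A pedigree file in this layout can be read into a sample-to-parents lookup along the following lines. This is a sketch with the standard library; read_pedigree and UNKNOWN_PARENT are hypothetical names, and treating 0 and -9 as "no parent" follows the column description above.

```python
import csv
import io

# Values that mean "parent unknown" in the FatherBarcode/MotherBarcode columns.
UNKNOWN_PARENT = {"0", "-9"}


def read_pedigree(text):
    """Map each sample_id to a (father, mother) tuple; unknown parents become None."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    parents = {}
    for record in reader:
        father = record["FatherBarcode"]
        mother = record["MotherBarcode"]
        parents[record["sample_id"]] = (
            None if father in UNKNOWN_PARENT else father,
            None if mother in UNKNOWN_PARENT else mother,
        )
    return parents


# The example pedigree file, with real tab separators.
PEDIGREE = (
    "FID\tsample_id\tFatherBarcode\tMotherBarcode\tSex\tPheno\n"
    "FAM1\tChild1\tFather1\tMother1\t1\t2\n"
    "FAM1\tFather1\t0\t0\t1\t1\n"
    "FAM1\tMother1\t0\t0\t2\t1\n"
)

parents = read_pedigree(PEDIGREE)
print(parents["Child1"])
```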

Output with Pedigree:

When using --pedigree, the output will include additional columns for each parent:

  • father_gt, father_dp, father_gq, father_ad, father_vaf: Father's genotype information
  • mother_gt, mother_dp, mother_gq, mother_ad, mother_vaf: Mother's genotype information

These columns will contain the parent's genotype data for the same variant, allowing you to analyze inheritance patterns.
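One way to picture how the parent columns are filled is a lookup keyed on variant and sample. The sketch below is an illustration, not the tool's implementation: add_parent_columns and long_row are hypothetical names, the trio values are made up, and leaving parent columns empty (None) when a parent is unknown or absent is an assumption.

```python
PARENT_FIELDS = ("gt", "dp", "gq", "ad", "vaf")
VARIANT_KEY = ("CHROM", "POS", "REF", "ALT")


def add_parent_columns(rows, parents):
    """Return copies of the long-format rows with father_* and mother_* columns."""
    # Index rows by (variant, sample) so a parent's row for the same variant
    # can be found directly.
    index = {tuple(r[k] for k in VARIANT_KEY) + (r["sample"],): r for r in rows}
    annotated = []
    for r in rows:
        variant = tuple(r[k] for k in VARIANT_KEY)
        father, mother = parents.get(r["sample"], (None, None))
        new_row = dict(r)
        for role, parent in (("father", father), ("mother", mother)):
            parent_row = index.get(variant + (parent,))
            for field in PARENT_FIELDS:
                new_row[f"{role}_{field}"] = (
                    parent_row[f"sample_{field}"] if parent_row else None
                )
        annotated.append(new_row)
    return annotated


def long_row(sample, gt, dp, gq, ad):
    """Build one illustrative long-format row for a single variant."""
    return {
        "CHROM": "chr1", "POS": "100", "REF": "A", "ALT": "T",
        "sample": sample, "sample_gt": gt, "sample_dp": dp,
        "sample_gq": gq, "sample_ad": ad, "sample_vaf": ad / dp,
    }


# Made-up trio data for illustration only.
trio_rows = [
    long_row("Child1", "0/1", 20, 99, 10),
    long_row("Father1", "0/1", 30, 99, 15),
    long_row("Mother1", "0/0", 25, 99, 0),
]
trio_parents = {
    "Child1": ("Father1", "Mother1"),
    "Father1": (None, None),
    "Mother1": (None, None),
}

annotated = add_parent_columns(trio_rows, trio_parents)
child = annotated[0]
print(child["sample_gt"], child["father_gt"], child["mother_gt"])
```

With the child's and both parents' genotypes on one row, a check such as "child is 0/1 while both parents are 0/0" becomes a simple per-row comparison.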

Development

This project uses:

  • UV for package management
  • Polars for fast data processing
  • Click for the command-line interface

Testing

Test files are available in the tests/ directory:

  • test.tabulated.tsv - Real bcftools output
  • test_small.tsv - Small example for quick testing
