
A CLI tool for processing and filtering bcftools tabulated TSV files with pedigree support


PyWombat

A CLI tool for processing bcftools tabulated TSV files.

Installation

This package is managed with uv. From a checkout of the repository, install it and its dependencies with:

uv sync

Usage

The wombat command processes bcftools tabulated TSV files:

# Format a bcftools TSV file and print to stdout
wombat input.tsv

# Format and save to output file (creates output.tsv by default)
wombat input.tsv -o output

# Format and save as parquet
wombat input.tsv -o output -f parquet
wombat input.tsv -o output --format parquet

# Format with pedigree information to add parent genotypes
wombat input.tsv --pedigree pedigree.tsv -o output

What does wombat do?

The wombat command processes bcftools tabulated TSV files by:

  1. Expanding the (null) column: This column contains multiple fields in the format NAME=value separated by semicolons (e.g., DP=30;AF=0.5;AC=2). Each field is extracted into its own column.

  2. Preserving the CSQ column: The CSQ (Consequence) column is preserved as-is and not melted, allowing VEP annotations to remain intact.

  3. Melting and splitting sample columns: After the (null) column, there are typically sample columns with values in GT:DP:GQ:AD format. The tool:

    • Extracts the sample name (the part before the first : character)
    • Transforms the wide format into long format
    • Creates a sample column with the sample names
    • Splits the sample values into separate columns:
      • sample_gt: Genotype (e.g., 0/1, 1/1)
      • sample_dp: Read depth
      • sample_gq: Genotype quality
      • sample_ad: Allele depth (the second value of the comma-separated AD list, i.e. the read count supporting the first alternate allele)
      • sample_vaf: Variant allele frequency (calculated as sample_ad / sample_dp)
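Steps 1 and 3 above rest on two small string operations, sketched below in plain Python. This is an illustration only, not wombat's actual implementation (which uses Polars). The function names expand_info and sample_name are hypothetical, and storing flag-style entries without an = sign as empty strings is an assumption.

```python
def expand_info(info: str) -> dict[str, str]:
    """Split a 'NAME=value;NAME=value' string into a dict of strings."""
    fields: dict[str, str] = {}
    for item in info.split(";"):
        if not item:
            continue  # tolerate empty segments such as a trailing ';'
        name, sep, value = item.partition("=")
        # Assumption: entries without '=' (flags) are stored with an empty value.
        fields[name] = value if sep else ""
    return fields


def sample_name(header_column: str) -> str:
    """Return the part of a sample column header before the first ':'."""
    return header_column.split(":", 1)[0]


info = expand_info("DP=30;AF=0.5;AC=2")
name = sample_name("Sample1:GT:Sample1:DP:Sample1:GQ:Sample1:AD")
print(info, name)
```

All values stay as strings here; converting them to numeric types is left out of the sketch.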

Example

Input:

CHROM  POS  REF  ALT  (null)             Sample1:GT:Sample1:DP:Sample1:GQ:Sample1:AD  Sample2:GT:Sample2:DP:Sample2:GQ:Sample2:AD
chr1   100  A    T    DP=30;AF=0.5;AC=2  0/1:15:99:5,10                               1/1:18:99:0,18

Output:

CHROM  POS  REF  ALT  AC  AF   DP  sample   sample_gt  sample_dp  sample_gq  sample_ad  sample_vaf
chr1   100  A    T    2   0.5  30  Sample1  0/1        15         99         10         0.6667
chr1   100  A    T    2   0.5  30  Sample2  1/1        18         99         18         1.0
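The transformation from the input row to the two output rows can be reproduced with a short, stdlib-only sketch. It is not wombat's code: melt_row, FIXED_COLUMNS and INFO_COLUMN are hypothetical names, the fixed columns are assumed to be CHROM, POS, REF and ALT, rounding the VAF to four decimals is inferred from the example output, and returning None when the depth is zero is an assumption. CSQ handling is omitted.

```python
FIXED_COLUMNS = ("CHROM", "POS", "REF", "ALT")  # assumed fixed columns
INFO_COLUMN = "(null)"


def melt_row(header, row):
    """Turn one wide row into one long row per sample column."""
    record = dict(zip(header, row))
    base = {name: record[name] for name in FIXED_COLUMNS}
    info = dict(
        item.split("=", 1) for item in record[INFO_COLUMN].split(";") if "=" in item
    )
    # The example output lists the expanded fields alphabetically (AC, AF, DP).
    for key in sorted(info):
        base[key] = info[key]

    long_rows = []
    for column in header:
        if column in FIXED_COLUMNS or column == INFO_COLUMN:
            continue
        gt, dp, gq, ad = record[column].split(":")
        depth = int(dp)
        alt_depth = int(ad.split(",")[1])  # second AD value
        # Assumption: VAF is undefined (None) when the depth is zero.
        vaf = round(alt_depth / depth, 4) if depth else None
        long_rows.append({
            **base,
            "sample": column.split(":", 1)[0],
            "sample_gt": gt,
            "sample_dp": depth,
            "sample_gq": int(gq),
            "sample_ad": alt_depth,
            "sample_vaf": vaf,
        })
    return long_rows


HEADER = [
    "CHROM", "POS", "REF", "ALT", "(null)",
    "Sample1:GT:Sample1:DP:Sample1:GQ:Sample1:AD",
    "Sample2:GT:Sample2:DP:Sample2:GQ:Sample2:AD",
]
ROW = ["chr1", "100", "A", "T", "DP=30;AF=0.5;AC=2", "0/1:15:99:5,10", "1/1:18:99:0,18"]

rows = melt_row(HEADER, ROW)
for r in rows:
    print("\t".join(str(v) for v in r.values()))
```

For the example row this gives sample_vaf values of 0.6667 (10 / 15) for Sample1 and 1.0 (18 / 18) for Sample2, matching the output above.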

Notes:

  • The sample_ad column contains only the second value of the AD field (e.g., 10 from 5,10)
  • The sample_vaf column is the variant allele frequency, calculated as sample_ad / sample_dp (e.g., 10 / 15 = 0.6667 for Sample1)
  • By default, output is in TSV format. Use -f parquet to output as Parquet files
  • The -o option specifies an output prefix (e.g., -o output creates output.tsv or output.parquet)
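The prefix rule in the last two notes amounts to appending the format name as a file extension. A minimal sketch follows, assuming tsv and parquet are the only formats (the only two this README mentions). The helper output_path is hypothetical and is not part of wombat's API.

```python
from pathlib import Path

SUPPORTED_FORMATS = ("tsv", "parquet")  # the two formats named in this README


def output_path(prefix: str, fmt: str = "tsv") -> Path:
    """Build the output file name from a prefix and a format name."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return Path(f"{prefix}.{fmt}")


print(output_path("output"))             # default format
print(output_path("output", "parquet"))  # as with -f parquet
```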

Pedigree Support

You can provide a pedigree file with the --pedigree option to add parent genotype information to the output. This enables trio analysis by including the father's and mother's genotypes for each sample.

Pedigree File Format:

The pedigree file should be a tab-separated file with the following columns:

  • FID: Family ID
  • sample_id: Sample identifier (must match a sample name in the input file, as taken from the VCF)
  • FatherBarcode: Father's sample identifier (use 0 or -9 if unknown)
  • MotherBarcode: Mother's sample identifier (use 0 or -9 if unknown)
  • Sex: Sex of the sample (optional)
  • Pheno: Phenotype information (optional)

Example pedigree file:

FID   sample_id  FatherBarcode  MotherBarcode  Sex  Pheno
FAM1  Child1     Father1        Mother1        1    2
FAM1  Father1    0              0              1    1
FAM1  Mother1    0              0              2    1
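A pedigree file in this layout can be read with Python's csv module. The sketch below is an illustration under stated assumptions, not wombat's own parser: read_pedigree is a hypothetical helper, the file is assumed to have a header row with the column names listed above, and the 0 and -9 placeholders are mapped to None.

```python
import csv
import io

UNKNOWN_PARENT = {"0", "-9"}  # placeholders for an unknown parent


def read_pedigree(handle):
    """Map each sample_id to a (father, mother) tuple; unknown parents are None."""
    parents = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        father = row["FatherBarcode"]
        mother = row["MotherBarcode"]
        parents[row["sample_id"]] = (
            None if father in UNKNOWN_PARENT else father,
            None if mother in UNKNOWN_PARENT else mother,
        )
    return parents


# The example pedigree from above, written with real tab separators.
PEDIGREE_TEXT = (
    "FID\tsample_id\tFatherBarcode\tMotherBarcode\tSex\tPheno\n"
    "FAM1\tChild1\tFather1\tMother1\t1\t2\n"
    "FAM1\tFather1\t0\t0\t1\t1\n"
    "FAM1\tMother1\t0\t0\t2\t1\n"
)

parents = read_pedigree(io.StringIO(PEDIGREE_TEXT))
print(parents)
```

To read from disk instead, pass an open file handle, for example open("pedigree.tsv", newline="").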

Output with Pedigree:

When using --pedigree, the output will include additional columns for each parent:

  • father_gt, father_dp, father_gq, father_ad, father_vaf: Father's genotype information
  • mother_gt, mother_dp, mother_gq, mother_ad, mother_vaf: Mother's genotype information

These columns will contain the parent's genotype data for the same variant, allowing you to analyze inheritance patterns.
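Conceptually, adding the parent columns is a lookup of the father's and mother's rows for the same variant. The sketch below shows one way to do that on long-format rows held as plain dicts. It is not wombat's implementation: add_parent_columns is a hypothetical name, the genotype values are made up for illustration, and filling the parent columns with None when a parent is unknown or absent is an assumption.

```python
FIELDS = ("gt", "dp", "gq", "ad", "vaf")
VARIANT_KEY = ("CHROM", "POS", "REF", "ALT")


def add_parent_columns(rows, parents):
    """Return copies of rows with father_* and mother_* columns added."""
    # Index every row by (variant, sample) so parents can be looked up quickly.
    index = {
        (tuple(r[k] for k in VARIANT_KEY), r["sample"]): r for r in rows
    }
    annotated = []
    for r in rows:
        variant = tuple(r[k] for k in VARIANT_KEY)
        father_id, mother_id = parents.get(r["sample"], (None, None))
        new_row = dict(r)
        for role, parent_id in (("father", father_id), ("mother", mother_id)):
            parent_row = index.get((variant, parent_id)) if parent_id else None
            for field in FIELDS:
                # Assumption: missing parents yield None in every parent column.
                new_row[f"{role}_{field}"] = (
                    parent_row[f"sample_{field}"] if parent_row else None
                )
        annotated.append(new_row)
    return annotated


def make_row(sample, gt, dp, gq, ad, vaf):
    """Build one illustrative long-format row for a single variant."""
    return {
        "CHROM": "chr1", "POS": "100", "REF": "A", "ALT": "T", "sample": sample,
        "sample_gt": gt, "sample_dp": dp, "sample_gq": gq,
        "sample_ad": ad, "sample_vaf": vaf,
    }


# Illustrative values only, not real data.
rows = [
    make_row("Child1", "0/1", 20, 99, 10, 0.5),
    make_row("Father1", "0/1", 30, 99, 12, 0.4),
    make_row("Mother1", "0/0", 25, 99, 0, 0.0),
]
parents = {
    "Child1": ("Father1", "Mother1"),
    "Father1": (None, None),
    "Mother1": (None, None),
}

annotated = add_parent_columns(rows, parents)
child = annotated[0]
print(child["sample_gt"], child["father_gt"], child["mother_gt"])
```

With the parents' genotypes on the child's row, inheritance checks become simple row-wise comparisons. A variant that is heterozygous in the child and homozygous reference in both parents, for instance, would be a candidate de novo variant.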

Development

This project uses:

  • uv for package management
  • Polars for fast data processing
  • Click for the command-line interface

Testing

Test files are available in the tests/ directory:

  • test.tabulated.tsv - Real bcftools output
  • test_small.tsv - Small example for quick testing
