Skip to main content

No project description provided

Project description

Ferromic ⚙️

Ferromic is a place for multiple bioinformatics tools for VCFs written in Rust. Right now, this includes file merging and region statistics calculation.

VCF Statistics Calculator 📊

Welcome to the VCF Statistics Calculator, a Rust-based tool designed to compute Watterson's Theta (θ) and Pi (π) (and others, coming soon) for genomic regions defined in VCF (Variant Call Format) files.


Features ✨

  • Calculate Genetic Diversity Metrics: Compute Watterson's Theta (θ) and Pi (π) for specified regions.
  • Haplotype Group Analysis: Separate calculations for haplotypes with and without a structural variant class (such as inversions).
  • Flexible Input Handling: Supports configuration via TSV files for multiple regions and haplotype groupings (e.g., by presence of structural variant).
  • Filtering: Filter variants based on genotype quality (GQ) scores and predefined genomic masks.
  • Output: Generates CSV files with statistical metrics for each genomic region.

Background 🧬

This tool processes VCF files to calculate Watterson's Theta (θ) and Pi (π), metrics for understanding genetic diversity within genomic regions. By using a TSV configuration file, users can define multiple regions and categorize haplotypes based on SV (e.g. inversion) statuses. This allows for distinguishing between inverted and non-inverted haplotypes across different regions and samples. Note that if you are using VCFs, all sites which differ between samples must be included.

Metrics:

  • Watterson's Theta (θ): Based on the number of segregating (polymorphic) sites.
  • Pi (π): Measures nucleotide diversity, based on pairwise per-site nucleotide differences.

Installation 🛠️

The easiest way to install the project is to run this command in your terminal:

curl -sSL https://github.com/ScottSauers/ferromic/raw/main/install.sh | bash

Or, you can download the binary from the releases section.

Or, make sure you have Rust and Cargo installed, and you can clone the repository and build the project.


Usage 🚀

Command-Line Arguments

vcf_stats_calculator -v <VCF_FOLDER> \
                    -c <CONFIG_FILE> \
                    -o <OUTPUT_CSV> \
                    --min_gq <MIN_GQ> \
                    --mask_file <MASK_FILE> \
                    -h <CHR> \
                    -r <REGION>

Parameters:

  • -v, --vcf_folder: (Required) Path to the directory containing VCF files.
  • -c, --config_file: (Optional) Path to the TSV configuration file defining regions and haplotype groupings.
  • -o, --output_file: (Optional) Path for the output CSV file containing statistical results. Defaults to output.csv if not specified.
  • --min_gq: (Optional) Minimum genotype quality (GQ) Phred score for filtering variants. Defaults to 30.
  • --mask_file: (Optional) Path to the BED file specifying genomic regions to mask (filter out).
  • -h, --chr: (Optional) Chromosome name to process when not using a config file.
  • -r, --region: (Optional) Specific region to process within the chromosome, in the format start-end (e.g., 10732039-23685112).

Notes:

  • Either --config_file or both --chr and --region must be provided.
  • When using --config_file, the tool can process multiple regions and haplotype groupings as defined in the TSV.
  • When not using a config file, the tool will process the specified chromosome and region and output results to the console.

Input Files

VCF File 🧬

  • Format: VCF v4.2
  • Contents: Variant data including positions, alleles, and genotype information for multiple samples.
  • Genotype Format: Must include GT (genotype) and GQ (genotype quality) fields.

Example:

##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Phred scaled genotype quality computed by whatshap genotyping algorithm.">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 SAMPLE2
chr1 1500 . A T . PASS . GT:GQ 0|0:35 0|1:40

TSV Configuration File 📋

  • Purpose: Defines multiple genomic regions and specifies haplotype groupings based on inversion statuses.
  • Structure:
    • Columns:
      • seqnames: Chromosome name (e.g., chr1).
      • start: Start position of the region.
      • end: End position of the region.
      • Sample Columns: Genotype information for each sample in the format 0|1, 1|0, 0|0, 1|1, etc.

Example:

seqnames	start	end	POS	orig_ID	verdict	categ	NA19434	HG00036	HG00191	...
chr1	13004251	13122531	13113384	chr1-13113384-INV-62181	pass	inv	1|1	1|1	1|1	...

Notes:

  • Genotypes beyond the standard 0|0, 0|1, 1|0, 1|1 (e.g., 0|1_lowconf) will be used only for the "unfiltered" outputs.
  • Haplotype groupings (presence or absence) are determined by the values in the genotype columns, indicating, e.g., inversion (1) or direct (0) haplotypes.

Mask File 🛡️

  • Format: BED
  • Contains: Genomic regions to exclude from analysis (variants within the regions will betreated similarly to variants with low GQ scores).
  • Structure:
    • Three columns: chromosome, start, end.

Example:

chr1	0	77102
chr1	88113	190752
...

Output File 📈

  • Format: CSV

  • Filename: As specified by the --output_file parameter.

  • Headers:

    chr,region_start,region_end,0_sequence_length,1_sequence_length,0_sequence_length_adjusted,1_sequence_length_adjusted,0_segregating_sites,1_segregating_sites,0_w_theta,1_w_theta,0_pi,1_pi,0_segregating_sites_filtered,1_segregating_sites_filtered,0_w_theta_filtered,1_w_theta_filtered,0_pi_filtered,1_pi_filtered,0_num_hap_no_filter,1_num_hap_no_filter,0_num_hap_filter,1_num_hap_filter,inversion_freq_no_filter,inversion_freq_filter
    
  • Column Descriptions:

    • chr: Chromosome name.
    • region_start: Start position of the region.
    • region_end: End position of the region.
    • 0_sequence_length: Total length of the sequence for haplotype group 0.
    • 1_sequence_length: Total length of the sequence for haplotype group 1.
    • 0_sequence_length_adjusted: Adjusted sequence length for haplotype group 0 after filtering.
    • 1_sequence_length_adjusted: Adjusted sequence length for haplotype group 1 after filtering.
    • 0_segregating_sites: Number of segregating sites (unfiltered) for haplotype group 0.
    • 1_segregating_sites: Number of segregating sites (unfiltered) for haplotype group 1.
    • 0_w_theta: Watterson's Theta (unfiltered) for haplotype group 0.
    • 1_w_theta: Watterson's Theta (unfiltered) for haplotype group 1.
    • 0_pi: Pi (unfiltered) for haplotype group 0.
    • 1_pi: Pi (unfiltered) for haplotype group 1.
    • 0_segregating_sites_filtered: Segregating sites for haplotype group 0.
    • 1_segregating_sites_filtered: Segregating sites for haplotype group 1.
    • 0_w_theta_filtered: Watterson's Theta for haplotype group 0.
    • 1_w_theta_filtered: Watterson's Theta for haplotype group 1.
    • 0_pi_filtered: Pi for haplotype group 0.
    • 1_pi_filtered: Pi for haplotype group 1.
    • 0_num_hap_no_filter: Number of haplotypes for group 0 before filtering.
    • 1_num_hap_no_filter: Number of haplotypes for group 1 before filtering.
    • 0_num_hap_filter: Number of haplotypes for group 0.
    • 1_num_hap_filter: Number of haplotypes for group 1.
    • inversion_freq_no_filter: Allele frequency of inversion (1) before filtering.
    • inversion_freq_filter: Allele frequency of inversion (1).
  • Special Values:

    • θ = 0: No segregating sites; no genetic variation observed.
    • θ = Infinity (inf): Insufficient haplotypes or zero-length region; metrics undefined.
    • π = 0: No nucleotide differences.
    • π = Infinity (inf): Insufficient data; metrics undefined.

Filtering Mechanisms 🔍

Genotype Quality (GQ) Filtering

  • Purpose: Exclude variants with low genotype quality.
  • Mechanism:
    • If any sample within a variant has a GQ score below the specified --min_gq threshold, the entire variant is excluded from filtered (but not unfiltered) analyses.
    • Variants passing the GQ filter are included in both unfiltered and filtered analyses.

Genotype Matching

  • Purpose: Only exact genotype matches (0|0, 0|1, 1|0, 1|1) are included in filtered analyses.
  • Mechanism:
    • Genotypes not strictly matching the four expected formats (e.g., 0|1_lowconf) are considered missing data and excluded from filtered analyses.
    • Unfiltered analyses include all genotypes that can be parsed into valid formats based on the first three characters.

Masking

  • Purpose: Exclude entire genomic regions from analysis based on predefined masks.
  • Mechanism:
    • Regions specified in the BED mask file are treated similarly to low GQ variants and excluded from filtered (but not unfiltered) analyses.
    • Sequence lengths are adjusted to account for masked regions in statistic calculations

Common Warnings and Errors ⚠️

  • Missing Samples: If certain samples defined in the configuration file are not found in the VCF, a warning is displayed with the missing samples.

  • Invalid Genotypes: Genotypes not conforming to the expected formats (0|0, 0|1, 1|0, 1|1) will be considered missing data. The number and percentage of invalid genotypes encountered will be shown.

  • Multi-allelic Sites: Multi-allelic variants are not supported.

  • No Variants Found: If no variants are found within the specified region or all variants are filtered out, a warning will be printed.


Examples 🧪

Running the Tool with All Parameters

cargo run --release --bin vcf_stats_calculator \
    -v ../vcfs \
    -c ../config/regions.tsv \
    -o ../results/output_stats.csv \
    --min_gq 30 \
    --mask_file ../masks/hardmask.hg38.v4_acroANDsdOnly.over99.bed

Running the Tool Without a Configuration File

If you prefer to calculate statistics for a specific chromosome and region without using a configuration file, you can run the tool with the --chr and --region flags. Note: In this mode, the results will be printed to the console rather than written to a CSV file.

cargo run --release --bin vcf_stats_calculator \
    -v ../vcfs \
    -c chr22 \
    -r 10732039-23685112 \
    --min_gq 30 \
    --mask_file ../masks/hardmask.hg38.v4_acroANDsdOnly.over99.bed

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ferromic-0.1.0.tar.gz (2.2 MB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

ferromic-0.1.0-cp313-cp313-manylinux_2_39_aarch64.whl (903.1 kB view details)

Uploaded CPython 3.13manylinux: glibc 2.39+ ARM64

ferromic-0.1.0-cp310-cp310-manylinux_2_34_x86_64.whl (210.2 kB view details)

Uploaded CPython 3.10manylinux: glibc 2.34+ x86-64

File details

Details for the file ferromic-0.1.0.tar.gz.

File metadata

  • Download URL: ferromic-0.1.0.tar.gz
  • Upload date:
  • Size: 2.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.9.4

File hashes

Hashes for ferromic-0.1.0.tar.gz
Algorithm Hash digest
SHA256 50c06deb846635458e6ec1da26e97899045b2bf52646557c32ccded81bdf9854
MD5 1f0225d1c1f62964744e248b9bff260a
BLAKE2b-256 81e7561758507334c6eccdabffc64c655036449fe37643e642d4c0f4eaa6f43e

See more details on using hashes here.

File details

Details for the file ferromic-0.1.0-cp313-cp313-manylinux_2_39_aarch64.whl.

File metadata

File hashes

Hashes for ferromic-0.1.0-cp313-cp313-manylinux_2_39_aarch64.whl
Algorithm Hash digest
SHA256 4268bfa76ef50dd706bd549f09487a6ffaa5f22f0eef0ce788a1b265b63a34b0
MD5 3b92ecc291cf984fe34753459588e2e2
BLAKE2b-256 b9e223e871f8fe5b380e8c0436883d96132df2a5bd2f45e938f667659ecd9626

See more details on using hashes here.

File details

Details for the file ferromic-0.1.0-cp310-cp310-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for ferromic-0.1.0-cp310-cp310-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 c6874ca59bf84feeea19bd96d72c3c14b9930a9d2b3856c1b359d008b76a214b
MD5 5abcb6ccff65dfbf2a94bba5e859e674
BLAKE2b-256 3623cadabd3a7c15c1b6ea829df37eead0293bb4e3390aff0b3bb54ec4ee6a58

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page