No project description provided

Project description

Ferromic ⚙️

Ferromic is a place for multiple bioinformatics tools for VCFs written in Rust. Right now, this includes file merging and region statistics calculation.

VCF Statistics Calculator 📊

Welcome to the VCF Statistics Calculator, a Rust-based tool designed to compute Watterson's Theta (θ) and Pi (π) (and others, coming soon) for genomic regions defined in VCF (Variant Call Format) files.

Features ✨

Calculate Genetic Diversity Metrics: Compute Watterson's Theta (θ) and Pi (π) for specified regions.
Haplotype Group Analysis: Separate calculations for haplotypes with and without a structural variant class (such as inversions).
Flexible Input Handling: Supports configuration via TSV files for multiple regions and haplotype groupings (e.g., by presence of structural variant).
Filtering: Filter variants based on genotype quality (GQ) scores and predefined genomic masks.
Output: Generates CSV files with statistical metrics for each genomic region.

Background 🧬

This tool processes VCF files to calculate Watterson's Theta (θ) and Pi (π), metrics for understanding genetic diversity within genomic regions. By using a TSV configuration file, users can define multiple regions and categorize haplotypes based on SV (e.g. inversion) statuses. This allows for distinguishing between inverted and non-inverted haplotypes across different regions and samples. Note that if you are using VCFs, all sites which differ between samples must be included.

Metrics:

Watterson's Theta (θ): Based on the number of segregating (polymorphic) sites.
Pi (π): Measures nucleotide diversity, based on pairwise per-site nucleotide differences.

Installation 🛠️

The easiest way to install the project is to run this command in your terminal:

curl -sSL https://github.com/ScottSauers/ferromic/raw/main/install.sh | bash

Or, you can download the binary from the releases section.

Or, make sure you have Rust and Cargo installed, and you can clone the repository and build the project.

Usage 🚀

Command-Line Arguments

vcf_stats_calculator -v <VCF_FOLDER> \
                    -c <CONFIG_FILE> \
                    -o <OUTPUT_CSV> \
                    --min_gq <MIN_GQ> \
                    --mask_file <MASK_FILE> \
                    -h <CHR> \
                    -r <REGION>

Parameters:

-v, --vcf_folder: (Required) Path to the directory containing VCF files.
-c, --config_file: (Optional) Path to the TSV configuration file defining regions and haplotype groupings.
-o, --output_file: (Optional) Path for the output CSV file containing statistical results. Defaults to output.csv if not specified.
--min_gq: (Optional) Minimum genotype quality (GQ) Phred score for filtering variants. Defaults to 30.
--mask_file: (Optional) Path to the BED file specifying genomic regions to mask (filter out).
-h, --chr: (Optional) Chromosome name to process when not using a config file.
-r, --region: (Optional) Specific region to process within the chromosome, in the format start-end (e.g., 10732039-23685112).

Notes:

Either --config_file or both --chr and --region must be provided.
When using --config_file, the tool can process multiple regions and haplotype groupings as defined in the TSV.
When not using a config file, the tool will process the specified chromosome and region and output results to the console.

Input Files

VCF File 🧬

Format: VCF v4.2
Contents: Variant data including positions, alleles, and genotype information for multiple samples.
Genotype Format: Must include GT (genotype) and GQ (genotype quality) fields.

Example:

##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Phred scaled genotype quality computed by whatshap genotyping algorithm.">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 SAMPLE2
chr1 1500 . A T . PASS . GT:GQ 0|0:35 0|1:40

TSV Configuration File 📋

Purpose: Defines multiple genomic regions and specifies haplotype groupings based on inversion statuses.
Structure:
- Columns:
  - seqnames: Chromosome name (e.g., chr1).
  - start: Start position of the region.
  - end: End position of the region.
  - Sample Columns: Genotype information for each sample in the format 0|1, 1|0, 0|0, 1|1, etc.

Example:

seqnames	start	end	POS	orig_ID	verdict	categ	NA19434	HG00036	HG00191	...
chr1	13004251	13122531	13113384	chr1-13113384-INV-62181	pass	inv	1|1	1|1	1|1	...

Notes:

Genotypes beyond the standard 0|0, 0|1, 1|0, 1|1 (e.g., 0|1_lowconf) will be used only for the "unfiltered" outputs.
Haplotype groupings (presence or absence) are determined by the values in the genotype columns, indicating, e.g., inversion (1) or direct (0) haplotypes.

Mask File 🛡️

Format: BED
Contains: Genomic regions to exclude from analysis (variants within the regions will betreated similarly to variants with low GQ scores).
Structure:
- Three columns: chromosome, start, end.

Example:

chr1	0	77102
chr1	88113	190752
...

Output File 📈

Format: CSV
Filename: As specified by the --output_file parameter.

Headers:

chr,region_start,region_end,0_sequence_length,1_sequence_length,0_sequence_length_adjusted,1_sequence_length_adjusted,0_segregating_sites,1_segregating_sites,0_w_theta,1_w_theta,0_pi,1_pi,0_segregating_sites_filtered,1_segregating_sites_filtered,0_w_theta_filtered,1_w_theta_filtered,0_pi_filtered,1_pi_filtered,0_num_hap_no_filter,1_num_hap_no_filter,0_num_hap_filter,1_num_hap_filter,inversion_freq_no_filter,inversion_freq_filter

Column Descriptions:
- chr: Chromosome name.
- region_start: Start position of the region.
- region_end: End position of the region.
- 0_sequence_length: Total length of the sequence for haplotype group 0.
- 1_sequence_length: Total length of the sequence for haplotype group 1.
- 0_sequence_length_adjusted: Adjusted sequence length for haplotype group 0 after filtering.
- 1_sequence_length_adjusted: Adjusted sequence length for haplotype group 1 after filtering.
- 0_segregating_sites: Number of segregating sites (unfiltered) for haplotype group 0.
- 1_segregating_sites: Number of segregating sites (unfiltered) for haplotype group 1.
- 0_w_theta: Watterson's Theta (unfiltered) for haplotype group 0.
- 1_w_theta: Watterson's Theta (unfiltered) for haplotype group 1.
- 0_pi: Pi (unfiltered) for haplotype group 0.
- 1_pi: Pi (unfiltered) for haplotype group 1.
- 0_segregating_sites_filtered: Segregating sites for haplotype group 0.
- 1_segregating_sites_filtered: Segregating sites for haplotype group 1.
- 0_w_theta_filtered: Watterson's Theta for haplotype group 0.
- 1_w_theta_filtered: Watterson's Theta for haplotype group 1.
- 0_pi_filtered: Pi for haplotype group 0.
- 1_pi_filtered: Pi for haplotype group 1.
- 0_num_hap_no_filter: Number of haplotypes for group 0 before filtering.
- 1_num_hap_no_filter: Number of haplotypes for group 1 before filtering.
- 0_num_hap_filter: Number of haplotypes for group 0.
- 1_num_hap_filter: Number of haplotypes for group 1.
- inversion_freq_no_filter: Allele frequency of inversion (1) before filtering.
- inversion_freq_filter: Allele frequency of inversion (1).
Special Values:
- θ = 0: No segregating sites; no genetic variation observed.
- θ = Infinity (inf): Insufficient haplotypes or zero-length region; metrics undefined.
- π = 0: No nucleotide differences.
- π = Infinity (inf): Insufficient data; metrics undefined.

Filtering Mechanisms 🔍

Genotype Quality (GQ) Filtering

Purpose: Exclude variants with low genotype quality.
Mechanism:
- If any sample within a variant has a GQ score below the specified --min_gq threshold, the entire variant is excluded from filtered (but not unfiltered) analyses.
- Variants passing the GQ filter are included in both unfiltered and filtered analyses.

Genotype Matching

Purpose: Only exact genotype matches (0|0, 0|1, 1|0, 1|1) are included in filtered analyses.
Mechanism:
- Genotypes not strictly matching the four expected formats (e.g., 0|1_lowconf) are considered missing data and excluded from filtered analyses.
- Unfiltered analyses include all genotypes that can be parsed into valid formats based on the first three characters.

Masking

Purpose: Exclude entire genomic regions from analysis based on predefined masks.
Mechanism:
- Regions specified in the BED mask file are treated similarly to low GQ variants and excluded from filtered (but not unfiltered) analyses.
- Sequence lengths are adjusted to account for masked regions in statistic calculations

Common Warnings and Errors ⚠️

Missing Samples: If certain samples defined in the configuration file are not found in the VCF, a warning is displayed with the missing samples.
Invalid Genotypes: Genotypes not conforming to the expected formats (0|0, 0|1, 1|0, 1|1) will be considered missing data. The number and percentage of invalid genotypes encountered will be shown.
Multi-allelic Sites: Multi-allelic variants are not supported.
No Variants Found: If no variants are found within the specified region or all variants are filtered out, a warning will be printed.

Examples 🧪

Running the Tool with All Parameters

cargo run --release --bin vcf_stats_calculator \
    -v ../vcfs \
    -c ../config/regions.tsv \
    -o ../results/output_stats.csv \
    --min_gq 30 \
    --mask_file ../masks/hardmask.hg38.v4_acroANDsdOnly.over99.bed

Running the Tool Without a Configuration File

If you prefer to calculate statistics for a specific chromosome and region without using a configuration file, you can run the tool with the --chr and --region flags. Note: In this mode, the results will be printed to the console rather than written to a CSV file.

cargo run --release --bin vcf_stats_calculator \
    -v ../vcfs \
    -c chr22 \
    -r 10732039-23685112 \
    --min_gq 30 \
    --mask_file ../masks/hardmask.hg38.v4_acroANDsdOnly.over99.bed

Project details

Release history Release notifications | RSS feed

0.1.4

Nov 23, 2025

0.1.3

Sep 30, 2025

0.1.2

Sep 30, 2025

0.1.1

Sep 24, 2025

This version

0.1.0

Feb 9, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ferromic-0.1.0.tar.gz (2.2 MB view details)

Uploaded Sep 24, 2025 Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

ferromic-0.1.0-cp313-cp313-manylinux_2_39_aarch64.whl (903.1 kB view details)

Uploaded Sep 24, 2025 CPython 3.13manylinux: glibc 2.39+ ARM64

ferromic-0.1.0-cp310-cp310-manylinux_2_34_x86_64.whl (210.2 kB view details)

Uploaded Feb 9, 2025 CPython 3.10manylinux: glibc 2.34+ x86-64

File details

Details for the file ferromic-0.1.0.tar.gz.

File metadata

Download URL: ferromic-0.1.0.tar.gz
Upload date: Sep 24, 2025
Size: 2.2 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.9.4

File hashes

Hashes for ferromic-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`50c06deb846635458e6ec1da26e97899045b2bf52646557c32ccded81bdf9854`
MD5	`1f0225d1c1f62964744e248b9bff260a`
BLAKE2b-256	`81e7561758507334c6eccdabffc64c655036449fe37643e642d4c0f4eaa6f43e`

See more details on using hashes here.

File details

Details for the file ferromic-0.1.0-cp313-cp313-manylinux_2_39_aarch64.whl.

File metadata

Download URL: ferromic-0.1.0-cp313-cp313-manylinux_2_39_aarch64.whl
Upload date: Sep 24, 2025
Size: 903.1 kB
Tags: CPython 3.13, manylinux: glibc 2.39+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.9.4

File hashes

Hashes for ferromic-0.1.0-cp313-cp313-manylinux_2_39_aarch64.whl
Algorithm	Hash digest
SHA256	`4268bfa76ef50dd706bd549f09487a6ffaa5f22f0eef0ce788a1b265b63a34b0`
MD5	`3b92ecc291cf984fe34753459588e2e2`
BLAKE2b-256	`b9e223e871f8fe5b380e8c0436883d96132df2a5bd2f45e938f667659ecd9626`

See more details on using hashes here.

File details

Details for the file ferromic-0.1.0-cp310-cp310-manylinux_2_34_x86_64.whl.

File metadata

Download URL: ferromic-0.1.0-cp310-cp310-manylinux_2_34_x86_64.whl
Upload date: Feb 9, 2025
Size: 210.2 kB
Tags: CPython 3.10, manylinux: glibc 2.34+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.8.2

File hashes

Hashes for ferromic-0.1.0-cp310-cp310-manylinux_2_34_x86_64.whl
Algorithm	Hash digest
SHA256	`c6874ca59bf84feeea19bd96d72c3c14b9930a9d2b3856c1b359d008b76a214b`
MD5	`5abcb6ccff65dfbf2a94bba5e859e674`
BLAKE2b-256	`3623cadabd3a7c15c1b6ea829df37eead0293bb4e3390aff0b3bb54ec4ee6a58`

See more details on using hashes here.

ferromic 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project description

Ferromic ⚙️

VCF Statistics Calculator 📊

Features ✨

Background 🧬

Installation 🛠️

Usage 🚀

Command-Line Arguments

Input Files

VCF File 🧬

TSV Configuration File 📋

Mask File 🛡️

Output File 📈

Filtering Mechanisms 🔍

Genotype Quality (GQ) Filtering

Genotype Matching

Masking

Common Warnings and Errors ⚠️

Examples 🧪

Running the Tool with All Parameters

Running the Tool Without a Configuration File

Project details

Verified details

Maintainers

Unverified details

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distributions

File details

File metadata

File hashes

File details

File metadata

File hashes

File details

File metadata

File hashes