No project description provided
Project description
Ferromic ⚙️
Ferromic is a place for multiple bioinformatics tools for VCFs written in Rust. Right now, this includes file merging and region statistics calculation.
VCF Statistics Calculator 📊
Welcome to the VCF Statistics Calculator, a Rust-based tool designed to compute Watterson's Theta (θ) and Pi (π) (and others, coming soon) for genomic regions defined in VCF (Variant Call Format) files.
Features ✨
- Calculate Genetic Diversity Metrics: Compute Watterson's Theta (θ) and Pi (π) for specified regions.
- Haplotype Group Analysis: Separate calculations for haplotypes with and without a structural variant class (such as inversions).
- Flexible Input Handling: Supports configuration via TSV files for multiple regions and haplotype groupings (e.g., by presence of structural variant).
- Filtering: Filter variants based on genotype quality (GQ) scores and predefined genomic masks.
- Output: Generates CSV files with statistical metrics for each genomic region.
Background 🧬
This tool processes VCF files to calculate Watterson's Theta (θ) and Pi (π), metrics for understanding genetic diversity within genomic regions. By using a TSV configuration file, users can define multiple regions and categorize haplotypes based on SV (e.g. inversion) statuses. This allows for distinguishing between inverted and non-inverted haplotypes across different regions and samples. Note that if you are using VCFs, all sites which differ between samples must be included.
Metrics:
- Watterson's Theta (θ): Based on the number of segregating (polymorphic) sites.
- Pi (π): Measures nucleotide diversity, based on pairwise per-site nucleotide differences.
Installation 🛠️
The easiest way to install the project is to run this command in your terminal:
curl -sSL https://github.com/ScottSauers/ferromic/raw/main/install.sh | bash
Or, you can download the binary from the releases section.
Or, make sure you have Rust and Cargo installed, and you can clone the repository and build the project.
Usage 🚀
Command-Line Arguments
vcf_stats_calculator -v <VCF_FOLDER> \
-c <CONFIG_FILE> \
-o <OUTPUT_CSV> \
--min_gq <MIN_GQ> \
--mask_file <MASK_FILE> \
-h <CHR> \
-r <REGION>
Parameters:
-v,--vcf_folder: (Required) Path to the directory containing VCF files.-c,--config_file: (Optional) Path to the TSV configuration file defining regions and haplotype groupings.-o,--output_file: (Optional) Path for the output CSV file containing statistical results. Defaults tooutput.csvif not specified.--min_gq: (Optional) Minimum genotype quality (GQ) Phred score for filtering variants. Defaults to30.--mask_file: (Optional) Path to the BED file specifying genomic regions to mask (filter out).-h,--chr: (Optional) Chromosome name to process when not using a config file.-r,--region: (Optional) Specific region to process within the chromosome, in the formatstart-end(e.g.,10732039-23685112).
Notes:
- Either
--config_fileor both--chrand--regionmust be provided. - When using
--config_file, the tool can process multiple regions and haplotype groupings as defined in the TSV. - When not using a config file, the tool will process the specified chromosome and region and output results to the console.
Input Files
VCF File 🧬
- Format: VCF v4.2
- Contents: Variant data including positions, alleles, and genotype information for multiple samples.
- Genotype Format: Must include
GT(genotype) andGQ(genotype quality) fields.
Example:
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Phred scaled genotype quality computed by whatshap genotyping algorithm.">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 SAMPLE2
chr1 1500 . A T . PASS . GT:GQ 0|0:35 0|1:40
TSV Configuration File 📋
- Purpose: Defines multiple genomic regions and specifies haplotype groupings based on inversion statuses.
- Structure:
- Columns:
seqnames: Chromosome name (e.g.,chr1).start: Start position of the region.end: End position of the region.- Sample Columns: Genotype information for each sample in the format
0|1,1|0,0|0,1|1, etc.
- Columns:
Example:
seqnames start end POS orig_ID verdict categ NA19434 HG00036 HG00191 ...
chr1 13004251 13122531 13113384 chr1-13113384-INV-62181 pass inv 1|1 1|1 1|1 ...
Notes:
- Genotypes beyond the standard
0|0,0|1,1|0,1|1(e.g.,0|1_lowconf) will be used only for the "unfiltered" outputs. - Haplotype groupings (presence or absence) are determined by the values in the genotype columns, indicating, e.g., inversion (
1) or direct (0) haplotypes.
Mask File 🛡️
- Format: BED
- Contains: Genomic regions to exclude from analysis (variants within the regions will betreated similarly to variants with low GQ scores).
- Structure:
- Three columns:
chromosome,start,end.
- Three columns:
Example:
chr1 0 77102
chr1 88113 190752
...
Output File 📈
-
Format: CSV
-
Filename: As specified by the
--output_fileparameter. -
Headers:
chr,region_start,region_end,0_sequence_length,1_sequence_length,0_sequence_length_adjusted,1_sequence_length_adjusted,0_segregating_sites,1_segregating_sites,0_w_theta,1_w_theta,0_pi,1_pi,0_segregating_sites_filtered,1_segregating_sites_filtered,0_w_theta_filtered,1_w_theta_filtered,0_pi_filtered,1_pi_filtered,0_num_hap_no_filter,1_num_hap_no_filter,0_num_hap_filter,1_num_hap_filter,inversion_freq_no_filter,inversion_freq_filter -
Column Descriptions:
chr: Chromosome name.region_start: Start position of the region.region_end: End position of the region.0_sequence_length: Total length of the sequence for haplotype group0.1_sequence_length: Total length of the sequence for haplotype group1.0_sequence_length_adjusted: Adjusted sequence length for haplotype group0after filtering.1_sequence_length_adjusted: Adjusted sequence length for haplotype group1after filtering.0_segregating_sites: Number of segregating sites (unfiltered) for haplotype group0.1_segregating_sites: Number of segregating sites (unfiltered) for haplotype group1.0_w_theta: Watterson's Theta (unfiltered) for haplotype group0.1_w_theta: Watterson's Theta (unfiltered) for haplotype group1.0_pi: Pi (unfiltered) for haplotype group0.1_pi: Pi (unfiltered) for haplotype group1.0_segregating_sites_filtered: Segregating sites for haplotype group0.1_segregating_sites_filtered: Segregating sites for haplotype group1.0_w_theta_filtered: Watterson's Theta for haplotype group0.1_w_theta_filtered: Watterson's Theta for haplotype group1.0_pi_filtered: Pi for haplotype group0.1_pi_filtered: Pi for haplotype group1.0_num_hap_no_filter: Number of haplotypes for group0before filtering.1_num_hap_no_filter: Number of haplotypes for group1before filtering.0_num_hap_filter: Number of haplotypes for group0.1_num_hap_filter: Number of haplotypes for group1.inversion_freq_no_filter: Allele frequency of inversion (1) before filtering.inversion_freq_filter: Allele frequency of inversion (1).
-
Special Values:
θ = 0: No segregating sites; no genetic variation observed.θ = Infinity (inf): Insufficient haplotypes or zero-length region; metrics undefined.π = 0: No nucleotide differences.π = Infinity (inf): Insufficient data; metrics undefined.
Filtering Mechanisms 🔍
Genotype Quality (GQ) Filtering
- Purpose: Exclude variants with low genotype quality.
- Mechanism:
- If any sample within a variant has a GQ score below the specified
--min_gqthreshold, the entire variant is excluded from filtered (but not unfiltered) analyses. - Variants passing the GQ filter are included in both unfiltered and filtered analyses.
- If any sample within a variant has a GQ score below the specified
Genotype Matching
- Purpose: Only exact genotype matches (
0|0,0|1,1|0,1|1) are included in filtered analyses. - Mechanism:
- Genotypes not strictly matching the four expected formats (e.g.,
0|1_lowconf) are considered missing data and excluded from filtered analyses. - Unfiltered analyses include all genotypes that can be parsed into valid formats based on the first three characters.
- Genotypes not strictly matching the four expected formats (e.g.,
Masking
- Purpose: Exclude entire genomic regions from analysis based on predefined masks.
- Mechanism:
- Regions specified in the BED mask file are treated similarly to low GQ variants and excluded from filtered (but not unfiltered) analyses.
- Sequence lengths are adjusted to account for masked regions in statistic calculations
Common Warnings and Errors ⚠️
-
Missing Samples: If certain samples defined in the configuration file are not found in the VCF, a warning is displayed with the missing samples.
-
Invalid Genotypes: Genotypes not conforming to the expected formats (
0|0,0|1,1|0,1|1) will be considered missing data. The number and percentage of invalid genotypes encountered will be shown. -
Multi-allelic Sites: Multi-allelic variants are not supported.
-
No Variants Found: If no variants are found within the specified region or all variants are filtered out, a warning will be printed.
Examples 🧪
Running the Tool with All Parameters
cargo run --release --bin vcf_stats_calculator \
-v ../vcfs \
-c ../config/regions.tsv \
-o ../results/output_stats.csv \
--min_gq 30 \
--mask_file ../masks/hardmask.hg38.v4_acroANDsdOnly.over99.bed
Running the Tool Without a Configuration File
If you prefer to calculate statistics for a specific chromosome and region without using a configuration file, you can run the tool with the --chr and --region flags. Note: In this mode, the results will be printed to the console rather than written to a CSV file.
cargo run --release --bin vcf_stats_calculator \
-v ../vcfs \
-c chr22 \
-r 10732039-23685112 \
--min_gq 30 \
--mask_file ../masks/hardmask.hg38.v4_acroANDsdOnly.over99.bed
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distributions
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file ferromic-0.1.0.tar.gz.
File metadata
- Download URL: ferromic-0.1.0.tar.gz
- Upload date:
- Size: 2.2 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.9.4
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
50c06deb846635458e6ec1da26e97899045b2bf52646557c32ccded81bdf9854
|
|
| MD5 |
1f0225d1c1f62964744e248b9bff260a
|
|
| BLAKE2b-256 |
81e7561758507334c6eccdabffc64c655036449fe37643e642d4c0f4eaa6f43e
|
File details
Details for the file ferromic-0.1.0-cp313-cp313-manylinux_2_39_aarch64.whl.
File metadata
- Download URL: ferromic-0.1.0-cp313-cp313-manylinux_2_39_aarch64.whl
- Upload date:
- Size: 903.1 kB
- Tags: CPython 3.13, manylinux: glibc 2.39+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.9.4
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
4268bfa76ef50dd706bd549f09487a6ffaa5f22f0eef0ce788a1b265b63a34b0
|
|
| MD5 |
3b92ecc291cf984fe34753459588e2e2
|
|
| BLAKE2b-256 |
b9e223e871f8fe5b380e8c0436883d96132df2a5bd2f45e938f667659ecd9626
|
File details
Details for the file ferromic-0.1.0-cp310-cp310-manylinux_2_34_x86_64.whl.
File metadata
- Download URL: ferromic-0.1.0-cp310-cp310-manylinux_2_34_x86_64.whl
- Upload date:
- Size: 210.2 kB
- Tags: CPython 3.10, manylinux: glibc 2.34+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.8.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c6874ca59bf84feeea19bd96d72c3c14b9930a9d2b3856c1b359d008b76a214b
|
|
| MD5 |
5abcb6ccff65dfbf2a94bba5e859e674
|
|
| BLAKE2b-256 |
3623cadabd3a7c15c1b6ea829df37eead0293bb4e3390aff0b3bb54ec4ee6a58
|