Ultima Genomics filtering scripts
ugbio_filtering
This module provides Python filtering scripts for bioinformatics pipelines, including tools for variant filtering, model training, and quality control of variant calls.
Installation
To install the filtering module with all dependencies:
pip install ugbio_filtering
The tool, with all required dependencies, is also available as a Docker image: ultimagenomics/ugbio_filtering
CLI Scripts
The filtering module provides several command-line tools for different stages of the variant filtering pipeline:
Variant Filtering
filter_variants_pipeline
Applies machine learning-based filtering to VCF files after GATK variant calling.
Purpose: Filter variants using trained models, blacklists, and custom annotation bed files to improve callset quality.
Usage:
filter_variants_pipeline \
--input_file input.vcf.gz \
--output_file filtered.vcf.gz \
--model_file model.pkl \
[--custom_annotations ANNOTATION1 ANNOTATION2] \
[--decision_threshold 30] \
[--treat_multiallelics --ref_fasta reference.fa] \
[--recalibrate_genotype] \
[--overwrite_qual_tag] \
[--limit_to_contigs chr1 chr2]
Key Parameters:
- --input_file: Input VCF file (requires .tbi index)
- --output_file: Output VCF file with filtering annotations
- --model_file: Pickle file containing trained XGBoost model and transformer
- --decision_threshold: Score threshold for filtering variants (default: 30)
- --treat_multiallelics: Apply special handling for multiallelic sites
- --recalibrate_genotype: Allow model to re-call genotypes
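When many samples are filtered from a Python driver script, the invocation above can be assembled programmatically. The following is a minimal sketch under stated assumptions: `has_tabix_index` and `build_filter_command` are hypothetical helper names, not part of ugbio_filtering, and the rule that `--treat_multiallelics` needs `--ref_fasta` is inferred only from the way the usage groups the two flags. The sketch builds an argument list and checks for the `.tbi` index; it does not run the tool.

```python
from pathlib import Path
from typing import List, Optional


def has_tabix_index(vcf_path: str) -> bool:
    """Return True if '<vcf_path>.tbi' exists next to the input VCF."""
    return Path(vcf_path + ".tbi").is_file()


def build_filter_command(
    input_file: str,
    output_file: str,
    model_file: str,
    decision_threshold: Optional[int] = None,
    treat_multiallelics: bool = False,
    ref_fasta: Optional[str] = None,
    recalibrate_genotype: bool = False,
    limit_to_contigs: Optional[List[str]] = None,
) -> List[str]:
    """Build the argument list for filter_variants_pipeline (hypothetical helper)."""
    cmd = [
        "filter_variants_pipeline",
        "--input_file", input_file,
        "--output_file", output_file,
        "--model_file", model_file,
    ]
    if decision_threshold is not None:
        cmd += ["--decision_threshold", str(decision_threshold)]
    if treat_multiallelics:
        # Assumption: the usage groups these two flags, so require both.
        if ref_fasta is None:
            raise ValueError("treat_multiallelics is shown together with --ref_fasta")
        cmd += ["--treat_multiallelics", "--ref_fasta", ref_fasta]
    if recalibrate_genotype:
        cmd.append("--recalibrate_genotype")
    if limit_to_contigs:
        cmd += ["--limit_to_contigs", *limit_to_contigs]
    return cmd
```

Once the tool is installed, the returned list can be passed to `subprocess.run(cmd, check=True)`; calling `has_tabix_index` first gives an early, readable error when the index is missing.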
filter_low_af_ratio_to_background
Filters somatic variants based on allele frequency ratio to background.
Purpose: Remove variants with low AF ratio in GT ALT alleles compared to background, useful for somatic variant calling.
Usage:
filter_low_af_ratio_to_background \
input.vcf.gz \
output.vcf.gz \
[--af_ratio_threshold 10] \
[--af_ratio_threshold_h_indels 0] \
[--tumor_vaf_threshold_h_indels 0] \
[--new_filter LowAFRatioToBackground]
Key Parameters:
- input.vcf.gz: Input VCF file
- output.vcf.gz: Output VCF file
- --af_ratio_threshold: AF ratio threshold for SNPs and non-h-indels (default: 10)
- --af_ratio_threshold_h_indels: AF ratio threshold for h-indels (default: 0)
- --tumor_vaf_threshold_h_indels: Tumor VAF threshold for h-indel filtering (default: 0)
- --new_filter: Name of the FILTER tag to add (default: LowAFRatioToBackground)
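To make the thresholding idea concrete, the sketch below shows one plausible reading of "AF ratio to background": the tumor allele frequency divided by the background allele frequency, with a variant flagged when the ratio falls below the threshold. This is an assumption for illustration only. The function names, the zero-background convention, and the formula itself are not taken from the tool, and the separate h-indel thresholds are not modeled.

```python
import math


def af_ratio(tumor_af: float, background_af: float) -> float:
    """Tumor AF divided by background AF.

    Assumed convention: with zero background, any non-zero tumor AF
    gives an infinite ratio, and a zero tumor AF gives 0.0.
    """
    if background_af == 0:
        return math.inf if tumor_af > 0 else 0.0
    return tumor_af / background_af


def is_low_af_ratio(tumor_af: float, background_af: float, threshold: float = 10.0) -> bool:
    """Return True when the ratio is below the threshold (the variant would be flagged)."""
    return af_ratio(tumor_af, background_af) < threshold
```

Under this reading, a variant at 20% in the tumor and 5% in the background has a ratio of about 4 and would receive the filter tag at the default threshold of 10.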
Model Training
train_models_pipeline
Trains machine learning models for variant filtering using prepared ground truth data.
Purpose: Train XGBoost models on labeled training data to distinguish true variants from false positives.
Usage:
train_models_pipeline \
--train_dfs train1.h5 train2.h5 \
--test_dfs test1.h5 test2.h5 \
--output_file_prefix model_output \
[--gt_type exact|approximate] \
[--vcf_type single_sample|deep_variant|cnv] \
[--custom_annotations ANNOTATION1 ANNOTATION2] \
[--verbosity INFO]
Key Parameters:
- --train_dfs: Training HDF5 files (output from training_prep_pipeline)
- --test_dfs: Test HDF5 files for model evaluation
- --output_file_prefix: Prefix for output .pkl (model) and .h5 (results) files
- --gt_type: Ground truth type - "exact" or "approximate" (default: exact)
- --vcf_type: VCF type - "single_sample", "deep_variant", or "cnv" (default: single_sample)
- --custom_annotations: Additional INFO annotations to include in training
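A wrapper script can validate the enumerated options and predict the output file names before launching a long training run. The sketch below assumes the outputs are named by appending `.pkl` and `.h5` directly to the prefix, which matches the workflow example in this README but is not otherwise confirmed. `expected_outputs` and `build_train_command` are hypothetical helpers, not part of the package.

```python
from typing import Dict, List, Optional

GT_TYPES = ("exact", "approximate")
VCF_TYPES = ("single_sample", "deep_variant", "cnv")


def expected_outputs(output_file_prefix: str) -> Dict[str, str]:
    """Assumed output names: '<prefix>.pkl' (model) and '<prefix>.h5' (results)."""
    return {
        "model": output_file_prefix + ".pkl",
        "results": output_file_prefix + ".h5",
    }


def build_train_command(
    train_dfs: List[str],
    test_dfs: List[str],
    output_file_prefix: str,
    gt_type: str = "exact",
    vcf_type: str = "single_sample",
    custom_annotations: Optional[List[str]] = None,
) -> List[str]:
    """Build the argument list for train_models_pipeline (hypothetical helper)."""
    if gt_type not in GT_TYPES:
        raise ValueError(f"gt_type must be one of {GT_TYPES}, got {gt_type!r}")
    if vcf_type not in VCF_TYPES:
        raise ValueError(f"vcf_type must be one of {VCF_TYPES}, got {vcf_type!r}")
    cmd = ["train_models_pipeline", "--train_dfs", *train_dfs, "--test_dfs", *test_dfs]
    cmd += ["--output_file_prefix", output_file_prefix]
    cmd += ["--gt_type", gt_type, "--vcf_type", vcf_type]
    if custom_annotations:
        cmd += ["--custom_annotations", *custom_annotations]
    return cmd
```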
training_prep_pipeline
Prepares training data by comparing variant calls to ground truth.
Purpose: Generate labeled training datasets (true positives, false positives, false negatives) for model training.
Usage:
training_prep_pipeline \
--call_vcf calls.vcf.gz \
--gt_type exact|approximate \
--output_prefix training_data \
[--base_vcf truth.vcf.gz] \
[--reference reference.fa] \
[--reference_sdf reference.sdf] \
[--hcr high_confidence.bed] \
[--blacklist blacklist.pkl] \
[--custom_annotations ANNOTATION1 ANNOTATION2] \
[--contigs_to_read chr1 chr2] \
[--contig_for_test chr3] \
[--ignore_genotype] \
[--verbosity INFO]
Key Parameters:
- --call_vcf: VCF file with variant calls to evaluate
- --gt_type: Ground truth type - "exact" (requires base_vcf) or "approximate"
- --base_vcf: Truth VCF file (required for exact ground truth)
- --reference: Reference FASTA file prefix (requires .fai index)
- --reference_sdf: Reference SDF folder (RTG format). If not provided, uses <reference>.sdf
- --hcr: High confidence regions BED file
- --output_prefix: Prefix for output HDF5 files (train and test sets)
- --contig_for_test: Chromosome to use as test set
- --ignore_genotype: Ignore genotype when comparing to ground truth
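Two of the rules above lend themselves to a pre-flight check: exact ground truth requires `--base_vcf`, and `--reference_sdf` falls back to `<reference>.sdf`. The sketch below encodes both. `resolve_prep_args` is a hypothetical helper, and the `_train.h5` and `_test.h5` output suffixes are inferred from the workflow example in this README rather than documented.

```python
from typing import Dict, Optional


def resolve_prep_args(
    call_vcf: str,
    gt_type: str,
    output_prefix: str,
    base_vcf: Optional[str] = None,
    reference: Optional[str] = None,
    reference_sdf: Optional[str] = None,
) -> Dict[str, Optional[str]]:
    """Validate training_prep_pipeline arguments and fill in documented defaults."""
    if gt_type not in ("exact", "approximate"):
        raise ValueError(f"gt_type must be 'exact' or 'approximate', got {gt_type!r}")
    if gt_type == "exact" and base_vcf is None:
        raise ValueError("gt_type 'exact' requires base_vcf")
    if reference_sdf is None and reference is not None:
        # Documented fallback: <reference>.sdf
        reference_sdf = reference + ".sdf"
    return {
        "call_vcf": call_vcf,
        "gt_type": gt_type,
        "base_vcf": base_vcf,
        "reference": reference,
        "reference_sdf": reference_sdf,
        # Assumed names, inferred from the workflow example.
        "train_output": output_prefix + "_train.h5",
        "test_output": output_prefix + "_test.h5",
    }
```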
training_prep_cnv_pipeline
Prepares training data specifically for CNV filtering models.
Purpose: Generate labeled CNV training datasets by comparing CNV calls to truth set.
Usage:
training_prep_cnv_pipeline \
--call_vcf cnv_calls.vcf.gz \
--base_vcf cnv_truth.vcf.gz \
--output_prefix cnv_training_data \
[--hcr high_confidence.bed] \
[--custom_annotations ANNOTATION1 ANNOTATION2] \
[--train_fraction 0.25] \
[--ignore_cnv_type] \
[--skip_collapse] \
[--verbosity INFO]
Key Parameters:
- --call_vcf: CNV call VCF file
- --base_vcf: CNV truth VCF file
- --output_prefix: Prefix for output HDF5 files
- --train_fraction: Fraction of CNVs for training, rest for testing (default: 0.25)
- --ignore_cnv_type: Ignore CNV type when matching to truth
- --skip_collapse: Skip collapsing variants before comparison
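The effect of `--train_fraction` can be illustrated with a simple shuffled split. This sketch is not the tool's implementation: the actual pipeline may assign records differently (for example, with a per-record random draw), and `split_train_test` and its `seed` parameter are hypothetical names used only for illustration.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_test(
    items: Sequence[T], train_fraction: float = 0.25, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle the items and put the first `train_fraction` of them in the training set."""
    if not 0.0 <= train_fraction <= 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]
```

With the default fraction of 0.25, 100 CNV records would yield 25 for training and 75 for testing.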
Typical Workflows
Training and Applying a Filtering Model
1. Prepare training data:
training_prep_pipeline \
--call_vcf calls.vcf.gz \
--base_vcf truth.vcf.gz \
--gt_type exact \
--reference ref.fa \
--hcr hcr.bed \
--output_prefix training
2. Train the model:
train_models_pipeline \
--train_dfs training_train.h5 \
--test_dfs training_test.h5 \
--output_file_prefix model
3. Apply filtering:
filter_variants_pipeline \
--input_file new_calls.vcf.gz \
--model_file model.pkl \
--output_file filtered_calls.vcf.gz
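The three steps can be chained from a single Python script. The sketch below is one way to do so, assuming the intermediate file names follow the pattern shown in this workflow (`<prefix>_train.h5`, `<prefix>_test.h5`, `<prefix>.pkl`). `workflow_commands` and `run_workflow` are hypothetical helpers, and the default `dry_run=True` only prints the commands so they can be reviewed before anything is executed.

```python
import shlex
import subprocess
from typing import List


def workflow_commands(
    call_vcf: str,
    truth_vcf: str,
    reference: str,
    hcr: str,
    new_calls: str,
    filtered: str,
    data_prefix: str = "training",
    model_prefix: str = "model",
) -> List[List[str]]:
    """Return the prepare, train, and apply commands in order."""
    prep = [
        "training_prep_pipeline",
        "--call_vcf", call_vcf,
        "--base_vcf", truth_vcf,
        "--gt_type", "exact",
        "--reference", reference,
        "--hcr", hcr,
        "--output_prefix", data_prefix,
    ]
    train = [
        "train_models_pipeline",
        "--train_dfs", data_prefix + "_train.h5",  # assumed output name of step 1
        "--test_dfs", data_prefix + "_test.h5",    # assumed output name of step 1
        "--output_file_prefix", model_prefix,
    ]
    apply_filter = [
        "filter_variants_pipeline",
        "--input_file", new_calls,
        "--model_file", model_prefix + ".pkl",     # assumed output name of step 2
        "--output_file", filtered,
    ]
    return [prep, train, apply_filter]


def run_workflow(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command, and run it unless dry_run is set; stop at the first failure."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Using `check=True` makes a failed step raise `subprocess.CalledProcessError`, so a later step never runs on missing or partial inputs.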
Dependencies
The filtering module depends on the following external tools: bcftools, samtools, GATK, RTG Tools, and Picard.
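Outside the Docker image, it can help to confirm that these tools are on the `PATH` before starting a pipeline. The sketch below uses the standard-library `shutil.which`. The executable names in `EXTERNAL_TOOLS` (`gatk`, `rtg`, `picard`) are assumptions about a typical installation and may differ on a given system; `find_missing_tools` is a hypothetical helper.

```python
import shutil
from typing import Dict, List

# Display name -> executable name. The executable names are assumptions;
# adjust them to match the local installation.
EXTERNAL_TOOLS: Dict[str, str] = {
    "bcftools": "bcftools",
    "samtools": "samtools",
    "GATK": "gatk",
    "RTG Tools": "rtg",
    "Picard": "picard",
}


def find_missing_tools(tools: Dict[str, str] = EXTERNAL_TOOLS) -> List[str]:
    """Return the display names of tools whose executable is not found on PATH."""
    return [name for name, exe in tools.items() if shutil.which(exe) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing external tools: " + ", ".join(missing))
    else:
        print("All external tools found.")
```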
File details
Details for the file ugbio_filtering-1.24.1.tar.gz.
File metadata
- Download URL: ugbio_filtering-1.24.1.tar.gz
- Upload date:
- Size: 55.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c8a3d014bb81feecc8b927037acf715604f806e2ceefc18b6f698f74e0df0332 |
| MD5 | 94dc7bad911855eab92a0efdd1f72f00 |
| BLAKE2b-256 | 4775ed3fd307c94daeb0e1a275be7e0ece8dc078b76e5e481e85cb2b2080a845 |
File details
Details for the file ugbio_filtering-1.24.1-py3-none-any.whl.
File metadata
- Download URL: ugbio_filtering-1.24.1-py3-none-any.whl
- Upload date:
- Size: 71.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b86986f4218e52d820640132c82657d57369dbdcb7434403a18558e789812ee5 |
| MD5 | e75487363b3cfc39cdef6f5e5474405f |
| BLAKE2b-256 | 9f5a74906485a19dc9eebd60030af11a032bf6df391569b8403350585a1e763d |