Skip to main content

Ultima Genomics filtering scripts

Project description

ugbio_filtering

This module includes filtering python scripts for bioinformatics pipelines. It provides tools for variant filtering, model training, and quality control of variant calls.

Installation

To install the filtering module with all dependencies:

pip install ugbio_filtering

The tool with all required dependencies can be also used in docker image: ultimagenomics/ugbio_filtering

CLI Scripts

The filtering module provides several command-line tools for different stages of the variant filtering pipeline:

Variant Filtering

filter_variants_pipeline

Applies machine learning-based filtering to VCF files after GATK variant calling.

Purpose: Filter variants using trained models, blacklists, and custom annotation bed files to improve callset quality.

Usage:

filter_variants_pipeline \
  --input_file input.vcf.gz \
  --output_file filtered.vcf.gz \
  --model_file model.pkl \
  [--custom_annotations ANNOTATION1 ANNOTATION2] \
  [--decision_threshold 30] \
  [--treat_multiallelics --ref_fasta reference.fa] \
  [--recalibrate_genotype] \
  [--overwrite_qual_tag] \
  [--limit_to_contigs chr1 chr2]

Key Parameters:

  • --input_file: Input VCF file (requires .tbi index)
  • --output_file: Output VCF file with filtering annotations
  • --model_file: Pickle file containing trained XGBoost model and transformer
  • --decision_threshold: Score threshold for filtering variants (default: 30)
  • --treat_multiallelics: Apply special handling for multiallelic sites
  • --recalibrate_genotype: Allow model to re-call genotypes

filter_low_af_ratio_to_background

Filters somatic variants based on allele frequency ratio to background.

Purpose: Remove variants with low AF ratio in GT ALT alleles compared to background, useful for somatic variant calling.

Usage:

filter_low_af_ratio_to_background \
  input.vcf.gz \
  output.vcf.gz \
  [--af_ratio_threshold 10] \
  [--af_ratio_threshold_h_indels 0] \
  [--tumor_vaf_threshold_h_indels 0] \
  [--new_filter LowAFRatioToBackground]

Key Parameters:

  • input.vcf.gz: Input VCF file
  • output.vcf.gz: Output VCF file
  • --af_ratio_threshold: AF ratio threshold for SNPs and non-h-indels (default: 10)
  • --af_ratio_threshold_h_indels: AF ratio threshold for h-indels (default: 0)
  • --tumor_vaf_threshold_h_indels: Tumor VAF threshold for h-indel filtering (default: 0)
  • --new_filter: Name of the FILTER tag to add (default: LowAFRatioToBackground)

Model Training

train_models_pipeline

Trains machine learning models for variant filtering using prepared ground truth data.

Purpose: Train XGBoost models on labeled training data to distinguish true variants from false positives.

Usage:

train_models_pipeline \
  --train_dfs train1.h5 train2.h5 \
  --test_dfs test1.h5 test2.h5 \
  --output_file_prefix model_output \
  [--gt_type exact|approximate] \
  [--vcf_type single_sample|deep_variant|cnv] \
  [--custom_annotations ANNOTATION1 ANNOTATION2] \
  [--verbosity INFO]

Key Parameters:

  • --train_dfs: Training HDF5 files (output from training_prep_pipeline)
  • --test_dfs: Test HDF5 files for model evaluation
  • --output_file_prefix: Prefix for output .pkl (model) and .h5 (results) files
  • --gt_type: Ground truth type - "exact" or "approximate" (default: exact)
  • --vcf_type: VCF type - "single_sample", "deep_variant", or "cnv" (default: single_sample)
  • --custom_annotations: Additional INFO annotations to include in training

training_prep_pipeline

Prepares training data by comparing variant calls to ground truth.

Purpose: Generate labeled training datasets (true positives, false positives, false negatives) for model training.

Usage:

training_prep_pipeline \
  --call_vcf calls.vcf.gz \
  --gt_type exact|approximate \
  --output_prefix training_data \
  [--base_vcf truth.vcf.gz] \
  [--reference reference.fa] \
  [--reference_sdf reference.sdf] \
  [--hcr high_confidence.bed] \
  [--blacklist blacklist.pkl] \
  [--custom_annotations ANNOTATION1 ANNOTATION2] \
  [--contigs_to_read chr1 chr2] \
  [--contig_for_test chr3] \
  [--ignore_genotype] \
  [--verbosity INFO]

Key Parameters:

  • --call_vcf: VCF file with variant calls to evaluate
  • --gt_type: Ground truth type - "exact" (requires base_vcf) or "approximate"
  • --base_vcf: Truth VCF file (required for exact ground truth)
  • --reference: Reference FASTA file prefix (requires .fai index)
  • --reference_sdf: Reference SDF folder (RTG format). If not provided, uses <reference>.sdf
  • --hcr: High confidence regions BED file
  • --output_prefix: Prefix for output HDF5 files (train and test sets)
  • --contig_for_test: Chromosome to use as test set
  • --ignore_genotype: Ignore genotype when comparing to ground truth

training_prep_cnv_pipeline

Prepares training data specifically for CNV filtering models.

Purpose: Generate labeled CNV training datasets by comparing CNV calls to truth set.

Usage:

training_prep_cnv_pipeline \
  --call_vcf cnv_calls.vcf.gz \
  --base_vcf cnv_truth.vcf.gz \
  --output_prefix cnv_training_data \
  [--hcr high_confidence.bed] \
  [--custom_annotations ANNOTATION1 ANNOTATION2] \
  [--train_fraction 0.25] \
  [--ignore_cnv_type] \
  [--skip_collapse] \
  [--verbosity INFO]

Key Parameters:

  • --call_vcf: CNV call VCF file
  • --base_vcf: CNV truth VCF file
  • --output_prefix: Prefix for output HDF5 files
  • --train_fraction: Fraction of CNVs for training, rest for testing (default: 0.25)
  • --ignore_cnv_type: Ignore CNV type when matching to truth
  • --skip_collapse: Skip collapsing variants before comparison

Typical Workflows

Training and Applying a Filtering Model

  1. Prepare training data:

    training_prep_pipeline \
      --call_vcf calls.vcf.gz \
      --base_vcf truth.vcf.gz \
      --gt_type exact \
      --reference ref.fa \
      --hcr hcr.bed \
      --output_prefix training
    
  2. Train the model:

    train_models_pipeline \
      --train_dfs training_train.h5 \
      --test_dfs training_test.h5 \
      --output_file_prefix model
    
  3. Apply filtering:

    filter_variants_pipeline \
      --input_file new_calls.vcf.gz \
      --model_file model.pkl \
      --output_file filtered_calls.vcf.gz
    

Dependencies

The filtering module depends on the following external tools: bcftools, samtools, GATK. RTG tools and picard

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ugbio_filtering-1.22.2.tar.gz (55.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ugbio_filtering-1.22.2-py3-none-any.whl (71.8 kB view details)

Uploaded Python 3

File details

Details for the file ugbio_filtering-1.22.2.tar.gz.

File metadata

  • Download URL: ugbio_filtering-1.22.2.tar.gz
  • Upload date:
  • Size: 55.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ugbio_filtering-1.22.2.tar.gz
Algorithm Hash digest
SHA256 a79bcb479ef82122a3525a8445c8a77a635f253e62ae41603def5af10be27b7d
MD5 b970a8e62e9fbfa7795da4e0150f1023
BLAKE2b-256 991eff8954ae9520a934fdf11e787b3add89d29f52e1c95e675986450cbcd3a2

See more details on using hashes here.

File details

Details for the file ugbio_filtering-1.22.2-py3-none-any.whl.

File metadata

File hashes

Hashes for ugbio_filtering-1.22.2-py3-none-any.whl
Algorithm Hash digest
SHA256 d5e06087d549f1924d2abb83701bab9f30d1fc8d42e45a4f11ebb2bbf6f0ab5f
MD5 7b5ab8e4b919dd30d4966f3b77fa652a
BLAKE2b-256 c74d3b9d40964060eefae217468490993587f3a7cfb34c5fec49babc6123d41a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page