Ultima Genomics filtering scripts

ugbio_filtering

This module contains Python filtering scripts for bioinformatics pipelines. It provides tools for variant filtering, model training, and quality control of variant calls.

Installation

To install the filtering module with all dependencies:

pip install ugbio_filtering

The tool, with all required dependencies, is also available as a Docker image: ultimagenomics/ugbio_filtering

CLI Scripts

The filtering module provides several command-line tools for different stages of the variant filtering pipeline:

Variant Filtering

`filter_variants_pipeline`

Applies machine learning-based filtering to VCF files after GATK variant calling.

Purpose: Filter variants using trained models, blacklists, and custom annotation BED files to improve callset quality.

Usage:

filter_variants_pipeline \
  --input_file input.vcf.gz \
  --output_file filtered.vcf.gz \
  --model_file model.pkl \
  [--custom_annotations ANNOTATION1 ANNOTATION2] \
  [--decision_threshold 30] \
  [--treat_multiallelics --ref_fasta reference.fa] \
  [--recalibrate_genotype] \
  [--overwrite_qual_tag] \
  [--limit_to_contigs chr1 chr2]

Key Parameters:

--input_file: Input VCF file (requires .tbi index)
--output_file: Output VCF file with filtering annotations
--model_file: Pickle file containing trained XGBoost model and transformer
--decision_threshold: Score threshold for filtering variants (default: 30)
--treat_multiallelics: Apply special handling for multiallelic sites
--recalibrate_genotype: Allow model to re-call genotypes
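
Because the input VCF must be accompanied by a .tbi index, a small pre-flight check can fail fast before the pipeline starts. The sketch below is an illustration only and is not part of ugbio_filtering: the helper name `missing_inputs` is hypothetical, and it assumes the index sits next to the VCF as `<input>.tbi` (the usual tabix naming).

```python
from pathlib import Path


def missing_inputs(input_vcf: str, model_file: str) -> list[str]:
    """Return the required paths that do not exist (hypothetical helper)."""
    required = [
        Path(input_vcf),
        Path(input_vcf + ".tbi"),  # assumed index location: <input>.tbi
        Path(model_file),
    ]
    return [str(p) for p in required if not p.exists()]


if __name__ == "__main__":
    missing = missing_inputs("input.vcf.gz", "model.pkl")
    if missing:
        print("Missing inputs:", ", ".join(missing))
    else:
        print("All inputs found; ready to run filter_variants_pipeline.")
```

An empty list means the VCF, its index, and the model file were all found.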

`filter_low_af_ratio_to_background`

Filters somatic variants based on allele frequency ratio to background.

Purpose: Remove variants whose genotyped (GT) ALT alleles have a low AF ratio relative to background; useful for somatic variant calling.

Usage:

filter_low_af_ratio_to_background \
  input.vcf.gz \
  output.vcf.gz \
  [--af_ratio_threshold 10] \
  [--af_ratio_threshold_h_indels 0] \
  [--tumor_vaf_threshold_h_indels 0] \
  [--new_filter LowAFRatioToBackground]

Key Parameters:

input.vcf.gz: Input VCF file
output.vcf.gz: Output VCF file
--af_ratio_threshold: AF ratio threshold for SNPs and non-h-indels (default: 10)
--af_ratio_threshold_h_indels: AF ratio threshold for h-indels (default: 0)
--tumor_vaf_threshold_h_indels: Tumor VAF threshold for h-indel filtering (default: 0)
--new_filter: Name of the FILTER tag to add (default: LowAFRatioToBackground)
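
To make the thresholds concrete, here is a minimal sketch of how an AF-ratio rule of this kind could behave. It is an assumption, not the tool's implementation: it takes the ratio to be the sample AF divided by the background AF and tags a variant when that ratio falls below the threshold. The function names and the handling of a zero background AF are hypothetical.

```python
import math


def af_ratio(sample_af: float, background_af: float) -> float:
    """Ratio of the sample AF to the background AF (assumed definition)."""
    if background_af == 0:
        return math.inf  # no background signal: treat the ratio as unbounded
    return sample_af / background_af


def fails_af_ratio(sample_af: float, background_af: float,
                   threshold: float = 10.0) -> bool:
    """True if the variant would be tagged under the assumed rule."""
    return af_ratio(sample_af, background_af) < threshold


if __name__ == "__main__":
    print(fails_af_ratio(0.05, 0.01))                 # ratio of about 5
    print(fails_af_ratio(0.30, 0.01))                 # ratio of about 30
    print(fails_af_ratio(0.01, 0.50, threshold=0.0))  # threshold of 0
```

Under this reading, a threshold of 0 never tags anything, since a ratio of non-negative AFs cannot be below 0. That would be consistent with the h-indel defaults of 0, but the source does not confirm it.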

Model Training

`train_models_pipeline`

Trains machine learning models for variant filtering using prepared ground truth data.

Purpose: Train XGBoost models on labeled training data to distinguish true variants from false positives.

Usage:

train_models_pipeline \
  --train_dfs train1.h5 train2.h5 \
  --test_dfs test1.h5 test2.h5 \
  --output_file_prefix model_output \
  [--gt_type exact|approximate] \
  [--vcf_type single_sample|deep_variant|cnv] \
  [--custom_annotations ANNOTATION1 ANNOTATION2] \
  [--verbosity INFO]

Key Parameters:

--train_dfs: Training HDF5 files (output from training_prep_pipeline)
--test_dfs: Test HDF5 files for model evaluation
--output_file_prefix: Prefix for output .pkl (model) and .h5 (results) files
--gt_type: Ground truth type - "exact" or "approximate" (default: exact)
--vcf_type: VCF type - "single_sample", "deep_variant", or "cnv" (default: single_sample)
--custom_annotations: Additional INFO annotations to include in training

`training_prep_pipeline`

Prepares training data by comparing variant calls to ground truth.

Purpose: Generate labeled training datasets (true positives, false positives, false negatives) for model training.

Usage:

training_prep_pipeline \
  --call_vcf calls.vcf.gz \
  --gt_type exact|approximate \
  --output_prefix training_data \
  [--base_vcf truth.vcf.gz] \
  [--reference reference.fa] \
  [--reference_sdf reference.sdf] \
  [--hcr high_confidence.bed] \
  [--blacklist blacklist.pkl] \
  [--custom_annotations ANNOTATION1 ANNOTATION2] \
  [--contigs_to_read chr1 chr2] \
  [--contig_for_test chr3] \
  [--ignore_genotype] \
  [--verbosity INFO]

Key Parameters:

--call_vcf: VCF file with variant calls to evaluate
--gt_type: Ground truth type - "exact" (requires base_vcf) or "approximate"
--base_vcf: Truth VCF file (required for exact ground truth)
--reference: Reference FASTA file prefix (requires .fai index)
--reference_sdf: Reference SDF folder (RTG format). If not provided, uses <reference>.sdf
--hcr: High confidence regions BED file
--output_prefix: Prefix for output HDF5 files (train and test sets)
--contig_for_test: Chromosome to use as test set
--ignore_genotype: Ignore genotype when comparing to ground truth
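
The labeling idea, and the effect of --ignore_genotype, can be illustrated with a simplified exact-match comparison. This is a sketch under stated assumptions, not the pipeline's method: the `Variant` type and `label_calls` function are hypothetical, and the --reference_sdf option suggests the real comparison is delegated to RTG Tools, which handles representation differences that exact matching ignores.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    genotype: str  # e.g. "0/1"


def label_calls(calls, truth, ignore_genotype=False):
    """Split calls into tp/fp and list truth variants that were missed (fn)."""
    def key(v):
        site = (v.chrom, v.pos, v.ref, v.alt)
        # With ignore_genotype, a call matches on site and alleles alone.
        return site if ignore_genotype else site + (v.genotype,)

    truth_keys = {key(v) for v in truth}
    call_keys = {key(v) for v in calls}
    labels = {"tp": [], "fp": [], "fn": []}
    for v in calls:
        labels["tp" if key(v) in truth_keys else "fp"].append(v)
    labels["fn"] = [v for v in truth if key(v) not in call_keys]
    return labels
```

A call with the right alleles but the wrong genotype counts as a false positive by default in this sketch, and as a true positive when `ignore_genotype=True`.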

`training_prep_cnv_pipeline`

Prepares training data specifically for CNV filtering models.

Purpose: Generate labeled CNV training datasets by comparing CNV calls to truth set.

Usage:

training_prep_cnv_pipeline \
  --call_vcf cnv_calls.vcf.gz \
  --base_vcf cnv_truth.vcf.gz \
  --output_prefix cnv_training_data \
  [--hcr high_confidence.bed] \
  [--custom_annotations ANNOTATION1 ANNOTATION2] \
  [--train_fraction 0.25] \
  [--ignore_cnv_type] \
  [--skip_collapse] \
  [--verbosity INFO]

Key Parameters:

--call_vcf: CNV call VCF file
--base_vcf: CNV truth VCF file
--output_prefix: Prefix for output HDF5 files
--train_fraction: Fraction of CNVs for training, rest for testing (default: 0.25)
--ignore_cnv_type: Ignore CNV type when matching to truth
--skip_collapse: Skip collapsing variants before comparison
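
As an illustration of what --train_fraction controls, the sketch below splits a list of records so that the given fraction goes to training and the rest to testing. The function name, the shuffling, and the fixed seed are assumptions; how the tool itself selects the training subset is not documented here.

```python
import random


def split_train_test(records, train_fraction=0.25, seed=0):
    """Shuffle records reproducibly and split them into (train, test)."""
    if not 0.0 <= train_fraction <= 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


if __name__ == "__main__":
    train, test = split_train_test(range(8))
    print(len(train), "training records,", len(test), "test records")
```

With the default fraction of 0.25, three quarters of the CNVs are held out for testing.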

Typical Workflows

Training and Applying a Filtering Model

Prepare training data:

training_prep_pipeline \
  --call_vcf calls.vcf.gz \
  --base_vcf truth.vcf.gz \
  --gt_type exact \
  --reference ref.fa \
  --hcr hcr.bed \
  --output_prefix training

Train the model:

train_models_pipeline \
  --train_dfs training_train.h5 \
  --test_dfs training_test.h5 \
  --output_file_prefix model

Apply filtering:

filter_variants_pipeline \
  --input_file new_calls.vcf.gz \
  --model_file model.pkl \
  --output_file filtered_calls.vcf.gz
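
The three steps can also be chained from Python. The sketch below only uses the flags and file names shown above; the function names are hypothetical, and the `<prefix>_train.h5` / `<prefix>_test.h5` naming is taken from the example rather than verified against the tool.

```python
import subprocess


def workflow_commands(call_vcf, truth_vcf, reference, hcr, new_calls,
                      prep_prefix="training", model_prefix="model"):
    """Build the three command lines of the workflow, in order."""
    return [
        ["training_prep_pipeline",
         "--call_vcf", call_vcf,
         "--base_vcf", truth_vcf,
         "--gt_type", "exact",
         "--reference", reference,
         "--hcr", hcr,
         "--output_prefix", prep_prefix],
        ["train_models_pipeline",
         "--train_dfs", f"{prep_prefix}_train.h5",
         "--test_dfs", f"{prep_prefix}_test.h5",
         "--output_file_prefix", model_prefix],
        ["filter_variants_pipeline",
         "--input_file", new_calls,
         "--model_file", f"{model_prefix}.pkl",
         "--output_file", "filtered_calls.vcf.gz"],
    ]


def run_workflow(commands):
    """Run each step in order; check=True stops at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Calling `run_workflow(workflow_commands("calls.vcf.gz", "truth.vcf.gz", "ref.fa", "hcr.bed", "new_calls.vcf.gz"))` would run the same commands as the workflow above, provided the CLI scripts are on the PATH.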

Dependencies

The filtering module depends on the following external tools: bcftools, samtools, GATK, RTG Tools, and Picard.
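
When installing with pip rather than using the Docker image, it can help to confirm that the external tools are reachable. The sketch below is hypothetical and rests on assumed executable names (`gatk`, `rtg`, and `picard` in particular may be wrapped or named differently in your installation).

```python
import shutil

# Assumed executable names; adjust to match your installation.
TOOLS = ["bcftools", "samtools", "gatk", "rtg", "picard"]


def find_tools(names=TOOLS):
    """Map each tool name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```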


ugbio-filtering 1.22.2 (released Mar 22, 2026)
