Skip to main content

Identification of trios and other close relationships in multisample VCF files

Project description

# VCFped
VCFped is a command line utility which quickly and confidently identifies trios and close pairwise relationships in multisample VCF files.
The algorithm does not rely on allele frequencies or other annotations, and is highly robust to inbreeding and poor quality variants.

## Installation and basic usage:
The simplest installation of VCFped is using pip:

>pip install vcfped

To run VCFped on a variant file *myvariants.vcf*, the basic command is

>vcfped myvariants.vcf

In most circumstances VCFped can be run unsupervised with default parameters. For overview of the options, use the -h option:

>vcfped -h
usage: vcfped.py [-h] [--quiet] [--version] [--no-gender] [--no-pairwise]
[--no-trio] [--all] [-o PREFIX]
[-v VARIABLES [VARIABLES ...]]
[-p PERCENTILES [PERCENTILES ...]] [-e EXACTMAX]
[-s SAMPLESIZE] [-d SAMPLESIZEAABB] [-t1 T1_THRESH]
[-t2 T2_THRESH] [-mz MZ_THRESH] [-po PO_THRESH]
[-female FEMALE_THRESH] [-male MALE_THRESH]
file


positional arguments:
file path to VCF file

optional arguments:
-h, --help show this help message and exit
--quiet do not print information on the screen
--version show program's version number and exit
--no-gender skip gender analysis
--no-pairwise skip pairwise analysis
--no-trio skip trio analysis
--all show results for all pairs/triples (not only the
inferred)
-o PREFIX prefix for output files
-v VARIABLES [VARIABLES ...]
quality variables to be used for filtering
-p PERCENTILES [PERCENTILES ...]
filtering percentile ranks
-e EXACTMAX if approx. line count exceeds this, apply random
sampling
-s SAMPLESIZE sample at least this many variant lines (if sampling)
-d SAMPLESIZEAABB sample at least this many lines where both 0/0 and 1/1
occur as genotypes (if sampling)
-t1 T1_THRESH threshold (%) for T1 score (AA + BB = AB)
-t2 T2_THRESH threshold (%) for T2 score (BB + BB = BB)
-mz MZ_THRESH threshold (%) for MZ score (IBS = 2 | neither is AA)
-po PO_THRESH threshold (%) for PO score (IBS > 0 | either is BB)
-female FEMALE_THRESH
lower limit (%) for female heterozygosity on X
-male MALE_THRESH upper limit (%) for male heterozygosity on X

---

## Introduction
Correct relationships are crucial for successful analysis of family-based sequencing, e.g. detection of de novo variants in trios. Traditional methods for relatedness inference are not easily adaptable to sequencing data, since several basic assumptions of these methods are not satisfied (e.g. independent marker loci, correct allele frequencies, non-censored data). VCFped sidesteps these assumptions by exploiting forced allele sharing between closely related individuals. In particular, the algorithm does not depend on allele frequencies or other variant annotations, making the program well suited as a quality control step prior to annotation and downstream analysis.

VCFped is written in Python and has been extensively tested on variant files from whole-exome sequencing (WES), whole-genome sequencing (WGS) and the TruSightOne gene panel from Illumina.

## Filtering strategy
The relatedness testing algorithms of VCFped depend on correct genotypes for the variants. To minimize the inclusion of wrong genotypes a semi-automated filtering strategy has been implemented, using the quality variables present in the variant file (e.g. QUAL, DP, GQ). The complete distributions of the available variables are computed when loading the file, and a stepwise filtering process is created based on percentiles of these distributions (see Figure 1). The default percentile ranks are 10, 30 and 50, but these can be modified using the *-p* option.

## Relatedness inference using forced allele sharing
For a few close relationships the genotype of one individual can be determined (under Mendelian inheritance) by the genotypes of others. VCFped exploits such forced allele sharing to identify trios, parent-child pairs and monozygotic twins. In the description below, *A* and *B* denote the labels of a diallelic locus, while *g<sub>i</sub>* is the genotype of individual *i*. For trio and pariwise relationships only autosomal loci are considered, while X-linked loci are used for gender inference.

### Trio relationships
To identify trios a combination of two tests are used. The first (*AA + BB = AB*) recognizes the three basic trio types: regular (parent-offspring), inverted (two siblings and one parent) and generational (child-parent-grandparent). The second (*BB + BB = BB*) discriminates regular trios from the other two types. Both test scores are computed for every cyclic ordering of each triple of samples in the input variant file.

#### *T1 score = freq(g<sub>3</sub> = AB | g<sub>1</sub> = AA, g<sub>2</sub> = BB or vice versa)*
What's computed: The conditional frequency of genotype AB in one sample given AA and BB in the two others. The AB genotype is forced for the child of a regular trio, the parent in an inverted trio, and the middle individual in a generational trio.
Default threshold: 90 % (modify with option -t1)

#### *T2 score = freq(g<sub>3</sub> = BB | g<sub>1</sub> = g<sub>2</sub> = BB)*
What's computed: The conditional frequency of genotype BB in one sample given BB in both the others. This is forced for the child of a regular trio, but also if individual 3 is a monozygotic twin (or a sample duplicate!) of one of the others.
Default threshold: 95 % (modify with option -t2)

### Pairwise relationships
For pairwise relatedness VCFped tests for monozygotic twins (MZ) and parent-offspring (PO), which both obey patterns of forced allele sharing. MZ twins always have both alleles identical by state (IBS), while parent-offspring are always IBS > 0. The conditions in the following expressions ensure that the scores are unaffected by the data censoring of VCF files.

#### *MZ score = freq(IBS = 2 | neither is AA)*
What's computed: The frequency of equal genotypes among all variants where both have a least one B.
Default threshold: 95 % (modify with option -mz)

#### *PO score = freq(IBS > 0 | either is BB)*
What's computed: The frequency of at least one shared allele, given that at least one is homozygous BB.
Default threshold: 99 % (modify with option -po)

### Gender prediction
VCFped predicts the gender of each sample by using variants on X (except pseudoautosomal regions):

#### *XHET = freq(AB | AB or BB)*
What's computed: The heterozygosity on X among all non-AA variants.
Default interpretation: *male* if XHET < 5 %, *female* if XHET > 25 % (modify with options -male and -female)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vcfped-1.3.1.zip (587.6 kB view details)

Uploaded Source

File details

Details for the file vcfped-1.3.1.zip.

File metadata

  • Download URL: vcfped-1.3.1.zip
  • Upload date:
  • Size: 587.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for vcfped-1.3.1.zip
Algorithm Hash digest
SHA256 cf542edd7439455edeed25d6ede4babbf3fb28d40359cf2f9ad4a8ee99401f77
MD5 83a4824388af9485f26f1be01900164f
BLAKE2b-256 6fa9469ec9bea5de8b94964a1c186ef299cc8c2b01d2c51154f8378552351954

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page