Skip to main content

Annotate genetic inheritance models in variant files

Project description

DOI

GENMOD is a simple to use command line tool for annotating and analyzing genomic variations in the VCF file format. It can annotate genetic patterns of inheritance in vcf:s with single or multiple families of arbitrary size.

The tools in the genmod suite are:

  • genmod annotate, for annotating inheritance patterns, frequencies, cadd scores etc.

  • genmod build_annotation, for building new annotation sets from different sources

  • genmod analyze, do a basic analysis of the annotated variants in a vcf file

  • genmod summarize, to get some basic statistics of the annotated variants in a vcf file

Installation:

GENMOD works with Python 2.7 and Python v3.2 and above

pip install genmod

or

git clone https://github.com/moonso/genmod.git
cd genmod
python setup.py install

USAGE:

genmod annotate

genmod annotate variant_file.vcf --family_file ped_file

This will print a new vcf to standard out with all variants annotated according to the statements below. All individuals described in the ped file must be present in the vcf file

See examples in the folder genmod/examples.

From version 1.9 genmod can split multiallelic calls in vcf:s, use flag -split/–split_variants.

To get an example of how splitting variants work, run genmod on the file examples/multi_allele_example.vcf with the dominant trio. That is: genmod annotate examples/multi_allele_example.vcf -f examples/dominant_trio.ped -split

Compare the result when not using the -split flag.

Genmod is distributed with a annotation database that is built from the refGene data. If the user wants to build a new annotation set use the command below:

genmod build_annotation [--type] annotation_file

Each variant in the VCF-file will be annotated with which genetic models that are followed in the family if a family file (ped file) is provided.

The genetic models that are checked are the following:

  • Autsomal Recessive, denoted ‘AR_hom’

  • Autsomal Recessive denovo, denoted ‘AR_hom_dn’

  • Autsomal Dominant, ‘AD’

  • Autsomal Dominant denovo, ‘AD_dn’

  • Autosomal Compound Heterozygote, ‘AR_comp’

  • X-linked dominant, ‘XD’

  • X-linked dominant de novo, ‘XD_dn’

  • X-linked Recessive, ‘XR’

  • X-linked Recessive de novo, ‘XR_dn’

Se description of how genetic models are annotated in the section Conditions for genetic models below.

It is possible to run without a family file, in this case all variants will be annotated with which region(s) they belong to, and if other annotation files are provided(1000G, CADD scores etc.) the variants will get the proper values from these.

Variant Effect Predictor(vep) annotations are supported, use the --vep-flag if variants are already annotated with vep.

GENMOD will add entrys to the INFO column for the given VCF file depending on what information is given.

If --vep is NOT provided:

  • Annotation Comma separated list with features overlapped in the annotation file

If --vep is used Annotation will not be annotated since all information is in the vep entry.

If a pedigree file is provided the following will be added:

  • GeneticModels A comma separated list with which genetic models that are followed in each family described in the ped file. Annotation are separated with pipes on the form GeneticModels=fam_id_1:AR_hom, fam_id_2:AR_comp|AD_dn etc..

  • Compounds Comma separated list with compound pairs(if any) for each family. These are described like ‘CHR_POS_REF_ALT’

  • ModelScore Model Score, a phred-score based on the genotype qualities to describe the uncertainty of the genetic model in each family

Also a line for logging is added in the vcf header with the id genmod, here the date of run, version and command line arguments are printed.

  • Compound heterozygote inheritance pattern will be checked if two variants are exonic (or in canonical splice sites) and if they reside in the same gene.

  • If compounds should be checked in the whole gene (including introns) use --whole_gene

  • GENMOD supports phased data, use the -phased flag. Data should follow the GATK way of phasing.

All annotations will be present only if they have a value.

  • GENMOD can annotate the variants with 1000 genome frequencies. Use the flag -kg/--thousand_g path/to/bgzipped/thousand_genomes.vcf.gz

  • GENMOD also supports annotation of frequencies from the ExAC. Use the flag --exac path/to/bgzipped/ExAC_file.vcf.gz

  • Annotate with CADD scores, use -cadd/--cadd_file path/to/huge_cadd_file.tsv.gz.

  • There several cadd files with different variant sets to cover as much as possible.

    • One with all 1000 genomes positions (this one include some indels), if annotation with this one use -c1kg/--cadd_1000_g path/to/CADD_1000g.txt.gz.

    • One with all variants from the ESP6500 dataset. If annotation with this one use --cadd_esp path/to/CADD_ESP.tsv.gz.

    • One with all variants from the ExAC dataset. If annotation with this one use --cadd_exac path/to/CADD_ExAC.tsv.gz.

    • One with 12.3M InDels from the CADD resources. If annotation with this one use --cadd_indels path/to/CADD_InDels.txt.gz.

  • By default the relative cadd scores is annotated with ‘CADD=score’, there is also an alternative to annotate with the raw cadd scores using the --cadd_raw flag. In this case a info field ‘CADD_raw=score’.

  • If your VCF is already annotated with VEP, use -vep/--vep

  • If data is phased use -phased/--phased

  • If you want to allow compound pairs in intronic regions to use -gene/--whole_gene

  • If you want canonical splice site region to be bigger than 2 base pairs on each side of the exons, use -splice/--splice_padding <integer>

  • The -strict/--strict flag tells genmod to only annotate genetic models if they are proved by the data. If a variant is not called in a family member it will not be annotated.

genmod build_annotation

genmod build_annotation [--type] [-o/--outdir] annotation_file

The following file formats are supported for building new annotations:

  • bed

  • ccds

  • gtf

  • gene_pred

The user can also specify the amount of positions around exon boundaries that should be considered as splice sites. Use

--splice_padding INTEGER

genmod analyze

From version 1.6 there is also a tool for analyzing the variants annotated by genmod. This tool will look at all variants in a vcf and do an analysis based on which inheritance patterns they follow. The variants are then ranked based on the cadd scores, the highest ranked variants for each category is printed to screen and the full list for each category is printed to new vcf files. Run with:

genmod analyze path/to/file.vcf

For more information do

genmod analyze --help

genmod summarize

Tool to get basic statistics of the annotated in a vcf file. Run

genmod summarize --help

for more information.

Conditions for Genetic Models

Short explanation of genotype calls in VCF format:

Since we only look at humans, that are diploid, the genotypes represent what we see on both alleles in a single position. 0 represents the reference sequence, 1 is the first of the alternative alleles, 2 second alternative and so on. If no phasing has been done the genotype is an unordered pair on the form x/x, so 0/1 means that the individual is heterozygote in this given position with the reference base on one of the alleles and the first of the alternatives on the other. 2/2 means that we see the second of the alternatives on both alleles. Some chromosomes are only present in one copy in humans, here it is allowed to only use a single digit to show the genotype. A 0 would mean reference and 1 first of alternatives.

If phasing has been done the pairs are not unordered anymore and the delimiter is then changed to ‘|’, so one can be heterozygote in two ways; 0|1 or 1|0.

Autosomal Recessive

For this model individuals can be carriers so healthy individuals can be heterozygous. Both alleles need to have the variant for an individual to be sick so a healthy individual can not be homozygous alternative and a sick individual has to be homozygous alternative.

  • Affected individuals have to be homozygous alternative (hom. alt.)

  • Healthy individuals cannot be hom. alt.

  • Variant is considered de novo if both parents are genotyped and do not carry the variant

Autosomal Dominant

  • Affected individuals have to be heterozygous (het.)

  • Healthy individuals cannot have the alternative variant

  • Variant is considered de novo if both parents are genotyped and do not carry the variant

Autosomal Compound Heterozygote

This model includes pairs of exonic variants that are present within the same gene. The default behaviour of GENMOD is to look for compounds only in exonic/canonical splice sites. The reason for this is that since some genes have huge intronic regions the data will be drowned in compound pairs. If the user wants all variants in genes checked use the flag -gene/–whole_gene.

  1. Non-phased data:

    • Affected individuals have to be het. for both variants

    • Healthy individuals can be het. for one of the variants but cannot have both variants

    • Variant is considered de novo if only one or no variant is found in the parents

  2. Phased data:

    • All affected individuals have to be het. for both variants and the variants has to be on two different alleles

    • Healthy individuals can be heterozygous for one but cannot have both variants

    • If only one or no variant is found in parents it is considered de novo

X-Linked Dominant

These traits are inherited on the x-chromosome, of which men have one allele and women have two.

  • Variant has to be on chromosome X

  • Affected individuals have to be het. or hom. alt.

  • Healthy males cannot carry the variant

  • Healthy females can carry the variant (because of X inactivation)

  • If sex is male the variant is considered de novo if mother is genotyped and does not carry the variant

  • If sex is female variant is considered de novo if none of the parents carry the variant

X Linked Recessive

  • Variant has to be on chromosome X

  • Affected males have to be het. or hom. alt. (het is theoretically not possible in males, but can occur due to Pseudo Autosomal Regions).

  • Affected females have to be hom. alt.

  • Healthy females cannot be hom. alt.

  • Healthy males cannot carry the variant

  • If sex is male the variant is considered de novo if mother is genotyped and does not carry the variant

  • If sex is female variant is considered de novo if not both parents carry the variant

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

genmod-2.0.4.tar.gz (10.8 MB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page