
Ensemble genotyping of tandem repeats

EnsembleTR

EnsembleTR is a tool for ensemble Tandem Repeat (TR) calling. It takes one or more VCF files with TR genotypes for a panel of samples and outputs a consensus set of genotypes.

Installation

python3 setup.py install --user

To check the installation, type EnsembleTR. The help message should appear.

Usage

To run EnsembleTR, use the following command:

EnsembleTR --out output.vcf \
           --ref ref.fa \
           --vcfs vcf1.vcf,vcf2.vcf,...

Required parameters:

  • --vcfs <file.vcf,[file2.vcf]> Comma-separated list of input VCF files
  • --ref Reference genome (.fa)
  • --out Path to output VCF file
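
When EnsembleTR is driven from a script, the three required parameters above can be assembled into an argument list. The following is a minimal sketch: the helper name build_ensembletr_cmd and the file names are hypothetical, and only the --vcfs, --ref, and --out options listed above are used.

```python
def build_ensembletr_cmd(vcfs, ref, out):
    """Return the EnsembleTR argument list for the three required options.

    build_ensembletr_cmd is a hypothetical helper, not part of EnsembleTR.
    """
    if not vcfs:
        raise ValueError("at least one input VCF is required")
    return [
        "EnsembleTR",
        "--out", out,
        "--ref", ref,
        "--vcfs", ",".join(vcfs),  # --vcfs takes a comma-separated list
    ]


if __name__ == "__main__":
    # File names below are placeholders.
    cmd = build_ensembletr_cmd(["vcf1.vcf", "vcf2.vcf"], "ref.fa", "output.vcf")
    print(" ".join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) once EnsembleTR is installed and on the PATH.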

File formats

VCF (--vcfs)

Both zipped and unzipped VCF files are accepted as input. EnsembleTR can currently process VCF files generated by HipSTR, GangSTR, adVNTR, and ExpansionHunter.

FASTA Reference genome (--ref)

You must input a reference genome in FASTA format. This must be the same reference build used for TR calling in input files.

VCF (--out)

For more information on the VCF file format, see the VCF spec. The EnsembleTR output VCF contains several custom fields, described below.

INFO fields

INFO fields contain aggregated statistics about each TR. The following custom fields are added:

FIELD    DESCRIPTION
START    Start position of the TR
END      End position of the TR
PERIOD   Length of the repeat unit
RU       Repeat motif
METHODS  Methods that attempted to genotype this locus (AdVNTR, EH, HipSTR, GangSTR)
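
To read these INFO fields without a VCF library, the INFO column of a record can be split into key/value pairs. This is a minimal sketch using only the Python standard library. The function name parse_info and the example values are hypothetical, and METHODS is left out of the example because its encoding is not described here.

```python
def parse_info(info_column):
    """Split a VCF INFO column (semicolon-separated KEY=VALUE pairs) into a dict."""
    info = {}
    for entry in info_column.split(";"):
        if not entry or entry == ".":
            continue  # skip empty entries and the VCF missing-value marker
        key, sep, value = entry.partition("=")
        # Flag-style entries have no "=", so store them as True.
        info[key] = value if sep else True
    return info


if __name__ == "__main__":
    # Hypothetical INFO column, for illustration only.
    example = "START=1000;END=1023;PERIOD=2;RU=AC"
    info = parse_info(example)
    start, end, period = int(info["START"]), int(info["END"]), int(info["PERIOD"])
    print(start, end, period, info["RU"])
```

Values are kept as strings and converted only where a numeric type is needed, since the exact encoding of each field should be checked against the header of the output VCF.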

FORMAT fields

FORMAT fields contain information specific to each genotype call. The following custom fields are added:

FIELD   DESCRIPTION
GT      Genotype
GB      Base pair difference from ref allele
NCOPY   Genotype given in number of copies of the repeat motif
EXP     Boolean showing if the genotype alleles were expanded
SCORE   Score of the consensus call
GTS     Method(s) that support the consensus call
ALS     Number of times each bp difference was seen across all calls
INPUTS  Raw calls

SCORE is calculated by aggregating quality information from the calls that are merged at each locus.
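
A common downstream step is to keep only consensus calls above a chosen SCORE. The sketch below pairs the FORMAT keys of a record with one sample's values and applies a threshold. It assumes SCORE is stored as a number. The function names, the example values, and the 0.9 cutoff are hypothetical and not taken from EnsembleTR.

```python
def parse_sample(format_column, sample_column):
    """Pair FORMAT keys with one sample's colon-separated values."""
    keys = format_column.split(":")
    values = sample_column.split(":")
    return dict(zip(keys, values))


def passes_score(call, min_score):
    """Return True if the call has a numeric SCORE of at least min_score."""
    try:
        return float(call.get("SCORE", "")) >= min_score
    except ValueError:
        return False  # missing (".") or non-numeric score


if __name__ == "__main__":
    # Hypothetical FORMAT and sample columns, for illustration only.
    call = parse_sample("GT:NCOPY:SCORE", "0/1:12,13:0.95")
    print(call["GT"], passes_score(call, 0.9))
```

Treating a missing or unparsable SCORE as a failed filter is a conservative design choice; a different policy may suit analyses that want to retain such calls.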

Using statSTR on EnsembleTR files

You can use statSTR from TRTools to compute various per-locus statistics for EnsembleTR VCF files.

For example, to compute per-locus allele frequencies, use the following command:

statSTR --vcf EnsembleTR_file.vcf.gz \
        --vcftype hipstr \
        --afreq \
        --out EnsembleTR_per_locus_allele_frequency

Version II of EnsembleTR calls on samples from 1000 Genomes Project and H3Africa

Chromosome 1 VCF file and tbi file

Chromosome 2 VCF file and tbi file

Chromosome 3 VCF file and tbi file

Chromosome 4 VCF file and tbi file

Chromosome 5 VCF file and tbi file

Chromosome 6 VCF file and tbi file

Chromosome 7 VCF file and tbi file

Chromosome 8 VCF file and tbi file

Chromosome 9 VCF file and tbi file

Chromosome 10 VCF file and tbi file

Chromosome 11 VCF file and tbi file

Chromosome 12 VCF file and tbi file

Chromosome 13 VCF file and tbi file

Chromosome 14 VCF file and tbi file

Chromosome 15 VCF file and tbi file

Chromosome 16 VCF file and tbi file

Chromosome 17 VCF file and tbi file

Chromosome 18 VCF file and tbi file

Chromosome 19 VCF file and tbi file

Chromosome 20 VCF file and tbi file

Chromosome 21 VCF file and tbi file

Chromosome 22 VCF file and tbi file

Version II of reference SNP+TR haplotype panel for imputation of TR variants

Dataset description

Phased variants of 3,202 samples from the 1000 Genomes Project (1kGP).

TRs imputed from 3,202 1kGP samples.

Total 70,692,015 variants + 1,091,550 TR markers.

All the coordinates are based on the hg38 human reference genome.

Availability

Chromosome 1 VCF file and tbi file

Chromosome 2 VCF file and tbi file

Chromosome 3 VCF file and tbi file

Chromosome 4 VCF file and tbi file

Chromosome 5 VCF file and tbi file

Chromosome 6 VCF file and tbi file

Chromosome 7 VCF file and tbi file

Chromosome 8 VCF file and tbi file

Chromosome 9 VCF file and tbi file

Chromosome 10 VCF file and tbi file

Chromosome 11 VCF file and tbi file

Chromosome 12 VCF file and tbi file

Chromosome 13 VCF file and tbi file

Chromosome 14 VCF file and tbi file

Chromosome 15 VCF file and tbi file

Chromosome 16 VCF file and tbi file

Chromosome 17 VCF file and tbi file

Chromosome 18 VCF file and tbi file

Chromosome 19 VCF file and tbi file

Chromosome 20 VCF file and tbi file

Chromosome 21 VCF file and tbi file

Chromosome 22 VCF file and tbi file

Usage

Use Beagle to impute TRs into SNP data:

java -Xmx4g -jar beagle.version.jar \
            gt=SNPs.vcf.gz \
            ref=${chrom}_final_SNP_merged.vcf.gz \
            out=imputed_TR_SNPs
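
Because the reference panel is split by chromosome, the Beagle command above is typically repeated once per chromosome. The following sketch builds one argument list per chromosome from the gt=, ref=, and out= arguments shown above. The helper name beagle_cmd, the chr1 to chr22 naming, and the per-chromosome output prefix are assumptions made for illustration.

```python
def beagle_cmd(chrom, snp_vcf, jar="beagle.version.jar", memory="4g"):
    """Build the Beagle argument list for one chromosome.

    beagle_cmd is a hypothetical helper; jar is the placeholder name
    used in the command above and must point to a real Beagle jar.
    """
    return [
        "java", f"-Xmx{memory}", "-jar", jar,
        f"gt={snp_vcf}",
        f"ref={chrom}_final_SNP_merged.vcf.gz",
        # Assumption: suffix the output prefix so runs do not overwrite each other.
        f"out=imputed_TR_SNPs_{chrom}",
    ]


if __name__ == "__main__":
    # Assumption: chromosomes are named chr1 ... chr22.
    for number in range(1, 23):
        print(" ".join(beagle_cmd(f"chr{number}", "SNPs.vcf.gz")))
```

Each list can be passed to subprocess.run(cmd, check=True) once Java, the Beagle jar, and the panel files are available locally.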

Additional resources

Per-locus summary statistics can be downloaded from here. Each table has information on coordinates, repeat unit sequence, and potential overlap with genes listed in GENCODE v22 for repeats in the EnsembleTR catalog.

Population-specific per-locus statistics on allele frequency, heterozygosity, and the number of called samples can be found here. Statistics are computed using statSTR from the TRTools package.

Version I

For version I of EnsembleTR calls, use https://ensemble-tr.s3.us-east-2.amazonaws.com/split/ensemble_chr"$chr"_filtered.vcf.gz for the VCF file and https://ensemble-tr.s3.us-east-2.amazonaws.com/split/ensemble_chr"$chr"_filtered.vcf.gz.tbi for the tbi file.

For version I of phased panels, use https://ensemble-tr.s3.us-east-2.amazonaws.com/phased-split/chr"$chr"_final_SNP_merged.vcf.gz for the VCF file and https://ensemble-tr.s3.us-east-2.amazonaws.com/phased-split/chr"$chr"_final_SNP_merged.vcf.gz.csi for the index file (note that this index uses the .csi extension, not .tbi).
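
The "$chr" placeholder in the Version I URLs is filled in per chromosome. The sketch below expands both URL patterns for a given chromosome number. The function name version1_urls and the dictionary keys are hypothetical, and the 1 to 22 range is an assumption based on the chromosomes listed for Version II.

```python
BASE = "https://ensemble-tr.s3.us-east-2.amazonaws.com"


def version1_urls(chrom):
    """Return the Version I call-set and phased-panel URLs for one chromosome."""
    calls = f"{BASE}/split/ensemble_chr{chrom}_filtered.vcf.gz"
    panel = f"{BASE}/phased-split/chr{chrom}_final_SNP_merged.vcf.gz"
    return {
        "calls_vcf": calls,
        "calls_index": calls + ".tbi",
        "panel_vcf": panel,
        "panel_index": panel + ".csi",  # the phased panel index is .csi
    }


if __name__ == "__main__":
    # Assumption: autosomes 1 to 22 are available.
    for chrom in range(1, 23):
        for name, url in version1_urls(chrom).items():
            print(chrom, name, url)
```

The printed URLs can be handed to any download tool; nothing is fetched by the sketch itself.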
