
Utilities for analyzing short tandem repeats (STRs)

Project description

This repo contains scripts and utilities for analyzing tandem repeats (TRs).

Installation

To install the latest version using pip, run:

python3 -m pip install --upgrade git+https://github.com/broadinstitute/str-analysis.git

or use the Docker image (though it may not have the latest version of the code):

docker run -it weisburd/str-analysis:latest

Tools

  • call_non_ref_motifs (docs) - takes a BAM or CRAM file and, optionally, an ExpansionHunter variant catalog. For each locus, it determines which STR motifs are supported by the reads overlapping that locus, and then runs ExpansionHunter on the detected motif(s).

  • filter_vcf_to_STR_variants - takes a single-sample VCF file and filters it to the INS/DEL variants that represent tandem repeat expansions or contractions by performing a brute-force k-mer search on each variant's inserted or deleted bases (the core idea is sketched after this list). This tool was a core part of Weisburd, B., Tiao, G. & Rehm, H. L., "Insights from a genome-wide truth set of tandem repeat variation" (2023).

  • merge_loci - takes one or more STR catalogs and combines them into a single catalog, removing duplicates based on overlap and repeat motif (a simplified version of this deduplication is sketched after this list).

  • annotate_and_filter_str_catalog - takes an STR catalog, annotates the loci based on their overlap with genes and known disease-associated STRs, and then allows filtering by motif size, gene region, and various other criteria.

  • compute_catalog_stats - takes an annotated catalog output by the annotate_and_filter_str_catalog script and computes various summary statistics about it.

  • add_offtarget_regions - takes an ExpansionHunter variant catalog and adds a list of off-target regions to each locus definition by querying a database of off-target regions precomputed for each TR motif. This database was generated by using wgsim to simulate fully repetitive reads for each motif, aligning them to hg19 and hg38 with bwa, and recording where they mapped.

  • add_adjacent_loci_to_expansion_hunter_catalog - takes an ExpansionHunter variant catalog and a BED file containing all simple repeats in the reference genome. Outputs a new catalog with updated LocusStructures and ReferenceRegions that include any adjacent repeats found near each locus in the input catalog.

  • check_trios_for_mendelian_violations - takes a table of combined ExpansionHunter calls generated by the combine_str_json_to_tsv script, as well as a FAM or PED file with parent/child relationships, and outputs a table of Mendelian violations in the callset (the underlying consistency check is sketched after this list).

  • simulate_str_expansions - uses wgsim to generate .bam files with simulated read data containing STR expansions at a given locus, with a given number of repeats, motif, zygosity, etc.

  • ExpansionHunterDenovo output post-processing:

    • annotate_EHdn_locus_outliers - takes an ExpansionHunterDenovo outlier result table (locus outliers or case-control) as well as a BED file containing all simple repeats in the reference genome and, optionally, a gene models GTF file, a variant catalog of known disease-associated loci, and/or other BED files with genomic regions of interest. Outputs a new table where each EHdn outlier is annotated with multiple columns derived from the provided reference data.
    • convert_annotated_EHdn_locus_outliers_to_expansion_hunter_catalog - takes the output table from annotate_EHdn_locus_outliers and lets the user apply a range of filters before writing out the passing loci to an ExpansionHunter variant catalog.
  • gnomAD STR calls:

  • post-process and combine ExpansionHunter outputs:

    • combine_str_json_to_tsv - takes a set of ExpansionHunter JSON output files and combines them into a single TSV table.
    • combine_json_to_tsv - takes a set of arbitrary JSON files that share the same schema and combines their top-level fields into a single TSV file (see the sketch after this list).
    • copy_EH_vcf_fields_to_json - takes the ExpansionHunter output VCF and JSON files for a given sample and copies fields that are only present in the VCF into the JSON file.
    • run_reviewer - takes ExpansionHunter output files for a single sample and runs REViewer on the subset of loci where the genotypes exceed locus-specific thresholds specified in the variant catalog.
  • format converters:

    • convert_bed_to_expansion_hunter_variant_catalog (see the sketch after this list)
    • convert_expansion_hunter_variant_catalog_to_gangstr_spec
    • convert_expansion_hunter_variant_catalog_to_hipstr_format
    • convert_expansion_hunter_variant_catalog_to_trgt_catalog
    • convert_expansion_hunter_variant_catalog_to_longtr_format
    • convert_gangstr_spec_to_expansion_hunter_variant_catalog
    • convert_expansion_hunter_denovo_locus_tsv_to_bed
    • convert_gangstr_vcf_to_expansion_hunter_json
    • convert_hipstr_vcf_to_expansion_hunter_json
    • convert_strling_calls_to_expansion_hunter_json
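
Illustrative sketches

The brute-force k-mer search at the heart of filter_vcf_to_STR_variants can be illustrated in a few lines of Python. The sketch below shows only the core idea (deciding whether a stretch of inserted or deleted bases consists of tandem copies of one motif); the actual script is considerably more thorough.

def find_repeat_motif(seq):
    """Return the shortest motif whose tandem repetition spells out seq,
    or None if seq is not at least two full copies of a single motif."""
    for k in range(1, len(seq) // 2 + 1):
        if len(seq) % k == 0 and seq[:k] * (len(seq) // k) == seq:
            return seq[:k]
    return None

assert find_repeat_motif("CAGCAGCAG") == "CAG"   # a 9 bp CAG expansion
assert find_repeat_motif("CAGGAT") is None       # not a tandem repeat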
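
The deduplication performed by merge_loci can be approximated by keying each locus on its coordinates plus a canonicalized motif, so that rotations and reverse complements of the same motif (e.g. AGC, GCA, and GCT) compare equal. In the hypothetical sketch below, loci are assumed to be dicts with chrom, start, end, and motif fields; the real script also collapses loci that merely overlap rather than coincide exactly.

def canonical_motif(motif):
    """Map a motif to a canonical form shared by all rotations of the
    motif and of its reverse complement (AGC, GCA, GCT => "AGC")."""
    rc = motif[::-1].translate(str.maketrans("ACGT", "TGCA"))
    rotations = [m[i:] + m[:i] for m in (motif, rc) for i in range(len(m))]
    return min(rotations)

def merge_catalogs(*catalogs):
    """Concatenate catalogs, keeping the first locus seen for each key."""
    merged = {}
    for catalog in catalogs:
        for locus in catalog:
            key = (locus["chrom"], locus["start"], locus["end"],
                   canonical_motif(locus["motif"]))
            merged.setdefault(key, locus)
    return list(merged.values())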
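
The test performed by check_trios_for_mendelian_violations boils down to asking whether the child's two alleles can be assigned one to each parent. Below is a toy version, assuming genotypes are given as pairs of repeat counts; the real script reads genotypes from the combined table and has to contend with genotyping error.

def is_mendelian_consistent(child, mother, father):
    """True if one child allele could have come from the mother and the
    other from the father (exact transmission, no de novo mutation)."""
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)

# a child genotype of (12, 45) repeats is consistent with mother (12, 13)
# and father (45, 46); a (12, 50) genotype would be flagged as a violation
assert is_mendelian_consistent((12, 45), (12, 13), (45, 46))
assert not is_mendelian_consistent((12, 50), (12, 13), (45, 46))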
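
combine_json_to_tsv is conceptually a simple reshaping of records. Below is a bare-bones, hypothetical version that assumes flat top-level fields; the real script offers more options.

import csv
import json

def combine_json_files_to_tsv(json_paths, tsv_path):
    """Read JSON files that share a flat top-level schema and write
    one TSV row per input file."""
    records = []
    for path in json_paths:
        with open(path) as f:
            records.append(json.load(f))
    with open(tsv_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=list(records[0]), delimiter="\t")
        writer.writeheader()
        writer.writerows(records)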
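
The format converters mostly target small JSON or text schemas. As an example of the kind of transformation convert_bed_to_expansion_hunter_variant_catalog performs, the sketch below turns one BED row into an ExpansionHunter variant catalog entry. The field names follow the ExpansionHunter catalog format, but the LocusId naming scheme and coordinate handling here are assumptions, so treat the script itself as authoritative.

import json

def bed_row_to_catalog_entry(chrom, start, end, motif):
    """Convert one BED interval (0-based, half-open) plus its repeat
    motif into an ExpansionHunter variant catalog entry."""
    return {
        "LocusId": f"{chrom}-{start}-{end}-{motif}",  # naming scheme assumed
        "LocusStructure": f"({motif})*",
        "ReferenceRegion": f"{chrom}:{start}-{end}",
        "VariantType": "Repeat",
    }

# a variant catalog is a JSON list of such entries
print(json.dumps([bed_row_to_catalog_entry("chr1", 1000, 1030, "CAG")], indent=2))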
