EGA - RefgenDetector

RefgenDetector is a bioinformatics tool that infers the reference genome assembly used to create alignment files (BAM/CRAM/header) and VCFs.

Alignment Files

It identifies major genome releases and derived assemblies across humans and multiple other species by analyzing contig names and lengths from the header. Benchmarking against 94 synthetic datasets achieved a 100% accuracy rate, while large-scale testing on 918,404 real-world files demonstrated 97.13% correctness, with failures occurring only when file headers are incomplete.
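The idea of matching contig names and lengths can be illustrated with a minimal sketch. It is not RefgenDetector's implementation: the lookup table below is a deliberately tiny, hypothetical stand-in that only knows the length of human chromosome 1 in three assemblies, and the function names are made up for this example.

```python
# Hypothetical, deliberately tiny lookup table: chr1 length -> assembly.
# The real tool relies on a much larger database of contigs and species.
CHR1_LENGTHS = {
    247_249_719: "hg18",
    249_250_621: "GRCh37",
    248_956_422: "GRCh38",
}


def parse_sq_lines(header_text):
    """Return {contig_name: length} from the @SQ lines of a SAM-style header."""
    contigs = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each field after the record type looks like TAG:VALUE.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "SN" in fields and "LN" in fields:
            contigs[fields["SN"]] = int(fields["LN"])
    return contigs


def guess_assembly(header_text):
    """Guess the assembly from the chr1 length, or return None if unknown."""
    contigs = parse_sq_lines(header_text)
    for name in ("chr1", "1"):  # UCSC-style and Ensembl-style naming
        if name in contigs:
            return CHR1_LENGTHS.get(contigs[name])
    return None  # header is incomplete: chromosome 1 is not listed
```

A header without an @SQ line for chromosome 1 returns None here, which mirrors the failure mode described above for incomplete headers.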

Description

RefgenDetector is able to infer the following reference genomes:

Primates

👤 Homo sapiens

  • hg16
  • hg17
  • hg18
  • GRCh37
  • GRCh38
  • T2T

🐒 Pan troglodytes

  • pantro3_0
  • Pan_troglodytes-2.1

🐵 Macaca mulatta

  • Mmul10
  • rheMac8
  • rheMac3

Rodents

🐭 Mus musculus

  • mm7
  • mm8
  • mm9
  • mm10
  • mm39

🐀 Rattus norvegicus

  • mRatBN7_2
  • Rnor_6_0

Other Mammals

🐷 Sus scrofa

  • Sscrofa10_2
  • Sscrofa11_1

Vertebrates (Non-Mammalian)

🐟 Danio rerio

  • danRer10
  • danRer11

Invertebrates

🪰 Drosophila melanogaster

  • dm5
  • dm6

🐛 Caenorhabditis elegans

  • WBcel215
  • WBcel235

Microorganisms & Plants

🧫 Escherichia coli

  • ASM886v2
  • ASM584v2

🌱 Arabidopsis thaliana

  • TAIR

🍺 Saccharomyces cerevisiae

  • R64

ref_manager.py - Customize the assemblies database.

ref_manager.py provides command-line management of reference genomes used by RefgenDetector. It allows users to add custom assemblies from FASTA index (.fai) files, list all available references, and remove previously added custom entries without modifying the source code.

Usage

python ref_manager.py <command> [options]

Commands

Add a reference

python ref_manager.py add <genome.fai> <reference_name> <species> # script
refgenDetector-manager add <genome.fai> <reference_name> <species> # pip installation

Registers a new reference from a valid .fai file. If the contig structure matches an existing reference, the entry is not added.

List references

python ref_manager.py list # script
refgenDetector-manager list # pip installation

Displays all available references, including both built-in and user-defined assemblies.

Remove a reference

python ref_manager.py remove <reference_name> # script
refgenDetector-manager remove <reference_name> # pip installation

Removes a custom reference from the local database. Built-in references cannot be removed.

Notes

  • Custom references are stored separately from the default reference database.
  • Input files must be valid FASTA index files generated with samtools faidx.
  • Duplicate assemblies are detected based on exact contig composition.
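As a rough illustration of duplicate detection by exact contig composition, the sketch below reads the first two columns of a .fai file (contig name and length) and compares the resulting set with already known references. The function names and the in-memory dictionary of known references are assumptions made for this example; they do not describe how ref_manager.py stores its database.

```python
def read_fai(path):
    """Return the frozenset of (contig_name, length) pairs in a .fai file."""
    contigs = set()
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue
            # A .fai line is tab-separated: name, length, offset, linebases, linewidth.
            name, length = line.rstrip("\n").split("\t")[:2]
            contigs.add((name, int(length)))
    return frozenset(contigs)


def find_duplicate(new_contigs, known_references):
    """Return the name of a known reference with identical contigs, or None.

    `known_references` maps reference name -> frozenset of (name, length)
    pairs; this layout is hypothetical.
    """
    for ref_name, contigs in known_references.items():
        if contigs == new_contigs:
            return ref_name
    return None
```

Comparing sets makes the check independent of the order in which contigs appear in the index file.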

Variant Calling Files (VCFs)

From VCF files, only four human assemblies can be inferred:

  • hg18
  • GRCh37
  • GRCh38
  • T2T

Two sources of information are used to infer the reference genome from variant calling files:

  • Header

The VCF specification recommends, but does not require, that the header include tags describing the reference and contigs backing the data in the file. When these tags are present, the tool analyzes them and reports the reference genome version based on the contig lengths, following the same logic used for alignment files.

  • Variants

To infer the reference genome from the variants, the tool reads the VCF file in chunks of 100,000 variants, which avoids loading the complete file into memory. The POS and REF columns are extracted and compared to the msgpack files.

The msgpack files were created by comparing the nucleotides at each position in hg18, GRCh37, GRCh38 and T2T. Each file contains a list of the positions where each reference had a different nucleotide (distinguishing positions).

By counting the matches between these distinguishing positions and the REF values present in the VCF, the tool infers the reference genome version used to call the variants.
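The chunked matching described above can be sketched as follows. This is only an illustration under stated assumptions: the layout of the distinguishing positions (a dictionary per assembly keyed by chromosome and position) is hypothetical and is not the actual msgpack format, the chromosome key is added here for clarity, and all function names are invented for the example.

```python
from itertools import islice

CHUNK_SIZE = 100_000  # chunk size stated in the text


def iter_chunks(records, size=CHUNK_SIZE):
    """Yield lists of at most `size` records without loading the whole input."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def count_matches(records, distinguishing, size=CHUNK_SIZE):
    """Count, per assembly, the records whose REF equals the distinguishing base.

    `records` yields (chrom, pos, ref) tuples; `distinguishing` maps
    assembly -> {(chrom, pos): base}. Both layouts are assumptions.
    """
    counts = {assembly: 0 for assembly in distinguishing}
    for chunk in iter_chunks(records, size):
        for chrom, pos, ref in chunk:
            for assembly, table in distinguishing.items():
                if table.get((chrom, pos)) == ref:
                    counts[assembly] += 1
    return counts


def infer_assembly(records, distinguishing, size=CHUNK_SIZE):
    """Return the assembly with the most matches, or None if nothing matched."""
    counts = count_matches(records, distinguishing, size)
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

Because `records` can be any iterator, for example a generator that parses VCF lines one by one, only one chunk is held in memory at a time.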

Requirements

  • Python 3.10.6

Depending on how you want to install the package:

  • pip
  • Docker

Installation

Cloning this repository

  1. Clone this repository and install it:

git clone https://github.com/EGA-archive/refgenDetector.git
cd refgenDetector
pip install -e .

  2. Check that the installation works:

$ python3 refgenDetector_main.py -h

  3. Download the msgpack files for the inference with VCFs: Download the msgpack reference

  4. Move the msgpack archive to the correct path and unzip it:

mv msgpacks.zip /refgenDetector/src/refgenDetector/
unzip /refgenDetector/src/refgenDetector/msgpacks.zip

From pypi

$ pip install refgenDetector

Usage

You can get the help menu by running:

$ refgenDetector -h
usage: INFERRING THE REFERENCE GENOME USED TO ALIGN BAM OR CRAM FILE [-h] -f FILE -t {BAM/CRAM,Header,VCF,BIM} [--md5] [-a] [-v MAX_N_VAR] [-m MATCHES] [-r]

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  Input file path
  -t {BAM/CRAM,Header,VCF,BIM}, --type {BAM/CRAM,Header,VCF,BIM}
                        Type of files to analyze.
  --md5                 Print md5 values if present in header.
  -a, --assembly        Print assembly if present in header.
  -v MAX_N_VAR, --max_n_var MAX_N_VAR
                        Maximum number of variants to read before stopping inference. The file is processed in chunks of 100,000 variants, so this value must be a multiple of 100,000 (e.g. 100000,
                        200000, 300000, ...).
  -m MATCHES, --matches MATCHES
                        Number of matches required before stopping. [DEFAULT:5000]
  -r, --resources       When set, print execution time, CPU, memory, and disk I/O usage
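For scripted use, the options listed in the help menu can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of RefgenDetector; it only uses flags shown in the help output and enforces the documented rule that the maximum number of variants must be a multiple of 100,000.

```python
CHUNK = 100_000  # chunk size stated in the help text
FILE_TYPES = {"BAM/CRAM", "Header", "VCF", "BIM"}  # choices listed for -t


def build_command(path, file_type, max_n_var=None, matches=None):
    """Build the refgenDetector argument list (hypothetical helper)."""
    if file_type not in FILE_TYPES:
        raise ValueError(f"unsupported file type: {file_type}")
    cmd = ["refgenDetector", "-f", path, "-t", file_type]
    if max_n_var is not None:
        # -v must be a positive multiple of the 100,000-variant chunk size.
        if max_n_var <= 0 or max_n_var % CHUNK != 0:
            raise ValueError("max_n_var must be a positive multiple of 100,000")
        cmd += ["-v", str(max_n_var)]
    if matches is not None:
        cmd += ["-m", str(matches)]
    return cmd
```

The returned list can be passed to `subprocess.run` once the package is installed; `sample.vcf` in a call such as `build_command("sample.vcf", "VCF", max_n_var=200_000)` is a placeholder file name.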

Test RefgenDetector

The examples folder contains header, alignment and variant files that you can use to test RefgenDetector.

Licence and funding

RefgenDetector is released under the GNU General Public License v3.0.

It was funded by ELIXIR, the research infrastructure for life-science data (ELIXIR Beacon Implementation Studies 2019-2021 and 2022-2023).
