Generate and store genetic data into an Elasticsearch database.

genelastic

Genelastic is a set of tools for comparing genetic technologies.

It includes a set of scripts to import and manage genetic data in an Elasticsearch database, as well as a REST API serving a dedicated user interface to query, visualize and compare imported data.

I - Prerequisites

  • python >= 3.11

II - Installation

  • With pipx (recommended):

    pipx install genelastic
    

    pipx installs genelastic in its own isolated virtual environment and makes all scripts (gnl-*) available globally.

  • Or with pip:

    python -m venv .venv
    source .venv/bin/activate  # Activate the virtual environment
    pip install genelastic
    

    It is recommended to install genelastic inside its own virtual environment to avoid conflicts with globally installed Python packages. To run the scripts, you need to activate the environment first.

Test the installation by running one of the genelastic scripts:

gnl-import -h  # Print the help message and exit.

III - Core concepts

This section introduces the main ideas behind Genelastic. It explains how a bundle (YAML manifest) describes analyses, how files are located with file prefixes and tags, which data file types are supported, and how they are imported into Elasticsearch.

Bundle

A bundle is a YAML manifest that describes one or several analyses.
It defines:

  • metadata for each analysis,
  • the path where the related files are stored,
  • the naming rules (file prefix and tags) to automatically locate those files.

The bundle itself is not imported into Elasticsearch. Instead, gnl-import uses it to build filename patterns (regular expressions with named groups), retrieve matching files, and import both the files' contents and the associated metadata into the database. Because the regex is built from tags, the metadata can always be re-extracted from the filenames if needed.

Analysis

An analysis is the central unit described in a bundle.
It combines:

  • a set of metadata fields (sample, source, reference genome, etc.),
  • one wet lab and one bioinformatics process,
  • a data_path where files are located,
  • and a file_prefix that defines the naming pattern of all files belonging to the analysis.

The analysis acts as a template: it tells Genelastic how to find the right files and how to attach them to the correct metadata.

File prefix and tags

A file prefix is a naming template made of tags, each tag representing a metadata field. When processing a bundle, Genelastic replaces tags with their values to build a filename pattern, a regular expression used to automatically retrieve files in data_path.

By default, tags use % as the start delimiter and "" (an empty string) as the end delimiter. Both delimiters can be overridden if needed.

For example, the following tags are all valid:

  • %S uses default start delimiter (%) and default end delimiter (""),
  • %S% uses default start delimiter (%) and custom end delimiter (%),
  • $S$ uses custom start delimiter ($) and custom end delimiter ($).

However, not all characters are allowed as delimiters (see Tags).

Default tags provided by Genelastic:

  • %S => sample_name
  • %F => source
  • %W => wet_process
  • %B => bi_process
  • %D => cov_depth
  • %A => barcode
  • %R => reference_genome

Custom tags can also be defined.

Example

Suppose the bundle defines the following analysis:

---
- analyses:
  - file_prefix: "%S_%F_%W_%B_%D_%R_rep-1"
    data_path: "/data/"
    sample_name: "HG002"  # %S
    source: "CNRGH"  # %F
    wet_process: "novaseqxplus-25b"  # %W
    bi_process: "dragen-4123"  # %B
    cov_depth: 30  # %D
    reference_genome: "hg38"  # %R
# ...

Genelastic expands the tags and automatically retrieves the matching files:

- HG002_CNRGH_novaseqxplus-25b_dragen-4123_30_hg38_rep-1.cov
- HG002_CNRGH_novaseqxplus-25b_dragen-4123_30_hg38_rep-1.vcf.gz
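
As an illustration of the expansion described above, here is a minimal Python sketch. It is not Genelastic's actual implementation: the function name build_filename_pattern and the trailing "rest" group are hypothetical, and only the default single-letter tags with the default % delimiter are handled.

```python
import re

# Default tag -> metadata field mapping, as listed in this section.
DEFAULT_TAGS = {
    "S": "sample_name",
    "F": "source",
    "W": "wet_process",
    "B": "bi_process",
    "D": "cov_depth",
    "A": "barcode",
    "R": "reference_genome",
}

# Hypothetical tag syntax: default start delimiter, one uppercase letter.
TAG_RE = re.compile(r"%([A-Z])")


def build_filename_pattern(file_prefix: str, metadata: dict) -> re.Pattern:
    """Turn a file prefix into a regex with one named group per tag."""
    pieces = []
    # Splitting on a capturing group alternates literal text and tag names.
    for index, part in enumerate(TAG_RE.split(file_prefix)):
        if index % 2 == 0:
            pieces.append(re.escape(part))  # literal part of the prefix
        else:
            field = DEFAULT_TAGS[part]
            value = re.escape(str(metadata[field]))  # KeyError if undefined
            pieces.append(f"(?P<{field}>{value})")
    # Whatever follows the prefix (suffix, extension) is left open here.
    return re.compile("".join(pieces) + r"(?P<rest>.*)")


metadata = {
    "sample_name": "HG002",
    "source": "CNRGH",
    "wet_process": "novaseqxplus-25b",
    "bi_process": "dragen-4123",
    "cov_depth": 30,
    "reference_genome": "hg38",
}
pattern = build_filename_pattern("%S_%F_%W_%B_%D_%R_rep-1", metadata)
```

Because every tag becomes a named group, the metadata values can be read back from any matching filename with match.groupdict().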

Data files

A data file is any file belonging to an analysis that is imported into the database. Each data file has a type and an extension.

Supported raw data file types:

  • VCF (gzipped or not): type = vcf, ext = vcf,
  • Coverage in TSV format: type = cov, ext = cov.

Supported metrics data file types:

  • QC: type = qc, ext = yml / yaml,
  • SV (gzipped or not): type = sv, ext = json,
  • Smallvar (gzipped or not): type = smallvar, ext = json.

During import, Genelastic automatically handles gzipped files. Each line of each data file is parsed and transformed into a document, enriched with the analysis metadata, and indexed into the Elasticsearch index corresponding to the file type.
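
The following Python sketch illustrates the idea for a line-oriented file such as a coverage TSV. It is only an assumption of how such a step could look: the helper names and the document layout ("analysis" and "line" keys) are hypothetical and do not reflect the actual Elasticsearch mapping used by Genelastic.

```python
import gzip
from pathlib import Path
from typing import Iterator

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def open_maybe_gzipped(path: Path):
    """Open a text file, decompressing it transparently if it is gzipped."""
    with path.open("rb") as handle:
        is_gzip = handle.read(2) == GZIP_MAGIC
    return gzip.open(path, "rt") if is_gzip else path.open("rt")


def iter_documents(path: Path, metadata: dict) -> Iterator[dict]:
    """Yield one document per non-empty line, enriched with the metadata."""
    with open_maybe_gzipped(path) as lines:
        for line in lines:
            line = line.rstrip("\n")
            if line:
                # Hypothetical document layout, for illustration only.
                yield {"analysis": metadata, "line": line}
```

Detecting compression from the file content rather than from the extension keeps the sketch independent of how the files are named.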

Processes

An analysis can also reference:

  • a wet lab process (sequencing metadata),
  • and a bioinformatics process (pipeline metadata).

These processes provide contextual metadata about how the data was generated (wet lab) and processed (bioinformatics). They enable filtering and comparison of analyses based on production and pipeline characteristics.

Benefits

Using bundles brings several advantages:

  • Metadata is defined once per analysis, not repeated for each file,
  • Files are retrieved automatically using filename patterns built from file prefixes, avoiding manual lists,
  • Metadata and file contents are imported consistently into Elasticsearch,
  • Analyses remain reproducible and traceable thanks to regex-based matching.

IV - Bundle file definition

Bundle

Attributes:

  • version (int): required, version of the bundle. Currently, Genelastic only supports version 3, which is specified in this document,
  • analyses (List[Analysis]): optional, list of analyses to import,
  • wet_processes (List[WetProcess]): optional, wet lab process metadata,
  • bi_processes (List[BiProcess]): optional, bioinformatics process metadata,
  • tags (Tags): optional, custom tags used in the file prefix.

Bundle example:

---
version: 3
analyses:
  - # First analysis definition
  - # Second analysis definition
  - # etc...
wet_processes:
  - # First wet lab process definition.
  - # etc...
bi_processes:
  - # First bioinformatics process definition.
  - # etc...
tags:
  # Tags definition.

Analysis

Attributes:

  • file_prefix (str): required, file prefix to identify analysis files. The file prefix describes the naming convention of the analysis data files using tags. It represents the fixed part of the filename and therefore must not include regular expressions. To account for variable filename parts, use the suffix attribute,
  • suffix (str): optional, suffix appended to the file prefix to match data files with varying filename suffixes (default: ""),
  • data_path (str): optional, path to the directory where analysis files are stored. If it is relative, the full path is resolved relative to the bundle file location (default: bundle file location),
  • wet_process (str): optional, identifier of the wet lab process used for the analysis,
  • bi_process (str): optional, identifier of the bioinformatics process used for the analysis,
  • sample_name (str): optional, metadata field to define sample name,
  • source (str): optional, metadata field to define source,
  • barcode (str): optional, metadata field to define barcode,
  • reference_genome (str): optional, metadata field to define reference genome,
  • flowcell (str): optional, metadata field to define flowcell,
  • lanes (List[int]): optional, metadata field to define lanes,
  • seq_indices (List[str]): optional, metadata field to define sequencing indices,
  • cov_depth (int): optional, metadata field to define coverage depth,
  • qc_comment (str): optional, metadata field to define quality control comment.

Analysis example:

---
- analyses:
  - file_prefix: "%S_%F_%W_%B_%D_%R_rep-1"
    suffix: "_(?P<type>sv|smallvar|qc)"
    data_path: "/data/"
    sample_name: "HG002"  # %S
    source: "CNRGH"  # %F
    wet_process: "novaseqxplus-25b"  # %W
    bi_process: "dragen-4123"  # %B
    cov_depth: 30  # %D
    reference_genome: "hg38"  # %R

WetProcess

Attributes:

  • proc_id (str): required, identifier of the wet lab process,
  • manufacturer (str): required, sequencer manufacturer,
  • sequencer (str): required, sequencer model,
  • generic_kit (str): required, generic kit name,
  • fragmentation (int): required, fragment size (bp),
  • reads_size (int): required, reads size,
  • input_type (str): required, input type,
  • amplification (str): required, amplification method,
  • flowcell_type (str): required, flowcell type,
  • sequencing_type (str): required, sequencing type,
  • desc (str): optional, description of the wet lab process,
  • library_kit (str): optional, library kit name,
  • sequencing_kit (str): optional, sequencing kit name,
  • error_rate_expected (float): optional, expected error rate.

WetProcess example:

---
wet_processes:
  - proc_id: "novaseqxplus-25b"
    manufacturer: "illumina"
    sequencer: "novaseqxplus"
    generic_kit: "truseq-illumina"
    fragmentation: 350
    reads_size: 300
    input_type: "gdna"
    amplification: "pcr-free"
    flowcell_type: "25b"
    sequencing_type: "wgs"

BiProcess

Attributes:

  • proc_id (str): required, identifier of the bioinformatics process,
  • name (str): required, name of the bioinformatics process,
  • pipeline_version (str): required, version of the bioinformatics pipeline,
  • sequencing_type (str): required, sequencing type,
  • steps (List[BiProcess.Step]): optional, list of steps in the bioinformatics pipeline,
  • desc (str): optional, description of the bioinformatics process.

BiProcess.Step

Attributes:

  • name (str): required, name of the step,
  • cmd (str): required, command used in the step,
  • version (str): optional, version of the command,
  • output (str): optional, output of the command.

BiProcess example:

---
bi_processes:
  - proc_id: "dragen-4123"
    name: dragen
    pipeline_version: "4.1.2.3"
    steps:
      - {name: basecalling, cmd: bclconvert, version: "3.9.3.2"}
      - {name: trimming, cmd: dragen}
      - {name: mapping, cmd: dragmap}
      - {name: postmapping, cmd: dragen, version: "4.1.23"}
      - {name: smallvarcalling, cmd: dragen, version: "4.1.23"}
      - {name: svcalling, cmd: dragen, version: "4.1.23"}
      - {name: secondary_qc, cmd: dragen, version: "4.1.23"}
    sequencing_type: "wgs"
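
Since an analysis refers to its processes by identifier (wet_process and bi_process against proc_id), a quick consistency check can be useful before importing. The Python sketch below works on a bundle already parsed into a dictionary. It assumes that the referenced processes are declared in the same bundle, which this document does not state as a requirement, and the function name is hypothetical.

```python
def find_unknown_process_refs(bundle: dict) -> list[str]:
    """List analyses whose process identifiers match no declared proc_id."""
    wet_ids = {proc["proc_id"] for proc in bundle.get("wet_processes", [])}
    bi_ids = {proc["proc_id"] for proc in bundle.get("bi_processes", [])}
    problems = []
    for index, analysis in enumerate(bundle.get("analyses", [])):
        for key, known in (("wet_process", wet_ids), ("bi_process", bi_ids)):
            ref = analysis.get(key)  # both attributes are optional
            if ref is not None and ref not in known:
                problems.append(
                    f"analyses[{index}].{key}: unknown proc_id {ref!r}"
                )
    return problems
```

An empty list means that every referenced identifier was found among the declared processes.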

Tags

Attributes:

  • delimiter (Tags.Delimiter): optional, defines the special characters used to delimit tags within a file prefix. Each tag is identified by a start and an optional end delimiter, surrounding a tag name that maps to a metadata field,
  • match (Dict[str, Tags.Match]): optional, custom tags definition. Keys are the tag names, and values are the corresponding tag definitions. A tag name must contain at least one alphanumeric character (a-z, A-Z or 0-9).

Tags.Delimiter

Attributes:

  • start (str): optional, character marking the beginning of a tag. It must be one special character, excluding the following: (, ), ?, <, > (default: %),
  • end (str): optional, character marking the end of a tag. It must be one special character, excluding the following: (, ), ?, <, >. If omitted or empty, the tag ends immediately after the tag name (default: "", no explicit end delimiter).

Tags.Match

Attributes:

  • field (str): required, metadata field name associated with the tag,
  • regex (str): required, regular expression to match the expected metadata value in the filename.

Tags example:

---
tags:
  delimiter:
    start: "#"
    end: "#"
  match:
    Z:
      field: custom_field
      regex: "[^_-]+"
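
The delimiter rules given in Tags.Delimiter can be sketched as a small Python check. This is an illustration only: "special character" is interpreted here as any single character that is neither alphanumeric nor whitespace, which is an assumption, and the function name is hypothetical.

```python
# Characters explicitly excluded by the Tags.Delimiter description.
FORBIDDEN_DELIMITERS = set("()?<>")


def is_valid_delimiter(value: str, *, allow_empty: bool = False) -> bool:
    """Check a start or end delimiter against the documented rules."""
    if value == "":
        # Only the end delimiter may be empty (its default value).
        return allow_empty
    return (
        len(value) == 1
        and not value.isalnum()   # assumed meaning of "special character"
        and not value.isspace()
        and value not in FORBIDDEN_DELIMITERS
    )
```

With this reading, the "#" delimiters of the example above are accepted, while "(" or a two-character string such as "##" are rejected.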

V - Scripts usage

Genelastic provides the following scripts:

  • gnl-data: Create a bundle with randomly generated analyses, metadata, processes, and data files for testing,
  • gnl-validate: Standalone script that statically validates YAML bundles. Useful locally or in CI pipelines to ensure bundles follow the expected schema before integrating them into a repository,
  • gnl-import: Import bundles to an Elasticsearch database,
  • gnl-info: Query information about genetic data already imported,
  • gnl-integrity: Check the integrity of previously imported data.

Import behavior

Among these, gnl-import is the central script. It starts by statically validating the YAML bundles, as gnl-validate does. Then, it parses the bundles' data files and imports the resulting documents (records built from the files and metadata) into the Elasticsearch database.

Dry-run

gnl-import can run in "dry-run" mode, letting you check what would happen without touching Elasticsearch.

  • -D: parse the files, build the documents, but stop before import. Useful to check that the right files are selected and can be read correctly.
  • -DD: only list the files matching your file prefix, without parsing or import. Useful to check that your file_prefix is correct.

Single-match vs multi-match

By default, gnl-import runs in single-match mode:

  • Every tag used in the file_prefix must have its corresponding metadata field explicitly defined.
  • For example, if the file prefix contains %S, then the field sample_name must be provided in the analysis. Otherwise, an error is raised.
  • Only files that exactly match the declared metadata are imported.

This mode is best suited for environments where data files belonging to a single analysis are grouped together in a dedicated directory.

With the option --multi-match, the behavior changes:

  • Undefined metadata fields are tolerated,
  • When a tag has no defined value, it is replaced by its corresponding regular expression. For example, %S normally maps to sample_name. If this metadata field is not defined, %S expands to its default regex ([^_]+), which matches any non-underscore sequence. Users can override these defaults in the bundle (via Tags.Match).
  • All files matching the defined metadata and regex expansions are collected,
  • Each unique combination of values found in the filenames results in a new analysis with its own ID.

This mode is best suited for environments where multiple analyses share a single directory of data files. It is more flexible than single-match, but also carries the risk of importing more files than expected if regex patterns are too permissive. Carefully check matches in dry-run mode (-D / -DD) before importing.

Example:

---
- analyses:
  - file_prefix: "%S_%F_%W_%B_%D_%R_rep-1"
    data_path: "/data/"
    # sample_name (%S) is omitted
    source: "CNRGH"  # %F
    wet_process: "novaseqxplus-25b"  # %W
    bi_process: "dragen-4123"  # %B
    cov_depth: 30  # %D
    reference_genome: "hg38"  # %R

In multi-match mode, the %S tag expands to its regex ([^_]+), so all sample names are accepted. The following files are matched:

- HG002_CNRGH_novaseqxplus-25b_dragen-4123_30_hg38_rep-1.vcf
- HG003_CNRGH_novaseqxplus-25b_dragen-4123_30_hg38_rep-1.vcf
- HG004_CNRGH_novaseqxplus-25b_dragen-4123_30_hg38_rep-1.vcf

Three analyses are automatically created, each with its own sample_name derived from the filename:

  • HG002_CNRGH_novaseqxplus-25b_dragen-4123_30_hg38_rep-1,
  • HG003_CNRGH_novaseqxplus-25b_dragen-4123_30_hg38_rep-1,
  • HG004_CNRGH_novaseqxplus-25b_dragen-4123_30_hg38_rep-1
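
The grouping step can be pictured with the following Python sketch. The pattern is written by hand to stand in for the expanded file prefix of this example, and the function name group_by_analysis is hypothetical; Genelastic's own implementation may differ.

```python
import re
from collections import defaultdict

# Hand-written stand-in for the expanded prefix: %S has no defined value,
# so it is replaced by its default regex ([^_]+).
pattern = re.compile(
    r"(?P<sample_name>[^_]+)_CNRGH_novaseqxplus-25b_dragen-4123_30_hg38_rep-1"
)


def group_by_analysis(
    pattern: re.Pattern, filenames: list[str]
) -> dict[tuple, list[str]]:
    """Group filenames by the combination of values captured by the tags."""
    groups = defaultdict(list)
    for name in filenames:
        match = pattern.match(name)
        if match is None:
            continue  # file does not belong to this analysis template
        # One key per unique combination of captured metadata values.
        key = tuple(sorted(match.groupdict().items()))
        groups[key].append(name)
    return dict(groups)
```

Each key of the returned dictionary corresponds to one analysis, and several files sharing the same captured values (for instance a .vcf and a .cov for HG002) end up in the same group.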

Metrics

In addition to raw data files (.vcf, .cov), Genelastic also supports metrics data files.

A metrics data file is identified by the .metrics suffix, which appears before the file extension. This suffix can optionally include metadata about the tools used to generate the metrics and their versions.

The format for this metadata is:

.metrics_<TOOL>-<VERSION>[_<TOOL>-<VERSION>...]

  • Multiple tools can be listed, separated by underscores (_),
  • Each version must contain at least one number and is separated from the tool name by a hyphen (-).

For raw files, their type is the same as their extension.

For metrics files, however, multiple metrics types may share the same extension (e.g. .json). To distinguish them, the suffix attribute in the analysis definition must specify a regex that extracts the type from the filename. If no suffix is defined, Genelastic will raise an error.

Example:

---
- analyses:
  - file_prefix: "%S_%F_%W_%B_%D_%R_rep-1"
    suffix: "_(?P<type>sv|smallvar|qc)"
    data_path: "/data/"
    sample_name: "HG002"  # %S
    source: "CNRGH"  # %F
    wet_process: "novaseqxplus-25b"  # %W
    bi_process: "dragen-4123"  # %B
    cov_depth: 30  # %D
    reference_genome: "hg38"  # %R

This will match files such as:

- HG002_CNRGH_novaseqxplus-25b_dragen-4123_30_hg38_rep-1_smallvar.vcf
- HG002_CNRGH_novaseqxplus-25b_dragen-4123_30_hg38_rep-1_smallvar.cov
- HG002_CNRGH_novaseqxplus-25b_dragen-4123_30_hg38_rep-1_smallvar.metrics_happy-2-0-0_giab-3-0-0.json
- HG002_CNRGH_novaseqxplus-25b_dragen-4123_30_hg38_rep-1_qc.metrics.json

In this example, the analysis includes both raw data files and metrics files. The value of the metrics field in the imported document depends on the file type:

  • Non-metrics files (e.g. VCF, Coverage):

    {
      "metrics": null
    }
    
  • Metrics files without tool metadata (e.g. QC):

    {
      "metrics": []
    }
    
  • Metrics files with tool metadata (e.g. Smallvar):

    {
      "metrics": [
        {
          "tool": "happy",
          "version": "2.0.0"
        },
        {
          "tool": "giab",
          "version": "3.0.0"
        }
      ]
    }
    

VI - Servers usage

Genelastic includes two servers: an API server and a UI server. The UI does not communicate directly with Elasticsearch: it always goes through the API server, which acts as a gateway and provides HTTP endpoints.

API server

The API server needs to know how to connect to Elasticsearch. Configure it through the following environment variables:

  • GENAPI_ES_URL: URL of the Elasticsearch server,
  • GENAPI_ES_ENCODED_API_KEY: Encoded API key,
  • GENAPI_ES_INDEX_PREFIX: Prefix to identify indices of interest,
  • GENAPI_ES_CERT_FP: Certificate fingerprint of the Elasticsearch server.
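
Before starting the server, it can be handy to check that these variables are set. The short Python sketch below does so. It assumes that all four variables are mandatory, which this document does not state explicitly, and the function name is hypothetical.

```python
import os
from collections.abc import Mapping

# Variable names taken from the list above.
API_SETTINGS = (
    "GENAPI_ES_URL",
    "GENAPI_ES_ENCODED_API_KEY",
    "GENAPI_ES_INDEX_PREFIX",
    "GENAPI_ES_CERT_FP",
)


def missing_api_settings(environ: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of the settings that are unset or empty."""
    return [name for name in API_SETTINGS if not environ.get(name)]
```

Calling missing_api_settings() with no argument inspects the current process environment and returns an empty list when everything is defined.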

Start the API server in development mode:

gnl-start-api dev

UI server

The UI server only needs the address of the API server:

  • GENUI_API_URL: URL of the API server.

Start the UI server in development mode:

gnl-start-ui dev

Development vs production

Both gnl-start-api and gnl-start-ui support two modes:

  • dev: development mode (hot reload, debug logs, not optimized),
  • prod: production mode (optimized build, suitable for deployment).

VII - For developers

Prerequisites

  • python >= 3.11
  • uv >= 0.9
  • make

Installation

To install development dependencies, run the following commands:

python -m venv .venv
source .venv/bin/activate
make

Pre-commit hooks setup

This project uses pre-commit to manage Git hook scripts. To install the project hooks, run:

pre-commit install

After that, each commit will succeed only if all hooks (defined in .pre-commit-config.yaml) pass.

If necessary (though not recommended), you can skip these hooks by using the --no-verify / -n option when committing:

git commit -m "My commit message" --no-verify # This commit will not run installed hooks.
