vcf-to-duckdb

Convert a VCF file to a DuckDB database

These details have not been verified by PyPI

Project description

vcf-to-duckdb

This is a Python package to convert a VCF (Variant Call Format) file to a DuckDB database (exported as Parquet files and accompanying SQL schema).

Features

Processes large VCF files efficiently using multithreaded batch processing.
Infers data types from VCF headers.
Parses and separates data in compound INFO fields (e.g. from VEP, SnpEFF, etc.).
URL-decodes specified fields and detects fields still needing decoding.

Installation

Install the required system dependencies:
- pyenv
- Poetry
- bcftools
Install the required Python version (developed with 3.12.3, but other 3.12+ versions should work):
```
pyenv install "$(cat .python-version)"
```
Confirm that python maps to the correct version:
```
python --version
```
Set the Poetry interpreter and install the Python dependencies:
```
poetry env use "$(pyenv which python)"
poetry install
```

A requirements.txt file is also available and kept in sync with Poetry dependencies in case you don't want to use Poetry, or you can use vcf-to-duckdb via docker: docker pull dmccabe606/vcf-to-duckdb:latest.

Usage

from pathlib import Path
from vcf_to_duckdb.convert_utils import convert

convert(
    vcf_path=Path("input.vcf.gz"),
    tab_path=Path("tmp.tsv"),
    db_path=Path("tmp.db"),
    parquet_dir_path=Path("output_parquet"),
    multiallelics=False,
    compound_info_field_map={
        "ANN": [{"split": "\\s*\\|\\s*"}],
        "CSQ": [{"split": "\\s*\\|\\s*"}],
        "LOF": [
            {"split": ","},
            {"extract": "^\\((.+)\\)$"},
            {"split": "\\s*\\|\\s*"},
        ],
        "NMD": [
            {"split": ","},
            {"extract": "^\\((.+)\\)$"},
            {"split": "\\s*\\|\\s*"},
        ],
    },
    url_encoded_col_name_regexes=[
        "^civic_desc$",
        "^clndn$",
        "^clnvi$",
        "^oc_",
        "^oncokb_muteff$",
        "^oncokb_oncogenic$",
        "^hgnc",
    ],
)

Or using the CLI and the provided example config.json:

python -m vcf_to_duckdb convert \
    --vcf="input.vcf.gz" \
    --tab="tmp.tsv" \
    --db="tmp.duckdb" \
    --parquet-dir="output_parquet" \
    --no-multiallelics \
    --config="config.json"

Database Schema

The resulting DuckDB database (and exported Parquet files) have the following schema:

-- value and info metadata (a structured form of the VCF header)
CREATE TABLE val_info_types (
    id VARCHAR,
    number VARCHAR,
    "type" VARCHAR,
    description VARCHAR,
    kind VARCHAR,
    has_children BOOLEAN,
    parent_id VARCHAR,
    ix UBIGINT,
    col_def VARCHAR,
    is_list BOOLEAN,
    v_col_name VARCHAR,
    id_snake VARCHAR
);

-- variants and their top-level fields
CREATE TABLE variants (
    vid UINTEGER PRIMARY KEY,
    chrom VARCHAR NOT NULL,
    pos UINTEGER NOT NULL,
    id VARCHAR,
    "ref" VARCHAR,
    alt VARCHAR,
    qual VARCHAR,
    filters VARCHAR[]
);

-- the variants' values and info field data
CREATE TABLE vals_info (
    vid UINTEGER,
    kind VARCHAR,
    k VARCHAR NOT NULL,
    v_boolean BOOLEAN,
    v_varchar VARCHAR,
    v_integer INTEGER,
    v_float FLOAT,
    v_json JSON,
    v_boolean_arr BOOLEAN[],
    v_varchar_arr VARCHAR[],
    v_integer_arr INTEGER[],
    v_float_arr FLOAT[],
    v_json_arr JSON[],
    FOREIGN KEY (vid) REFERENCES variants (vid)
);

Both values (i.e. data specified by the FORMAT column in the VCF) and INFO fields (e.g. annotations) are loaded into vals_info, where k is the key/name of the value and the value is present in exactly one of the v_* columns (depending on its type). Essentially, the schema treats the data in the VCF file as a typed key-value store.

Typing

The converter automatically maps VCF data types to appropriate DuckDB types:

Integer → INTEGER
Float → FLOAT
String → VARCHAR
Character → VARCHAR
Flag → BOOLEAN

To handle fields that the header indicates contain multiple values (i.e. Number=.), these values are stored in columns like v_varchar_arr instead of v_varchar.

INFO fields identified as pipe-delimited compound fields via the compound-info-field argument are stored as JSON columns. The keys of the objects are identified by parsing the field's description in the header, e.g.

##INFO=<ID=LOF,Number=.,Type=String,Description="Predicted loss of function effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'">

will result in data like

[
  {
    "gene_name": "PTEN",
    "gene_id": "123",
    // etc.
  },
  // etc.
]

being loaded into a row's v_json_arr column.

Querying

The schema above was inspired by ELT data warehouses, and was thus designed to be universal and be able to store data from any VCF file, as opposed to a more explicit "wide" schema where values and INFO field data are populated directly into columns. This makes creating the database fast and reliable, but defers the responsibility of understanding and extracting the data to downstream tools.

So, after creating the database, it might be helpful to use CTEs and database views to make querying easier. For instance, to simulate a wide table containing variants' top-level values (defined in the FORMAT column of the VCF), one could define these views:

import duckdb

with duckdb.connect("tmp.duckdb") as db:
    db.sql("IMPORT DATABASE './output_parquet'")

    # separate `vals_info` into views for values and INFO
    db.sql("""
        CREATE OR REPLACE VIEW vals AS (
            SELECT
                *
            FROM
                vals_info
            WHERE
                kind = 'val'
        )
    """)
    
    db.sql("""
        CREATE OR REPLACE VIEW info AS (
            SELECT
                *
            FROM
                vals_info
            WHERE
                kind = 'info'
        )
    """)
    
    # make a wide view for all top-level values
    db.sql("""
        CREATE OR REPLACE VIEW vals_wide AS (
            SELECT
                DISTINCT vals.vid,
                t_ad.ad[1] AS ref_count,
                t_ad.ad[2] AS alt_count,
                t_af.af,
                t_dp.dp,
                t_gt.gt,
                t_ps.ps
            FROM
                vals
    
            LEFT OUTER JOIN (
                SELECT
                    vid,
                    v_integer_arr AS ad
                FROM
                    vals
                WHERE
                    k = 'ad'
            ) t_ad ON vals.vid = t_ad.vid
    
            LEFT OUTER JOIN (
                SELECT
                    vid,
                    v_float AS af
                FROM
                    vals
                WHERE
                    k = 'af'
            ) t_af ON vals.vid = t_af.vid
    
            LEFT OUTER JOIN (
                SELECT
                    vid,
                    v_integer AS dp
                FROM
                    vals
                WHERE
                    k = 'dp'
            ) t_dp ON vals.vid = t_dp.vid
    
            LEFT OUTER JOIN (
                SELECT
                    vid,
                    v_varchar AS gt
                FROM
                    vals
                WHERE
                    k = 'gt'
            ) t_gt ON vals.vid = t_gt.vid
    
            LEFT OUTER JOIN (
                SELECT
                    vid,
                    v_integer AS ps
                FROM
                    vals
                WHERE
                    k = 'ps'
            ) t_ps ON vals.vid = t_ps.vid
        )
    """)

Limitations

No test suite (yet)
This isn't tested on VCFs containing structural variants and is unlikely to work as expected.
The JSON objects used to store compound info fields use strings for all values, since the VCF header contains no information about these types. Tools like VEP arguably abuse the VCF spec by storing data like foo|14.1|bar in single fields, but there's nothing we can do about this.
This package doesn't currently handle multi-allelic fields, so it's recommended to pre-process the VCF with bcftools norm --multiallelics=- and supply the --no-multiallelics argument here.
Which column (v_boolean, v_varchar_arr, etc.) each kind of data is stored in is determined by what's in the VCF header, which might not be reliable.

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

0.5.1

Mar 11, 2026

This version

0.4.0

Oct 7, 2025

0.3.0

May 14, 2025

0.2.1

Jan 2, 2025

0.2.0 yanked

Jan 2, 2025

Reason this release was yanked:

no readme yet

0.1.0 yanked

Dec 3, 2024

Reason this release was yanked:

no readme yet

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vcf_to_duckdb-0.4.0.tar.gz (17.3 kB view details)

Uploaded Oct 7, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

vcf_to_duckdb-0.4.0-py3-none-any.whl (15.9 kB view details)

Uploaded Oct 7, 2025 Python 3

File details

Details for the file vcf_to_duckdb-0.4.0.tar.gz.

File metadata

Download URL: vcf_to_duckdb-0.4.0.tar.gz
Upload date: Oct 7, 2025
Size: 17.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.8.5 CPython/3.12.9 Darwin/24.6.0

File hashes

Hashes for vcf_to_duckdb-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`02bdc14685a03c3f0d6226faea500e54fb2c6d2b17b728cac38433f9e5cf1749`
MD5	`41c7058de18722917f82264ccd6d3a74`
BLAKE2b-256	`dd4400e56adad576b7cddaa7b365887dc0d9a2d1945379f883a38fc94e21387c`

See more details on using hashes here.

File details

Details for the file vcf_to_duckdb-0.4.0-py3-none-any.whl.

File metadata

Download URL: vcf_to_duckdb-0.4.0-py3-none-any.whl
Upload date: Oct 7, 2025
Size: 15.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.8.5 CPython/3.12.9 Darwin/24.6.0

File hashes

Hashes for vcf_to_duckdb-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7ca4480d1b12fa049cdeaa2aea1d782e42cd2a41c5637f31b9233cd6f0fc82ea`
MD5	`75d2fe02bd5ac108ed771d7996d67577`
BLAKE2b-256	`ce7f097c5e5dc9b19073c882b41f97aad144ab202c5f28babb6146a5c7ba3c14`

See more details on using hashes here.

vcf-to-duckdb 0.4.0

Navigation

Verified details

Maintainers

Unverified details

Meta

Classifiers

Project description

vcf-to-duckdb

Features

Installation

Usage

Database Schema

Typing

Querying

Limitations

Project details

Verified details

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes