Load OMOP Athena vocabulary exports into DuckDB
OMOP Athena → DuckDB
athena2duckdb converts an extracted OMOP vocabulary download from
OHDSI Athena into a ready-to-query DuckDB file with
typed CDM tables, primary keys, and automatic row-count validation. The loader
ingests the standard OMOP vocab files (CONCEPT.csv, VOCABULARY.csv, etc.)
and skips auxiliary exports like CONCEPT_CPT4.csv or README.txt by default.
Features
- Discovers standard OMOP vocabulary files such as CONCEPT.csv, VOCABULARY.csv, CONCEPT_RELATIONSHIP.csv, and more.
- Streams each file into DuckDB using read_csv with quoting/escaping disabled, preventing parse failures caused by embedded quotes or backslashes, while the CLI shows a live progress bar per table.
- Loads recognised vocab files into typed tables (INTEGER, DATE, VARCHAR) that match the CDM DDL with primary keys already enforced (secondary indexes can be added later if needed).
- Always performs row-count verification to ensure the database matches source files.
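The sketch below illustrates why disabling quoting avoids parse failures. It uses Python's standard csv module as a stand-in for DuckDB's read_csv, so it shows the idea rather than athena2duckdb's actual implementation, and the sample rows are invented for the example. With quoting enabled, a concept name that starts with a stray double quote swallows the following row. With quoting disabled, the quote is kept as a literal character.

```python
import csv
import io

# Invented tab-separated sample in the style of CONCEPT.csv.
# The first data row contains an unbalanced double quote.
raw = 'concept_id\tconcept_name\n1\t"Left arm\n2\tRight arm\n'


def parse(text: str, quoting: int) -> list[list[str]]:
    """Parse tab-separated text with the given csv quoting mode."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=quoting)
    return list(reader)


# Quoting enabled: the stray quote opens a quoted field that runs to end of input,
# so the second data row is merged into the first.
quoted = parse(raw, csv.QUOTE_MINIMAL)

# Quoting disabled: every line is one record and the quote is ordinary data.
literal = parse(raw, csv.QUOTE_NONE)

print(len(quoted), len(literal))
print(literal[1])
```

The trade-off is that fields cannot contain the delimiter itself, which is acceptable only if the export never quotes its values.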
Installation
From PyPI:
pip install athena2duckdb
For local development:
uv sync
or build/install directly from the project root:
uv build
pip install dist/athena2duckdb-*.whl
CLI Usage
uv run athena2duckdb /path/to/athena-export --verbose
Arguments:
| Flag | Description |
|---|---|
| input_dir | Directory that contains the Athena CSV/TSV files. |
| -o, --out | Output DuckDB database file (default omop_vocab.duckdb). |
| --sep | Field delimiter (default tab). |
| --encoding | Source file encoding (default UTF-8). |
| --threads | Number of DuckDB threads to use. |
| --schema | DuckDB schema name for created tables (default main). |
| --overwrite | Replace an existing DuckDB file if present. |
| --verbose | Emit INFO-level logs during the load. |
Example
uv run athena2duckdb data/
Sample output:
Loaded 10 tables into omop_vocab.duckdb.
Tables: concept, concept_ancestor, concept_class, concept_relationship, concept_synonym,
domain, drug_strength, source_to_concept_map, relationship, vocabulary
OK table=concept csv_rows=93,547 table_rows=93,547
...
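As a rough illustration of what the row-count check compares, the sketch below counts data rows in a source file and formats a result line like the sample output. It is not the package's code: count_data_rows and format_check are hypothetical helpers, the "MISMATCH" label is an assumption (the sample only shows the OK case), and it assumes one record per line, which holds when quoting is disabled.

```python
from pathlib import Path


def count_data_rows(path: Path, encoding: str = "utf-8") -> int:
    """Count non-empty lines after the header, assuming one record per line."""
    with path.open("r", encoding=encoding) as handle:
        next(handle, None)  # skip the header row
        return sum(1 for line in handle if line.strip())


def format_check(table: str, csv_rows: int, table_rows: int) -> str:
    """Format one verification line in the style of the sample output."""
    # "MISMATCH" is a hypothetical label; only "OK" appears in the sample output.
    status = "OK" if csv_rows == table_rows else "MISMATCH"
    return f"{status} table={table} csv_rows={csv_rows:,} table_rows={table_rows:,}"
```

In practice the table_rows value would come from a SELECT COUNT(*) against the loaded table, so a difference between the two numbers points to rows dropped or split during parsing.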
Programmatic API
from pathlib import Path
from athena2duckdb import CSVOptions, load_vocab_dir, verify_row_counts
summary = load_vocab_dir(Path("data"), Path("omop_vocab.duckdb"), schema="cdm")
results = verify_row_counts(summary.db_path, summary.vocab_files, schema=summary.schema)
Testing
uv run pytest
Releasing
See RELEASING.md.
License
This project is licensed under the MIT License. See LICENSE.