
Load OMOP Athena vocabulary exports into DuckDB

OMOP Athena → DuckDB

athena2duckdb converts an extracted OMOP vocabulary download from OHDSI Athena into a ready-to-query DuckDB file with typed CDM tables, primary keys, and automatic row-count validation. The loader ingests the standard OMOP vocab files (CONCEPT.csv, VOCABULARY.csv, etc.) and skips auxiliary exports like CONCEPT_CPT4.csv or README.txt by default.
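The file selection described above can be pictured with a small standard-library sketch. It is not the package's actual discovery code: the table names are copied from the sample output in this README, while the function name `discover_vocab_files` and the matching rule (case-insensitive file stem, `.csv` or `.tsv` suffix) are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

# Table names taken from the sample output in this README. Treating this set
# as the complete list of recognised files is an assumption.
STANDARD_TABLES = {
    "concept", "concept_ancestor", "concept_class", "concept_relationship",
    "concept_synonym", "domain", "drug_strength", "source_to_concept_map",
    "relationship", "vocabulary",
}


def discover_vocab_files(input_dir: Path) -> dict[str, Path]:
    """Map table name -> path for recognised vocabulary files only (hypothetical helper)."""
    found = {}
    for path in sorted(input_dir.iterdir()):
        table = path.stem.lower()
        is_delimited = path.suffix.lower() in {".csv", ".tsv"}
        if path.is_file() and is_delimited and table in STANDARD_TABLES:
            found[table] = path
    return found


def demo() -> list[str]:
    """Create a throwaway export directory and list the tables that would be loaded."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name in ("CONCEPT.csv", "VOCABULARY.csv", "CONCEPT_CPT4.csv", "README.txt"):
            (root / name).write_text("", encoding="utf-8")
        return sorted(discover_vocab_files(root))
```

In this sketch `CONCEPT_CPT4.csv` is skipped because its stem is not a standard table name, and `README.txt` is skipped because of its suffix.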

Features

  • Discovers standard OMOP vocabulary files such as CONCEPT.csv, VOCABULARY.csv, CONCEPT_RELATIONSHIP.csv, and more.
  • Streams each file into DuckDB using read_csv with quoting/escaping disabled, preventing parse failures caused by embedded quotes or backslashes, while the CLI shows a live progress bar per table.
  • Loads recognised vocab files into typed tables (INTEGER, DATE, VARCHAR) that match the CDM DDL with primary keys already enforced (secondary indexes can be added later if needed).
  • After every load, verifies that each table's row count matches its source file.
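To see why disabling quoting matters, the sketch below parses the same tab-separated text twice with Python's standard `csv` module. This is only an analogy for the DuckDB `read_csv` behaviour described above, not the loader's code, and the sample rows are made up.

```python
import csv
import io

# Made-up sample: the concept_name on the first data row opens a quote
# that is never closed.
SAMPLE = (
    "concept_id\tconcept_name\tdomain_id\n"
    '1\t"Acute MI\tCondition\n'
    "2\tAspirin\tDrug\n"
)


def parse(text: str, quoting: int) -> list[list[str]]:
    """Parse tab-separated text with the given csv quoting mode."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=quoting)
    return list(reader)


# Quote-aware parsing: the unbalanced quote swallows the rest of the input
# into one field, so the second data row is lost.
quote_aware = parse(SAMPLE, csv.QUOTE_MINIMAL)

# Quoting disabled: the quote character is ordinary data and every line
# becomes its own record.
literal = parse(SAMPLE, csv.QUOTE_NONE)

print(len(quote_aware), len(literal))
```

With quoting disabled, the stray quote character is kept as part of the `concept_name` value rather than being interpreted.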

Installation

From PyPI:

pip install athena2duckdb

For local development:

uv sync

Or build a wheel and install it from the project root:

uv build
pip install dist/athena2duckdb-*.whl

CLI Usage

uv run athena2duckdb /path/to/athena-export --verbose

Arguments:

Flag          Description
input_dir     Directory that contains the Athena CSV/TSV files.
-o, --out     Output DuckDB database file (default omop_vocab.duckdb).
--sep         Field delimiter (default tab).
--encoding    Source file encoding (default UTF-8).
--threads     Number of DuckDB threads to use.
--schema      DuckDB schema name for created tables (default main).
--overwrite   Replace an existing DuckDB file if present.
--verbose     Emit INFO-level logs during the load.
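For readers who want to see how the documented defaults combine, here is an `argparse` sketch that mirrors the table above. It is a reconstruction from the table only, not the package's real parser: the argument types, the `None` default for `--threads`, and the name `build_parser` are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that mirrors the documented arguments and defaults."""
    parser = argparse.ArgumentParser(prog="athena2duckdb")
    parser.add_argument("input_dir", help="Directory that contains the Athena CSV/TSV files.")
    parser.add_argument("-o", "--out", default="omop_vocab.duckdb", help="Output DuckDB database file.")
    parser.add_argument("--sep", default="\t", help="Field delimiter.")
    parser.add_argument("--encoding", default="UTF-8", help="Source file encoding.")
    # The README gives no default thread count; None is an assumption.
    parser.add_argument("--threads", type=int, default=None, help="Number of DuckDB threads to use.")
    parser.add_argument("--schema", default="main", help="DuckDB schema name for created tables.")
    parser.add_argument("--overwrite", action="store_true", help="Replace an existing DuckDB file.")
    parser.add_argument("--verbose", action="store_true", help="Emit INFO-level logs during the load.")
    return parser


# Only the options given on the command line change; the rest keep their defaults.
args = build_parser().parse_args(["data/", "--schema", "cdm", "--overwrite"])
print(args.out, repr(args.sep), args.schema, args.overwrite)
```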

Example

uv run athena2duckdb data/

Sample output:

Loaded 10 tables into omop_vocab.duckdb.
Tables: concept, concept_ancestor, concept_class, concept_relationship, concept_synonym,
domain, drug_strength, source_to_concept_map, relationship, vocabulary
OK        table=concept                  csv_rows=93,547 table_rows=93,547
...
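The kind of check behind each status line can be sketched as follows. This is an illustration under stated assumptions, not the package's verification code: `sqlite3` from the standard library stands in for DuckDB so that the sketch runs without extra dependencies, the helper names are hypothetical, the `MISMATCH` label is invented (only `OK` appears in the sample output), and the sample rows are made up.

```python
import csv
import io
import sqlite3


def count_data_rows(text: str, delimiter: str = "\t") -> int:
    """Count records after the header line, with quoting disabled."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header row
    return sum(1 for row in reader if row)


def check_table(con: sqlite3.Connection, table: str, source_text: str) -> str:
    """Compare the source row count with the table row count (hypothetical helper)."""
    csv_rows = count_data_rows(source_text)
    table_rows = con.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    status = "OK" if csv_rows == table_rows else "MISMATCH"  # "MISMATCH" is an invented label
    return f"{status} table={table} csv_rows={csv_rows:,} table_rows={table_rows:,}"


SOURCE = "vocabulary_id\tvocabulary_name\nSNOMED\tSNOMED CT\nRxNorm\tRxNorm\n"

# SQLite is used here purely as a stand-in database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE vocabulary (vocabulary_id TEXT PRIMARY KEY, vocabulary_name TEXT)")
con.executemany(
    "INSERT INTO vocabulary VALUES (?, ?)",
    [("SNOMED", "SNOMED CT"), ("RxNorm", "RxNorm")],
)
report = check_table(con, "vocabulary", SOURCE)
print(report)
```

Counting rows on both sides is a cheap completeness check; it does not compare cell values.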

Programmatic API

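from pathlib import Path
from athena2duckdb import CSVOptions, load_vocab_dir, verify_row_counts

# Load the extracted Athena export in data/ into omop_vocab.duckdb, creating tables in the "cdm" schema.
summary = load_vocab_dir(Path("data"), Path("omop_vocab.duckdb"), schema="cdm")
# Re-check that each loaded table's row count matches its source file.
results = verify_row_counts(summary.db_path, summary.vocab_files, schema=summary.schema)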

Testing

uv run pytest

Releasing

See RELEASING.md.

License

This project is licensed under the MIT License. See LICENSE.


