Embedding extension for OMOP CDM. Uses sqlite-vec by default, with an optional pgvector backend and optional FAISS export.

omop-emb

Vector embedding layer for OMOP CDM concepts.

omop-emb generates, stores, and retrieves embeddings for OMOP concepts. It works out of the box with sqlite-vec (no external database required) and scales to PostgreSQL/pgvector for larger deployments. The database is the source of truth; FAISS is an optional read-acceleration sidecar, not a primary store.

Installation

pip install omop-emb                         # sqlite-vec backend (default, no extras needed)
pip install "omop-emb[pgvector]"             # adds PostgreSQL/pgvector support
pip install "omop-emb[faiss-cpu]"            # adds FAISS sidecar support
pip install "omop-emb[pgvector,faiss-cpu]"   # everything
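
After installing, it can be useful to confirm which optional backends are importable in the current environment. The following is a minimal sketch, not part of omop-emb; the helper name available_backends is hypothetical, and the import names (sqlite_vec, pgvector, faiss) are assumptions based on the module names these packages normally install.

```python
from importlib.util import find_spec

# Assumed mapping from the install extra to the module it provides.
# These import names are assumptions, not taken from the omop-emb docs.
OPTIONAL_MODULES = {
    "sqlite-vec": "sqlite_vec",
    "pgvector": "pgvector",
    "faiss-cpu": "faiss",
}


def available_backends(modules=OPTIONAL_MODULES):
    """Return {extra name: bool} telling whether each module can be found."""
    return {
        extra: find_spec(module) is not None
        for extra, module in modules.items()
    }


if __name__ == "__main__":
    for extra, ok in available_backends().items():
        print(f"{extra:12s} {'installed' if ok else 'missing'}")
```

A "missing" entry only means the module was not found on the import path; it says nothing about whether a PostgreSQL server with the pgvector extension is reachable.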

Quick start

Ingest concepts (sqlite-vec, no external service):

export OMOP_EMB_BACKEND=sqlitevec
export OMOP_EMB_SQLITE_PATH=/data/omop_emb.db
export OMOP_CDM_DB_URL=postgresql+psycopg://user:pass@host:5432/omop_cdm

omop-emb embeddings add-embeddings --api-base http://localhost:11434/v1 --api-key ollama \
    --provider ollama --model nomic-embed-text:v1.5

Search:

omop-emb embeddings search --api-base http://localhost:11434/v1 --api-key ollama \
    --provider ollama --model nomic-embed-text:v1.5 \
    --query "hypertension" --query "type 2 diabetes" \
    --standard-only --domain Condition --k 5
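
The --api-base value above points at an OpenAI-compatible endpoint. As a rough illustration of what an embedding request to such an endpoint looks like, here is a standard-library sketch. It is an assumption about the wire format (POST to /embeddings with "model" and "input"), not a description of how omop-emb issues its requests; the function names are hypothetical.

```python
import json
import urllib.request


def build_embedding_request(api_base, api_key, model, texts):
    """Build (but do not send) an OpenAI-style embeddings request."""
    url = api_base.rstrip("/") + "/embeddings"
    body = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def embed(api_base, api_key, model, texts, timeout=30):
    """Send the request and return one vector per input text.

    Assumes the OpenAI-style response shape: {"data": [{"embedding": [...]}]}.
    """
    request = build_embedding_request(api_base, api_key, model, texts)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return [item["embedding"] for item in payload["data"]]
```

With a local Ollama server running, embed("http://localhost:11434/v1", "ollama", "nomic-embed-text:v1.5", ["hypertension"]) would be expected to return a single vector; the search command then compares such query vectors against the stored concept embeddings.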

pgvector with HNSW index:

export OMOP_EMB_BACKEND=pgvector
export OMOP_EMB_DB_HOST=localhost
export OMOP_EMB_DB_USER=omop_emb
export OMOP_EMB_DB_PASSWORD=omop_emb
export OMOP_EMB_DB_NAME=omop_emb

omop-emb embeddings add-embeddings --api-base http://localhost:11434/v1 --api-key ollama \
    --provider ollama --model nomic-embed-text:v1.5
omop-emb maintenance rebuild-index --model nomic-embed-text:v1.5 --index-type hnsw --metric-type cosine
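
The --metric-type cosine option selects cosine distance, i.e. one minus the cosine similarity of two vectors. An HNSW index approximates the ranking that an exact scan under this metric would give. The sketch below shows that exact ranking in pure Python; it is an illustration of the metric only, not the omop-emb or pgvector implementation, and the toy vectors and keys are made up.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 - (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query, vectors, k=5):
    """Exact nearest neighbours: score every vector, keep the k closest."""
    scored = sorted(
        (cosine_distance(query, vec), key) for key, vec in vectors.items()
    )
    return [(key, dist) for dist, key in scored[:k]]


if __name__ == "__main__":
    # Toy 2-dimensional vectors; real embeddings have hundreds of dimensions.
    toy = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    for key, dist in top_k([1.0, 0.0], toy, k=2):
        print(key, round(dist, 4))
```

The exact scan costs one distance computation per stored vector, which is why an approximate index becomes attractive as the number of embedded concepts grows.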

Environment variables

Variable                  Default    Description
OMOP_EMB_BACKEND          sqlitevec  Backend: sqlitevec or pgvector.
OMOP_EMB_SQLITE_PATH      -          sqlite-vec database file path (or :memory:).
OMOP_EMB_DB_HOST          -          pgvector: PostgreSQL host.
OMOP_EMB_DB_PORT          5432       pgvector: PostgreSQL port.
OMOP_EMB_DB_USER          -          pgvector: database user.
OMOP_EMB_DB_PASSWORD      -          pgvector: database password.
OMOP_EMB_DB_NAME          -          pgvector: database name.
OMOP_EMB_DB_URL           -          pgvector: full SQLAlchemy URL (overrides individual vars).
OMOP_CDM_DB_URL           -          OMOP CDM connection (required for ingestion commands only).
OMOP_EMB_FAISS_CACHE_DIR  -          Default FAISS cache directory (alternative to --faiss-cache-dir).
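
The precedence described for the pgvector variables (a full OMOP_EMB_DB_URL wins over the individual host, port, user, password, and name settings) can be sketched as follows. This is a hypothetical helper, not omop-emb's actual resolution code; in particular, the postgresql+psycopg driver prefix is an assumption borrowed from the OMOP_CDM_DB_URL example in the quick start.

```python
import os
from urllib.parse import quote

REQUIRED = (
    "OMOP_EMB_DB_HOST",
    "OMOP_EMB_DB_USER",
    "OMOP_EMB_DB_PASSWORD",
    "OMOP_EMB_DB_NAME",
)


def resolve_pgvector_url(env=None):
    """Return a SQLAlchemy-style URL for the embedding database."""
    env = os.environ if env is None else env

    # A full URL overrides the individual variables.
    full_url = env.get("OMOP_EMB_DB_URL")
    if full_url:
        return full_url

    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("missing settings: " + ", ".join(missing))

    port = env.get("OMOP_EMB_DB_PORT", "5432")  # documented default
    # Percent-encode credentials so characters such as '@' stay unambiguous.
    user = quote(env["OMOP_EMB_DB_USER"], safe="")
    password = quote(env["OMOP_EMB_DB_PASSWORD"], safe="")
    host = env["OMOP_EMB_DB_HOST"]
    name = env["OMOP_EMB_DB_NAME"]
    # Driver prefix is an assumption, not confirmed for omop-emb.
    return f"postgresql+psycopg://{user}:{password}@{host}:{port}/{name}"
```

Passing a plain dict instead of os.environ makes the precedence easy to check in isolation, for example when debugging which database a deployment will actually connect to.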

See the Configuration Reference for the complete list including asymmetric embedding prefixes and driver overrides.

Documentation

Full documentation: https://AustralianCancerDataNetwork.github.io/omop-emb

Roadmap

  • sqlite-vec backend (default, zero-config)
  • pgvector backend (PostgreSQL)
  • HNSW index support for pgvector
  • FAISS sidecar (approximate nearest-neighbour read acceleration)
  • Embedding bundle export / import CLI (maintenance export, maintenance import, maintenance build-faiss-cache)
  • In-DB concept filtering (domain, vocabulary, standard status, active status)
  • Transparent FAISS fast path in EmbeddingReaderInterface
  • Extensive backend and registry testing
  • FAISS GPU support
  • pgvectorscale support
  • Vector quantisation for more efficient storage

Configuration via oa-configurator

The database connection can also be configured via oa-configurator, which stores settings in ~/.config/omop/config.toml and eliminates the need for environment variables at runtime:

omop-config init
omop-config configure omop_alchemy   # CDM database (required for ingestion)
omop-config configure omop_emb       # embedding database

For local development, run omop-config configure omop_emb before running the pgvector-backed test suite (CI provisions this automatically). Without it, those tests are skipped with "Resource 'test_emb_db' not configured" rather than failing.

See oa-configurator Setup for details.


Docker Compose

The included docker-compose.yaml provides a CDM PostgreSQL database, a pgvector embedding database, and a Python container with all optional backends pre-installed ([pgvector,faiss-cpu]). Default credentials work out of the box:

docker compose up

Include Ollama by adding the standalone profile:

docker compose --profile standalone up

The python-emb service runs omop-config configure at startup. To override the default credentials, copy the example environment file, edit it, and start the stack:

cp .env.example .env
docker compose up

Download files

Source Distribution

omop_emb-1.1.0.tar.gz (212.3 kB)

Uploaded Source

Built Distribution

omop_emb-1.1.0-py3-none-any.whl (90.5 kB)

Uploaded Python 3

File details

Details for the file omop_emb-1.1.0.tar.gz.

File metadata

  • Download URL: omop_emb-1.1.0.tar.gz
  • Upload date:
  • Size: 212.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.11.25

File hashes

Hashes for omop_emb-1.1.0.tar.gz
Algorithm Hash digest
SHA256 97a69d8854b204408c8b6dd46d45ea975347cb5b9b30bd22f0001b9542bca4d5
MD5 93df5b25278f8b5a1b38b8ec9e58adfe
BLAKE2b-256 a5b3ea70788c73834e30774ef9c0986727a9b22f3e4c4cd8c966a217e2e64bb2
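
A downloaded archive can be checked against the SHA256 digest listed above before installing it. The following is a minimal standard-library sketch; the helper names and the local file path in the usage note are assumptions.

```python
import hashlib
import hmac

# SHA256 digest published above for omop_emb-1.1.0.tar.gz.
EXPECTED_SHA256 = (
    "97a69d8854b204408c8b6dd46d45ea975347cb5b9b30bd22f0001b9542bca4d5"
)


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Compare the file's digest with the expected hex string."""
    return hmac.compare_digest(sha256_of(path), expected.strip().lower())
```

For example, matches_expected("omop_emb-1.1.0.tar.gz") should return True for an intact download saved under that (assumed) name in the current directory. pip can enforce the same check at install time through hash-pinned requirements files.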

File details

Details for the file omop_emb-1.1.0-py3-none-any.whl.

File metadata

  • Download URL: omop_emb-1.1.0-py3-none-any.whl
  • Upload date:
  • Size: 90.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.11.25

File hashes

Hashes for omop_emb-1.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 ace850b8a05ae7ea2c319c80ae6bfe3061ae401521df38aef3c270274da220d0
MD5 080c9439b950e7a192a07a668ed42b29
BLAKE2b-256 0a5170ea39e3b0ec41ff7ca71d54cc09ef239ee6a04d33feb3cff2806ab084bb
