
Embedding extension for OMOP CDM. Uses sqlite-vec by default, with an optional pgvector backend and optional FAISS export.


omop-emb

Vector embedding layer for OMOP CDM concepts.

omop-emb generates, stores, and retrieves embeddings for OMOP concepts. It works out of the box with sqlite-vec (no external database required) and scales to PostgreSQL/pgvector for larger deployments. The database is the source of truth — FAISS is an optional read-acceleration sidecar, not a primary store.

Installation

pip install omop-emb                         # sqlite-vec backend (default, no extras needed)
pip install "omop-emb[pgvector]"             # adds PostgreSQL/pgvector support
pip install "omop-emb[faiss-cpu]"            # adds FAISS sidecar support
pip install "omop-emb[pgvector,faiss-cpu]"   # everything

Quick start

Ingest concepts (sqlite-vec, no external service):

export OMOP_EMB_BACKEND=sqlitevec
export OMOP_EMB_SQLITE_PATH=/data/omop_emb.db
export OMOP_CDM_DB_URL=postgresql+psycopg://user:pass@host:5432/omop_cdm

omop-emb embeddings add-embeddings --api-base http://localhost:11434/v1 --api-key ollama \
    --provider ollama --model nomic-embed-text:v1.5
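
The --api-base value above points at an OpenAI-compatible endpoint (Ollama serves one under /v1). As a rough sketch of the kind of request a client sends to such an endpoint, assuming the usual POST {api_base}/embeddings shape with a JSON body containing model and input: the helper name build_embedding_request is hypothetical, and this is not taken from omop-emb's own code.

```python
import json
import urllib.request


def build_embedding_request(api_base, api_key, model, texts):
    """Build (but do not send) an OpenAI-style embeddings request.

    Hypothetical helper for illustration only; omop-emb may construct
    its requests differently.
    """
    url = api_base.rstrip("/") + "/embeddings"
    body = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it; that step is left out here so the sketch runs without a live Ollama instance.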

Search:

omop-emb embeddings search --api-base http://localhost:11434/v1 --api-key ollama \
    --provider ollama --model nomic-embed-text:v1.5 \
    --query "hypertension" --query "type 2 diabetes" \
    --standard-only --domain Condition --k 5
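
Conceptually, a search with --k 5 returns the five stored concept embeddings closest to the query embedding. The following is a minimal brute-force sketch of that idea, not omop-emb's implementation. It assumes cosine similarity as the metric (this README only names cosine in the pgvector index example), and the function names and toy vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, candidates, k=5):
    """Return the k (concept_id, score) pairs most similar to `query`.

    `candidates` maps a concept id to its embedding vector.
    """
    scored = [(cid, cosine_similarity(query, vec)) for cid, vec in candidates.items()]
    # Highest similarity first; ties keep insertion order (sort is stable).
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A real backend would also apply the concept filters (such as --standard-only and --domain) and would normally use an index rather than scanning every vector.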

pgvector with HNSW index:

export OMOP_EMB_BACKEND=pgvector
export OMOP_EMB_DB_HOST=localhost
export OMOP_EMB_DB_USER=omop_emb
export OMOP_EMB_DB_PASSWORD=omop_emb
export OMOP_EMB_DB_NAME=omop_emb

omop-emb embeddings add-embeddings --api-base http://localhost:11434/v1 --api-key ollama \
    --provider ollama --model nomic-embed-text:v1.5
omop-emb maintenance rebuild-index --model nomic-embed-text:v1.5 --index-type hnsw --metric-type cosine
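
For orientation, pgvector creates an HNSW index with a CREATE INDEX ... USING hnsw statement and an operator class that matches the distance metric. The sketch below only builds such a statement as a string. The table and column names are placeholders, the metric keys other than cosine are assumptions, and the README does not say which SQL rebuild-index actually issues.

```python
# Operator classes documented by pgvector for each distance metric.
PGVECTOR_OPCLASSES = {
    "cosine": "vector_cosine_ops",
    "l2": "vector_l2_ops",
    "ip": "vector_ip_ops",
}


def hnsw_index_sql(table, column, metric="cosine", m=16, ef_construction=64):
    """Return a CREATE INDEX statement for a pgvector HNSW index.

    `table` and `column` are placeholders; the names omop-emb uses are
    not given in this README. Identifiers are interpolated directly, so
    pass only trusted values.
    """
    opclass = PGVECTOR_OPCLASSES[metric]
    return (
        f"CREATE INDEX IF NOT EXISTS {table}_{column}_hnsw_idx "
        f"ON {table} USING hnsw ({column} {opclass}) "
        f"WITH (m = {m}, ef_construction = {ef_construction});"
    )
```

The m and ef_construction values shown are pgvector's documented defaults; larger values generally trade build time and memory for recall.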

Environment variables

Variable                  Default    Description
OMOP_EMB_BACKEND          sqlitevec  Backend: sqlitevec or pgvector.
OMOP_EMB_SQLITE_PATH      -          sqlite-vec database file path (or :memory:).
OMOP_EMB_DB_HOST          -          pgvector: PostgreSQL host.
OMOP_EMB_DB_PORT          5432       pgvector: PostgreSQL port.
OMOP_EMB_DB_USER          -          pgvector: database user.
OMOP_EMB_DB_PASSWORD      -          pgvector: database password.
OMOP_EMB_DB_NAME          -          pgvector: database name.
OMOP_EMB_DB_URL           -          pgvector: full SQLAlchemy URL (overrides individual vars).
OMOP_CDM_DB_URL           -          OMOP CDM connection (required for ingestion commands only).
OMOP_EMB_FAISS_CACHE_DIR  -          Default FAISS cache directory (alternative to --faiss-cache-dir).
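
The precedence described for the pgvector variables (a full OMOP_EMB_DB_URL wins over the individual settings, and the port defaults to 5432) can be sketched as follows. This is an illustration, not the package's actual resolution logic: the function name resolve_pgvector_url is hypothetical, and the postgresql+psycopg driver prefix is an assumption borrowed from the OMOP_CDM_DB_URL example.

```python
import os
from urllib.parse import quote_plus


def resolve_pgvector_url(env=None, driver="postgresql+psycopg"):
    """Resolve a SQLAlchemy URL from OMOP_EMB_* variables.

    Illustrative only: omop-emb's real resolution logic may differ.
    """
    env = os.environ if env is None else env
    full_url = env.get("OMOP_EMB_DB_URL")
    if full_url:
        return full_url  # the full URL overrides the individual variables
    # Percent-encode credentials so characters like "@" do not break the URL.
    user = quote_plus(env["OMOP_EMB_DB_USER"])
    password = quote_plus(env["OMOP_EMB_DB_PASSWORD"])
    host = env["OMOP_EMB_DB_HOST"]
    port = env.get("OMOP_EMB_DB_PORT", "5432")  # documented default
    name = env["OMOP_EMB_DB_NAME"]
    return f"{driver}://{user}:{password}@{host}:{port}/{name}"
```

A missing required variable raises KeyError in this sketch, which makes a misconfigured environment fail early rather than producing a malformed URL.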

See the Configuration Reference for the complete list including asymmetric embedding prefixes and driver overrides.

Documentation

Full documentation: https://AustralianCancerDataNetwork.github.io/omop-emb

Roadmap

  • sqlite-vec backend (default, zero-config)
  • pgvector backend (PostgreSQL)
  • HNSW index support for pgvector
  • FAISS sidecar (approximate nearest-neighbour read acceleration)
  • Embedding bundle export / import CLI (maintenance export, maintenance import, maintenance build-faiss-cache)
  • In-DB concept filtering (domain, vocabulary, standard status, active status)
  • Transparent FAISS fast path in EmbeddingReaderInterface
  • Extensive backend and registry testing
  • FAISS GPU support
  • pgvectorscale support
  • Vector quantisation for more efficient storage

Configuration via oa-configurator

The database connection can also be configured via oa-configurator, which stores settings in ~/.config/omop/config.toml and eliminates the need for environment variables at runtime:

omop-config init
omop-config configure omop_alchemy   # CDM database (required for ingestion)
omop-config configure omop_emb       # embedding database

omop-config configure omop_emb is required for local-dev setup before running the pgvector-backed test suite (CI provisions this automatically). Without it, those tests skip with "Resource 'test_emb_db' not configured" rather than failing.

See oa-configurator Setup for details.


Docker Compose

The included docker-compose.yaml provides both a CDM PostgreSQL database and a pgvector embedding database, plus a Python container with all optional backends pre-installed ([pgvector,faiss-cpu]). Default credentials work out of the box:

docker compose up

Include Ollama by adding the standalone profile:

docker compose --profile standalone up

The python-emb service runs omop-config configure at startup. To override the default credentials, copy the example environment file, edit the values in the copy, and start the stack:

cp .env.example .env
docker compose up
