Minimal indexing and inference toolkit for terminology mapping.

THIRAWAT Mapper

Terminology Harmonization using Late-Interaction Reranker With Alignment-tuned Transformers

Prerequisites

Environment

  1. Python 3.10+
  2. uv

OHDSI Standard Concepts

  1. Request and download the standard concepts in csv format from https://athena.ohdsi.org/
  2. Convert the csv files into a DuckDB database using sidataplus/athena2duckdb

Models

Fine-tuned reranker models are hosted on Hugging Face.

Pre-built indexes will be made available soon.

Install from PyPI

pip install thirawat-mapper
# or (recommended for global CLI installs)
pipx install thirawat-mapper

thirawat --help

Command mapping:

  • thirawat index build ... → python -m thirawat_mapper.index.build ...
  • thirawat infer bulk ... → python -m thirawat_mapper.infer.bulk ...
  • thirawat infer query ... → python -m thirawat_mapper.infer.query ...

1. Build a LanceDB Index

thirawat index build \
  --duckdb data/derived/concepts.duckdb \
  --profiles-table concept_profiles \
  --concepts-table concept \
  --domain-id Drug \
  --concept-class-id "Clinical Drug,Quant Clinical Drug,Clinical Drug Comp,Clinical Drug Form,Ingredient" \
  --exclude-concept-class-id "Clinical Drug Box,Branded Drug Box,Branded Pack Box,Clinical Pack Box,Marketed Product,Quant Branded Box,Quant Clinical Box" \
  --extra-column "concept_name,domain_id,vocabulary_id,concept_class_id" \
  --out-db data/lancedb/db \
  --table concepts_drug \
  --batch-size 256 \
  --device cuda

Key options:

  • --duckdb - DuckDB file produced by sidataplus/athena2duckdb.
  • --profiles-table - Preferred table containing concept_id and profile_text. If the table is missing, the builder falls back to generating profiles inline from concept (and concept_synonym when available).
  • --concepts-table - OMOP concept table (defaults to concept). The builder always joins to this table and keeps only standard, valid concepts (standard_concept = 'S' AND invalid_reason IS NULL).
  • --domain-id, --concept-class-id - Optional filters; accept comma-separated lists or repeated flags.
  • --exclude-concept-class-id - Exclude specific classes (comma-separated or repeat flag). Default empty; recommended exclusions: Clinical Drug Box, Branded Drug Box, Branded Pack Box, Clinical Pack Box, Marketed Product, Quant Branded Box, Quant Clinical Box.
  • --extra-column - Carry additional columns from the profiles table into LanceDB (repeat flag).
  • --max-synonyms - Number of synonyms appended when inline profile generation is used.
  • --include-codes-in-text - Include concept_code in generated inline profile text.
  • --model-id, --pooling, --max-length - Encoder controls for building the index vectors (also written into the index manifest for inference defaults).
  • --out-db / --table - Target LanceDB directory and table name.

If your Athena-to-DuckDB file does not contain a concept_profiles table, the command still works via inline profile generation:

thirawat index build \
  --duckdb data/derived/concepts.duckdb \
  --profiles-table concept_profiles \
  --concepts-table concept \
  --out-db data/lancedb/db \
  --table concepts_drug \
  --max-synonyms 3 \
  --include-codes-in-text

Device matrix:

  • index build --device: explicit cuda|mps|cpu. If omitted, the encoder uses cuda when available, otherwise cpu.
  • infer bulk/query --device: auto|cuda|mps|cpu (default cpu for stability; auto prefers cuda, then mps, then cpu).
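
The auto preference order for inference can be sketched as below. This is a minimal illustration, not the package's code: resolve_device is a hypothetical helper, and the availability flags are passed in as plain booleans (in practice they would typically come from torch.cuda.is_available() and torch.backends.mps.is_available()).

```python
def resolve_device(requested: str, cuda_available: bool, mps_available: bool) -> str:
    """Resolve a --device value following the documented inference rules.

    Hypothetical helper: explicit choices pass through unchanged, while
    "auto" prefers cuda, then mps, then cpu.
    """
    allowed = {"auto", "cuda", "mps", "cpu"}
    if requested not in allowed:
        raise ValueError(f"unsupported device: {requested!r}")
    if requested != "auto":
        return requested
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# On an Apple Silicon machine without CUDA, auto resolves to mps.
print(resolve_device("auto", cuda_available=False, mps_available=True))
```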

Apple Silicon example:

thirawat index build \
  --duckdb data/derived/concepts.duckdb \
  --profiles-table concept_profiles \
  --concepts-table concept \
  --out-db data/lancedb/db \
  --table concepts_drug \
  --device mps

The command will:

  1. Load profiles (and apply filters if provided).
  2. Normalize profile_text and embed it with SapBERT (via transformers; pooling is configurable).
  3. Write a LanceDB table where vector is a FixedSizeList<float32>[768] column.
  4. Emit a <table>_manifest.json manifest describing the build (model id, filters, counts).

2. Interactive Query (REPL)

thirawat infer query \
  --db data/lancedb/db \
  --table concepts_drug \
  --device cpu \
  --reranker-id sidataplus/THIRAWAT-BioLORD  # optional override; defaults to sidataplus/THIRAWAT-SapBERT

Type a query and press Enter to see the post-scored top results:

query> amoxicillin clavulanate 875 mg
concept_id   | score  | s_sim | name
--------------------------------------------------------------------------------
123456       | 0.841  | 0.990 | Amoxicillin / Clavulanate 875 MG Oral Tablet
...

Commands:

  • Type :q, :quit, or :exit to leave.
  • Use --candidate-topk to change the candidate pool and --show-topk to limit display rows.
  • --reranker-id works here too if you want to test a local or alternative reranker in the REPL.

3. Bulk Inference

export TOKENIZERS_PARALLELISM=false

thirawat infer bulk \
  --db data/lancedb/db \
  --table concepts_drug \
  --input data/usagi.csv \
  --out runs/mapping \
  --candidate-topk 200 \
  --n-limit 20 \
  --device cuda

Add --reranker-id to point at a different reranker checkpoint. The flag accepts either a Hugging Face model ID or a local path, e.g. --reranker-id models/nde_biolord.

Input formats: CSV, TSV, Parquet, or Excel. By default the CLI expects the following columns (override via flags):

  • sourceName (required)
  • sourceCode (optional)
  • conceptId (optional ground truth)
  • mappingStatus (used for Usagi detection)

When the input already follows the Usagi CSV schema (see data/eval/tmt_to_rxnorm.csv), the CLI validates a sample of rows through a Pydantic schema and surfaces a clear error if the structure is invalid. Otherwise, it synthesizes a minimal Usagi row per record so downstream exports stay consistent.
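
As a rough picture of what a synthesized Usagi row might look like, the sketch below fills only the fields this README names (sourceName, sourceCode, conceptId, conceptName, domainId, matchScore, mappingStatus, statusSetBy, mappingType). The function name, the shape of the candidate dict, and its keys are assumptions for illustration; a real Usagi export carries more columns than shown here.

```python
from typing import Any, Dict, Optional


def synthesize_usagi_row(
    source_name: str,
    source_code: Optional[str] = None,
    top_candidate: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build a minimal Usagi-style row (hypothetical helper, partial schema)."""
    row: Dict[str, Any] = {
        "sourceCode": source_code or "",
        "sourceName": source_name,
        # Candidate fields stay blank when no candidate is available.
        "conceptId": "",
        "conceptName": "",
        "domainId": "",
        "matchScore": "",
        # Fixed values the README documents for results_usagi.csv.
        "mappingStatus": "UNCHECKED",
        "statusSetBy": "THIRAWAT-mapper",
        "mappingType": "MAPS_TO",
    }
    if top_candidate is not None:
        # The candidate keys below are assumed names, not a documented schema.
        row["conceptId"] = top_candidate.get("concept_id", "")
        row["conceptName"] = top_candidate.get("concept_name", "")
        row["domainId"] = top_candidate.get("domain_id", "")
        row["matchScore"] = top_candidate.get("score", "")
    return row


row = synthesize_usagi_row("amoxicillin clavulanate 875 mg")
print(row["mappingStatus"], row["statusSetBy"], row["mappingType"])
```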

Selected flags:

  • --source-name-column, --source-code-column - Override input headers.
  • --label-column - Column containing gold concept IDs (optional, default conceptId).
  • --status-column, --approved-value - Configure Usagi approval detection.
  • --batch-size - Query embedding batch size (increase for better GPU throughput).
  • --n-limit - Limit to the first N rows (smoke runs).
  • --where - Optional LanceDB filter, e.g., vocabulary_id = 'RxNorm' AND concept_class_id != 'Ingredient' (when those columns exist in the index).
  • --device - auto|cuda|mps|cpu (default cpu for stability; use auto to prefer cuda, then mps, then cpu).
  • --encoder-model-id, --encoder-pooling, --encoder-max-length - Override the query encoder used for retrieval (defaults to the index manifest when present).
  • --post-mode - Post-score behavior: blend|tiebreak|lex (default tiebreak).
  • --post-weight - Blend weight (only when --post-mode blend, default 0.05).
  • --tiebreak-eps, --tiebreak-topn - Controls near-tie grouping for --post-mode tiebreak.
  • --brand-strict - For bracketed brand queries, drop brand-mismatched candidates when possible.
  • --inn2usan/--no-inn2usan - Normalize INN/BAN drug names to USAN during inference (default enabled).
  • --atc-scope - Boost candidates matching per-row atc_ids/atc_codes (requires --vocab or a DuckDB path in the index manifest).
  • --reranker-id - Override the default reranker (sidataplus/THIRAWAT-SapBERT) with another HF model ID or a local directory/filename. Relative paths are resolved to absolute paths so you can pass models/nde_biolord.

Deterministic post-ranking modes

--post-mode controls how post features influence ranking:

  • tiebreak (default): keeps the ML relevance ordering globally, but reorders only near-tied candidates (gap <= --tiebreak-eps) within the first --tiebreak-topn rows.
  • lex: full lexicographic sort by relevance + post features across all rows.
  • blend: computes a weighted final score.

For blend, the score is:

final_score = (1 - post_weight) * relevance + post_weight * post_score
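
A short worked example of the blend formula, using the documented default weight of 0.05. The function name and the sample scores are illustrative only.

```python
def blend_score(relevance: float, post_score: float, post_weight: float = 0.05) -> float:
    """final_score = (1 - post_weight) * relevance + post_weight * post_score."""
    if not 0.0 <= post_weight <= 1.0:
        raise ValueError("post_weight must be within [0, 1]")
    return (1.0 - post_weight) * relevance + post_weight * post_score


# With the default weight, a perfect post score nudges 0.80 up to 0.81.
print(round(blend_score(0.80, 1.00), 4))

# A weight of 0.0 leaves the relevance score untouched.
print(blend_score(0.80, 1.00, post_weight=0.0))
```

Because the default weight is small, post features shift the final score only slightly; the reranker's relevance still dominates.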

For lex and tiebreak, tie-break keys are applied in this deterministic order (descending):

  1. brand_strength_exact
  2. top20_strength_form_exact
  3. brand_score
  4. rerank_top20
  5. strength_exact
  6. strength_sim
  7. form_route_score
  8. release_score

Pipeline steps per row:

  1. Build query text (sourceName with sourceCode appended in parentheses when present).
  2. Embed with SapBERT.
  3. Vector search (cosine) against the LanceDB table to gather --candidate-topk entries.
  4. Rerank with the THIRAWAT reranker. In this beta, candidate retrieval is vector-only (no FTS/BM25/hybrid).
  5. Apply post-scoring per --post-mode (default tiebreak: only reorders within near-ties of the ML score). Disable post-scoring via --post-mode blend --post-weight 0.0.
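
Step 1, combined with the normalization described elsewhere in this README (lower-casing and whitespace collapsing), might look like the sketch below. The function names are hypothetical, the source code value is made up, and applying normalization after appending the code is an assumption.

```python
import re
from typing import Optional


def normalize_text(text: str) -> str:
    """Lower-case and collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip().lower()


def build_query_text(source_name: str, source_code: Optional[str] = None) -> str:
    """Append sourceCode in parentheses when present, then normalize."""
    text = source_name
    if source_code:
        text = f"{text} ({source_code})"
    return normalize_text(text)


# "TMT123" is a made-up code used purely for illustration.
print(build_query_text("  Amoxicillin   Clavulanate 875 MG ", "TMT123"))
# amoxicillin clavulanate 875 mg (tmt123)
```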

Outputs (written to --out):

  • results.csv - Classic relabel layout (wide, block-per-query). Columns: leading rank 1..K, then for each query three adjacent columns [match_rank_or_unmatched, source_concept_name, source_concept_code] with K rows beneath. Non-Usagi inputs preserve the original row order; Usagi inputs continue to sort matched rows first so reviewers can focus on confirmed gold IDs.
  • results_with_input.csv - Original input row with candidate columns appended.
  • results_usagi.csv - Always emitted. Each processed row is coerced into the Usagi schema (using the sample in data/eval/tmt_to_rxnorm.csv as ground truth). The top candidate populates conceptId, conceptName, domainId, and matchScore when available; otherwise those fields remain blank. Every row is marked mappingStatus=UNCHECKED, statusSetBy=THIRAWAT-mapper, mappingType=MAPS_TO so reviewers can import the file directly into Usagi even when the source sheet was not originally in that format.
  • metrics.json - When ground-truth IDs are available (either via conceptId or Usagi rows with mappingStatus == APPROVED) the file reports Hit@{1,2,5,10,20,50,100}, MRR@100, coverage, and counts.
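
For readers unfamiliar with the ranking metrics in metrics.json, here is a minimal sketch of Hit@k and MRR@100 using their standard definitions. The function names and input layout are hypothetical, and coverage is left out because this README does not define how it is computed.

```python
from typing import Dict, List, Optional, Sequence

HIT_KS = (1, 2, 5, 10, 20, 50, 100)


def rank_of_gold(ranked_ids: Sequence[int], gold_id: int) -> Optional[int]:
    """Return the 1-based rank of the gold concept, or None if it is absent."""
    for rank, concept_id in enumerate(ranked_ids, start=1):
        if concept_id == gold_id:
            return rank
    return None


def compute_metrics(predictions: List[Sequence[int]], gold: List[int]) -> Dict[str, float]:
    """Compute Hit@k and MRR@100 over rows that have a gold concept ID."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and aligned")
    ranks = [rank_of_gold(p, g) for p, g in zip(predictions, gold)]
    n = len(ranks)
    metrics: Dict[str, float] = {}
    for k in HIT_KS:
        metrics[f"hit@{k}"] = sum(1 for r in ranks if r is not None and r <= k) / n
    # Reciprocal rank counts as 0 when the gold ID is missing or ranked below 100.
    metrics["mrr@100"] = sum(1.0 / r for r in ranks if r is not None and r <= 100) / n
    return metrics


# Three toy queries: gold found at rank 1, at rank 2, and not found at all.
predictions = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
gold = [10, 50, 99]
metrics = compute_metrics(predictions, gold)
print(metrics["mrr@100"])  # 0.5
```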

LLM-assisted RAG reranking

Bulk inference can optionally send the top reranked candidates to an LLM for tie-breaking or abstention logic. Enable this flow with --rag-provider and supply provider-specific flags. The CLI saves every prompt/response pair to rag_prompts.md under the chosen --out directory so you can audit exactly what was sent.

LLM output must be structured JSON with a concept_ids array, e.g. {"concept_ids":[123,456,789]}. If a provider returns invalid JSON for a query, that query falls back to the non-LLM ranking and logs an error.
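
The structured-output contract and the fallback can be sketched as below. Only the {"concept_ids": [...]} shape and the fall-back-on-invalid-JSON behavior come from this README; the function names, the handling of IDs that were never offered to the LLM, and the way unmentioned candidates are appended are assumptions.

```python
import json
from typing import List, Optional, Sequence


def parse_concept_ids(raw: str) -> Optional[List[int]]:
    """Parse {"concept_ids": [...]}; return None when the payload is invalid."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    ids = payload.get("concept_ids")
    if not isinstance(ids, list):
        return None
    # bool is a subclass of int in Python, so reject it explicitly.
    if not all(isinstance(i, int) and not isinstance(i, bool) for i in ids):
        return None
    return ids


def apply_llm_order(ranked_ids: Sequence[int], raw: str) -> List[int]:
    """Reorder candidates by the LLM answer, or keep the non-LLM ranking."""
    llm_ids = parse_concept_ids(raw)
    if llm_ids is None:
        return list(ranked_ids)  # fallback: keep the reranker's order
    known = set(ranked_ids)
    chosen: List[int] = []
    for concept_id in llm_ids:
        # Assumption: ignore IDs that were never offered, and drop duplicates.
        if concept_id in known and concept_id not in chosen:
            chosen.append(concept_id)
    rest = [cid for cid in ranked_ids if cid not in chosen]
    return chosen + rest


print(apply_llm_order([123, 456, 789], '{"concept_ids":[456,123]}'))
# [456, 123, 789]
print(apply_llm_order([123, 456, 789], "not json"))
# [123, 456, 789]
```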

General RAG knobs:

--rag-provider {ollama,llamacpp,openrouter,cloudflare}
--rag-model MODEL_ID                # default openai/gpt-oss-20b
--rag-candidate-limit 50            # number of reranked candidates passed to the LLM
--rag-profile-char-limit 512        # truncate long profile_text snippets
--rag-include-retrieval-score/--no-rag-include-retrieval-score
--rag-include-final-score/--no-rag-include-final-score
--rag-extra-context-column COLUMN   # optional extra context column from the input sheet
--rag-stop-sequence TEXT (repeatable)
--rag-use-normalized-query/--no-rag-use-normalized-query

Tip: RAG is isolated to infer.bulk. The interactive REPL intentionally remains retrieval-only in this beta.

Ollama (local GGUF/chat server)

thirawat infer bulk \
  --db data/lancedb/db \
  --table concepts_drug \
  --input data/input/usagi.csv \
  --out runs/ollama_rag \
  --n-limit 100 \
  --rag-provider ollama \
  --ollama-base-url http://localhost:11434 \
  --ollama-model "gpt-oss:20b"

Ollama-specific flags:

--ollama-base-url URL          # default http://localhost:11434
--ollama-model MODEL_TAG       # defaults to --rag-model value
--ollama-timeout 120           # seconds
--ollama-keep-alive "5m"       # optional keep-alive hint sent to server

llama.cpp server (local HTTP API)

Use --rag-provider llamacpp only when a llama.cpp llama-server process is already running (default http://127.0.0.1:8080). Launch the server separately with your desired context and batching flags (for example: llama-server -hf ggml-org/gpt-oss-20b-GGUF --ctx-size 0 --jinja -ub 2048 -b 2048 -fa on). Point the CLI at that HTTP endpoint, not at GGUF files directly:

thirawat infer bulk \
  --db data/lancedb/db \
  --table concepts_drug \
  --input data/input/usagi.csv \
  --out runs/llamacpp_rag \
  --rag-provider llamacpp \
  --llamacpp-base-url http://127.0.0.1:8080 \
  --rag-model ggml-org/gpt-oss-20b-GGUF

llama.cpp flags:

--llamacpp-base-url URL          # default http://127.0.0.1:8080
--llamacpp-timeout 120           # HTTP timeout in seconds
--llamacpp-chat-format FORMAT    # e.g., qwen, llama
--llamacpp-system-prompt TEXT    # optional instruction prefix
--llamacpp-n-ctx 8192            # forwarded via query parameters when supported
--llamacpp-model-path /path/model.gguf   # fallback to llama-cpp-python bindings when no base URL is set

If you omit --llamacpp-base-url, the CLI falls back to the llama-cpp-python bindings and expects --llamacpp-model-path to point to a local GGUF file (plus any --llamacpp-n-* overrides). In that mode, the --rag-model flag is ignored and the GGUF file determines which model loads.

For all providers, the CLI logs each prompt/response pair and the parsed candidate ordering to rag_prompts.md in the --out directory for downstream review.

OpenRouter (hosted multi-model API)

export OPENROUTER_API_KEY=<YOUR_KEY>

thirawat infer bulk \
  --db data/lancedb/db \
  --table concepts_drug \
  --input data/input/usagi.csv \
  --out runs/openrouter_rag \
  --rag-provider openrouter \
  --rag-model openrouter/polaris-alpha

Set OPENROUTER_API_KEY in your environment; the CLI will refuse to call OpenRouter without it.

Cloudflare Workers AI (remote)

export CLOUDFLARE_ACCOUNT_ID=<ACCOUNT_ID>
export CLOUDFLARE_API_TOKEN=<API_TOKEN>

thirawat infer bulk \
  --db data/lancedb/db \
  --table concepts_drug \
  --input data/input/usagi.csv \
  --out runs/cf_rag \
  --n-limit 100 \
  --rag-provider cloudflare \
  --rag-model openai/gpt-oss-20b

Cloudflare-specific flags:

--cloudflare-base-url https://api.cloudflare.com/client/v4
--cloudflare-use-responses-api / --no-cloudflare-use-responses-api
--gpt-reasoning-effort {low,medium,high}
--cf-reasoning-summary {auto,concise,detailed}

Set CLOUDFLARE_ACCOUNT_ID and CLOUDFLARE_API_TOKEN in your environment before invoking the Cloudflare provider; the CLI reads only from those variables.

  • Models under @cf/openai/* (for example @cf/openai/gpt-oss-120b) use the Workers AI Responses API, so leave --cloudflare-use-responses-api enabled to send the prompt as an input payload.
  • Meta's @cf/meta/llama-4-* family is served via the /ai/run/<model> endpoint; pass --no-cloudflare-use-responses-api when targeting those models so the CLI emits the messages payload the endpoint expects.

Development

# 1. Install dependencies into a local virtual environment (creates .venv/)
uv sync

# 2. (Optional) Activate the environment for interactive shells
source .venv/bin/activate

# 3. Or just run commands directly via uv
uv run python -m thirawat_mapper.index.build --help

uv sync reads the project metadata and installs the required packages (PyTorch, LanceDB, transformers, etc.) against Python 3.10+. Subsequent uv run ... invocations reuse the same environment. Replace the paths in the examples to match your workspace. All text used for indexing and inference is normalized (lower-cased, whitespace collapsed) for stable matching.

Notes & Requirements

  • Vector-only retrieval + reranking (no FTS/BM25/hybrid in beta).
  • Text is normalized (lowercase + collapsed whitespace) for indexing and inference.
  • The reranker default is sidataplus/THIRAWAT-SapBERT. As verified on February 10, 2026 via the Hugging Face model API, this model is public (gated=false, private=false). If upstream access settings change later, authenticate with Hugging Face as needed.
  • LanceDB tables must expose a float32 fixed-size vector column (named vector when built with this CLI).
  • Index build keeps only standard, valid OMOP concepts (standard_concept='S' AND invalid_reason IS NULL).
  • This beta uses the transformers encoder path directly (no --backend st switch in this CLI).

Troubleshooting: SapBERT warning during startup

You may see this warning while loading SapBERT-related components:

No sentence-transformers model found with name cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR.

In this project that warning is often benign fallback behavior when loading through transformers/ColBERT wrappers. Treat it as an error only when model loading or inference actually fails (for example, a raised exception, process exit, or no embeddings produced).
