Skip to main content

A tool to convert activities into biomedical concepts

Project description

USDM Biomedical Concept Mapper

License: MIT Python Build Status

The first production-ready agent that maps biomedical concepts to the CDISC Unified Study Data Model using either commercial or open-source LLMs.

USDM Biomedical Concept Mapper is purpose-built for clinical teams who need accurate, explainable alignment between study artefacts and CDISC biomedical concepts. Our agentic workflow discovers, validates, and maps concepts end-to-end—without black-box pipelines or heavyweight infrastructure.

Table of Contents

What does this project do?

The USDM Biomedical Concept Mapper helps identify biomedical concepts for activities in USDM files:

  • Automated Mapping: Maps activities from USDM files to standardized biomedical concepts
  • AI-Powered Search: Uses Large Language Models (LLMs) to find the best matching CDISC concepts for given activities
  • CDISC Integration: Utilizes the latest CDISC biomedical concepts and SDTM dataset specializations
  • Batch Processing: Processes entire USDM study files and generates mapped outputs

Key Features

  • Multiple Search Methods: Supports both LLM-powered exact matching and local index searching
  • Configurable AI Models: Supports different commercial or open-source LLMs
  • Command Line Interface: Easy-to-use CLI for batch processing and individual concept searches

Architecture Overview

Our self-steering agent keeps the stack intentionally simple while delivering state-of-the-art accuracy. Every run goes through three explainable steps:

%%{init: {'theme':'neutral', 'flowchart': {'htmlLabels': true, 'curve': 'basis', 'wrap': true}}}%%
flowchart TB
    subgraph L1[Context & Retrieval]
        direction LR
        A["🧩 USDM Activity Context"] --> B["① Dynamic Retrieval<br/>🔍 bm25s + Polars index"] --> C{{"Top-K Concept Set"}}
    end

    subgraph L2[Reasoning & Delivery]
        direction LR
        D["② LLM Reasoning<br/>🤖 Commercial / open-source model"] --> E["③ USDM Mapping<br/>🛠️ SDK + deterministic writer"] --> F["📄 Downstream applications"]
    end

    C --> D

    classDef context fill:#E1F5FE,stroke:#0277BD,color:#01579B,font-weight:bold;
    classDef process fill:#E8F5E9,stroke:#2E7D32,color:#1B5E20;
    classDef decision fill:#FFF3E0,stroke:#EF6C00,color:#E65100,font-weight:bold;
    classDef output fill:#EDE7F6,stroke:#512DA8,color:#311B92,font-weight:bold;

    class A context
    class B,D,E process
    class C decision
    class F output

Step 1 – Dynamic Retrieval

CdiscBcIndex builds a lightweight BM25 retriever over the latest CDISC biomedical concepts and SDTM specializations. For every prompt from the agent, it dynamically generates retrieval queries and returns a concise, templated synopsis of the top candidates.

Step 2 – LLM Reasoning

find_biomedical_concept iteratively calls your chosen LLM (e.g. OpenAI, Gemini, or self-hosted open-weight models) to reason over the candidate set. The agent loop is implemented is transparent, providing explainable decision traces for each search.

Step 3 – Deterministic USDM Mapping

map_biomedical_concepts enriches the USDM wrapper with the validated biomedical concept IDs, properties, and codelists. The mapper ensures every activity is linked to the correct CDISC structures and writes a ready-to-share JSON package.

Why it works: Retrieval keeps the LLM grounded, the reasoning step is model-agnostic, and the mapper uses CDISC-native schemas so teams get a fast, auditable pipeline without orchestrating multiple services.

Robust Design Principles

  • Flexible LLM Integration: Designed to work reliably with both enterprise APIs and local open-source models, providing teams with deployment flexibility and avoiding vendor lock-in.
  • Transparent Processing Pipeline: Features a clear three-step workflow (retrieve, reason, map) with comprehensive audit trails and explainable decision logs for regulatory compliance.
  • Production-Ready Architecture: Built on proven technologies including Polars for data processing, BM25 for retrieval, and the official USDM SDK—ensuring stability and maintainability.
  • Reliable Performance: Engineered for consistent, traceable mappings that support regulatory workflows and CDISC metadata requirements.

Installation

Prerequisites

  • Python 3.13 or higher
  • Access to LLM (commercial or open-source)

Install from PyPI

pip install usdm-bc-mapper

Quick Start

  1. Install the package:

    pip install usdm-bc-mapper
    
  2. Create a config file (config.yaml) in your working directory:

    llm_api_key: "your-api-key-here"
    llm_model: "gpt-5-mini"
    
  3. Run the mapper on your USDM file:

    bcm usdm your_study.json
    
  4. Get help with any command:

    bcm --help
    bcm usdm --help
    

Configuration

Before using the tool, you need to configure your settings. Create a config.yaml file in your working directory (the same directory where your USDM JSON file is located):

# config.yaml
llm_api_key: "your-api-key-here"
llm_model: "gpt-5-mini" # or your preferred model

# Optional Configurations
llm_base_url: "https://api.openai.com/v1" # or your custom endpoint
max_ai_lookup_attempts: 7 # max retries for AI lookup
data_path: "path/to/cdisc/data" # path to CDISC data files and system prompt for LLMs
data_search_cols: # columns to search in CDISC data
  - "short_name"
  - "bc_categories"
  - "synonyms"
  - "definition"

AI SDK Compatibility

The mapper speaks the OpenAI-compatible API spec, so you can bring your own provider. See docs/ai_sdk_compatibility.md for the full walkthrough; the quick-start presets are below.

Commercial APIs

llm_base_url: "https://api.openai.com/v1"
llm_api_key: "sk-your-api-key"
llm_model: "gpt-5-mini"

Open-weight Aggregators

llm_base_url: "https://openrouter.ai/api/v1"
llm_api_key: "sk-or-your-key"
llm_model: "meta-llama/llama-3.1-8b-instruct:free"

Self-hosted Runtimes

llm_base_url: "http://localhost:11434/v1"
llm_api_key: "not-needed"
llm_model: "phi4"

Need hardware tips and server commands? Jump to docs/ai_sdk_compatibility.md.

Command Line Usage

The tool provides three main commands through the bcm CLI. Use bcm --help or bcm <command> --help to see detailed documentation for each command.

1. Map USDM File Biomedical Concepts

Map all biomedical concepts in a USDM file to CDISC standards:

bcm usdm path/to/your/usdm_file.json --config config.yaml

With custom output file:

bcm usdm path/to/your/usdm_file.json --output mapped_results.json --config config.yaml

2. Find Individual Biomedical Concept

Find CDISC match for a specific biomedical concept using LLM (provides exact matching):

bcm find-bc-cdisc "diabetes mellitus" --config config.yaml

3. Search CDISC Biomedical Concepts

Search the local CDISC index for matching concepts (searches local index without LLM):

bcm search-bc-cdisc "blood pressure" --config config.yaml

Search with custom number of results:

bcm search-bc-cdisc "blood pressure" --k 20 --config config.yaml

Note: The main difference between find-bc-cdisc and search-bc-cdisc is that find-bc-cdisc uses an LLM to find exact matches, while search-bc-cdisc looks for matches in the local index.

Advanced Usage

Enable Debug Logging

Add the --show-logs flag to any command to see detailed processing information:

bcm usdm path/to/file.json --config config.yaml --show-logs

Output Examples

USDM Mapping Output

When using bcm usdm, the tool outputs the original USDM data with mapped CDISC biomedical concepts, including confidence scores and reasoning in structured JSON format.

Individual Concept Search Output

When using bcm find-bc-cdisc or bcm search-bc-cdisc, the tool returns matched CDISC concept details with relevance scores.

Development

Development Setup

Clone the project:

git clone https://github.com/AI-LENS/usdm-bc-mapper.git

Go to the project directory:

cd usdm-bc-mapper

Install dependencies:

uv sync --group dev

Running Tests

pytest

Pre-commit Hooks

Install pre-commit hooks for code quality:

pre-commit install
pre-commit run --all-files

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License.

Support

For questions or issues, please open an issue on the GitHub repository.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

usdm_bc_mapper-0.3.2.tar.gz (18.8 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

usdm_bc_mapper-0.3.2-py3-none-any.whl (18.8 MB view details)

Uploaded Python 3

File details

Details for the file usdm_bc_mapper-0.3.2.tar.gz.

File metadata

  • Download URL: usdm_bc_mapper-0.3.2.tar.gz
  • Upload date:
  • Size: 18.8 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for usdm_bc_mapper-0.3.2.tar.gz
Algorithm Hash digest
SHA256 0dbe9fbc6090a5e2c4b36ac1ed3610c7f12dda992c110edaf113f41c811e1f1e
MD5 6f7860cc8976dd9f18185791f2b85c0a
BLAKE2b-256 c74d528732303a3f29c70a430cf3aff527b78b11da05243cada2c36fd3fd31af

See more details on using hashes here.

Provenance

The following attestation bundles were made for usdm_bc_mapper-0.3.2.tar.gz:

Publisher: pypi_release.yml on AI-LENS/usdm-bc-mapper

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file usdm_bc_mapper-0.3.2-py3-none-any.whl.

File metadata

  • Download URL: usdm_bc_mapper-0.3.2-py3-none-any.whl
  • Upload date:
  • Size: 18.8 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for usdm_bc_mapper-0.3.2-py3-none-any.whl
Algorithm Hash digest
SHA256 8f73276a845d4b3ae8363afcf87604515dfb281f3351fc2fd58ef9559cff47b7
MD5 cab7d6b6a687892012b1376e987e18e9
BLAKE2b-256 4f77c896f4e32dedbb861fc7c634e19b39ca01d11c68f9473d57e6171a3ac2a9

See more details on using hashes here.

Provenance

The following attestation bundles were made for usdm_bc_mapper-0.3.2-py3-none-any.whl:

Publisher: pypi_release.yml on AI-LENS/usdm-bc-mapper

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page