A tool to convert activities into biomedical concepts
Project description
USDM Biomedical Concept Mapper
The first production-ready agent that maps biomedical concepts to the CDISC Unified Study Data Model using either commercial or open-source LLMs.
USDM Biomedical Concept Mapper is purpose-built for clinical teams who need accurate, explainable alignment between study artefacts and CDISC biomedical concepts. Our agentic workflow discovers, validates, and maps concepts end-to-end—without black-box pipelines or heavyweight infrastructure.
Table of Contents
- What does this project do?
- Architecture Overview
- Robust Design Principles
- Installation
- Quick Start
- Configuration
- AI SDK Compatibility
- Command Line Usage
- Advanced Usage
- Output Examples
- Development
- Contributing
- License
- Support
What does this project do?
The USDM Biomedical Concept Mapper helps identify biomedical concepts for activities in USDM files:
- Automated Mapping: Maps activities from USDM files to standardized biomedical concepts
- AI-Powered Search: Uses Large Language Models (LLMs) to find the best matching CDISC concepts for given activities
- CDISC Integration: Utilizes the latest CDISC biomedical concepts and SDTM dataset specializations
- Batch Processing: Processes entire USDM study files and generates mapped outputs
Key Features
- Multiple Search Methods: Supports both LLM-powered exact matching and local index searching
- Configurable AI Models: Supports different commercial or open-source LLMs
- Command Line Interface: Easy-to-use CLI for batch processing and individual concept searches
Architecture Overview
Our self-steering agent keeps the stack intentionally simple while delivering state-of-the-art accuracy. Every run goes through three explainable steps:
%%{init: {'theme':'neutral', 'flowchart': {'htmlLabels': true, 'curve': 'basis', 'wrap': true}}}%%
flowchart TB
subgraph L1[Context & Retrieval]
direction LR
A["🧩 USDM Activity Context"] --> B["① Dynamic Retrieval<br/>🔍 bm25s + Polars index"] --> C{{"Top-K Concept Set"}}
end
subgraph L2[Reasoning & Delivery]
direction LR
D["② LLM Reasoning<br/>🤖 Commercial / open-source model"] --> E["③ USDM Mapping<br/>🛠️ SDK + deterministic writer"] --> F["📄 Downstream applications"]
end
C --> D
classDef context fill:#E1F5FE,stroke:#0277BD,color:#01579B,font-weight:bold;
classDef process fill:#E8F5E9,stroke:#2E7D32,color:#1B5E20;
classDef decision fill:#FFF3E0,stroke:#EF6C00,color:#E65100,font-weight:bold;
classDef output fill:#EDE7F6,stroke:#512DA8,color:#311B92,font-weight:bold;
class A context
class B,D,E process
class C decision
class F output
Step 1 – Dynamic Retrieval
CdiscBcIndex builds a lightweight BM25 retriever over the latest CDISC biomedical concepts and SDTM specializations. For every prompt from the agent, it dynamically generates retrieval queries and returns a concise, templated synopsis of the top candidates.
Step 2 – LLM Reasoning
find_biomedical_concept iteratively calls your chosen LLM (e.g. OpenAI, Gemini, or self-hosted open-weight models) to reason over the candidate set. The agent loop is implemented is transparent, providing explainable decision traces for each search.
Step 3 – Deterministic USDM Mapping
map_biomedical_concepts enriches the USDM wrapper with the validated biomedical concept IDs, properties, and codelists. The mapper ensures every activity is linked to the correct CDISC structures and writes a ready-to-share JSON package.
Why it works: Retrieval keeps the LLM grounded, the reasoning step is model-agnostic, and the mapper uses CDISC-native schemas so teams get a fast, auditable pipeline without orchestrating multiple services.
Robust Design Principles
- Flexible LLM Integration: Designed to work reliably with both enterprise APIs and local open-source models, providing teams with deployment flexibility and avoiding vendor lock-in.
- Transparent Processing Pipeline: Features a clear three-step workflow (retrieve, reason, map) with comprehensive audit trails and explainable decision logs for regulatory compliance.
- Production-Ready Architecture: Built on proven technologies including Polars for data processing, BM25 for retrieval, and the official USDM SDK—ensuring stability and maintainability.
- Reliable Performance: Engineered for consistent, traceable mappings that support regulatory workflows and CDISC metadata requirements.
Installation
Prerequisites
- Python 3.13 or higher
- Access to LLM (commercial or open-source)
Install from PyPI
pip install usdm-bc-mapper
Quick Start
-
Install the package:
pip install usdm-bc-mapper
-
Create a config file (
config.yaml) in your working directory:llm_api_key: "your-api-key-here" llm_model: "gpt-5-mini"
-
Run the mapper on your USDM file:
bcm usdm your_study.json
-
Get help with any command:
bcm --help bcm usdm --help
Configuration
Before using the tool, you need to configure your settings. Create a config.yaml file in your working directory (the same directory where your USDM JSON file is located):
# config.yaml
llm_api_key: "your-api-key-here"
llm_model: "gpt-5-mini" # or your preferred model
# Optional Configurations
llm_base_url: "https://api.openai.com/v1" # or your custom endpoint
max_ai_lookup_attempts: 7 # max retries for AI lookup
data_path: "path/to/cdisc/data" # path to CDISC data files and system prompt for LLMs
data_search_cols: # columns to search in CDISC data
- "short_name"
- "bc_categories"
- "synonyms"
- "definition"
AI SDK Compatibility
The mapper speaks the OpenAI-compatible API spec, so you can bring your own provider. See docs/ai_sdk_compatibility.md for the full walkthrough; the quick-start presets are below.
Commercial APIs
llm_base_url: "https://api.openai.com/v1"
llm_api_key: "sk-your-api-key"
llm_model: "gpt-5-mini"
Open-weight Aggregators
llm_base_url: "https://openrouter.ai/api/v1"
llm_api_key: "sk-or-your-key"
llm_model: "meta-llama/llama-3.1-8b-instruct:free"
Self-hosted Runtimes
llm_base_url: "http://localhost:11434/v1"
llm_api_key: "not-needed"
llm_model: "phi4"
Need hardware tips and server commands? Jump to
docs/ai_sdk_compatibility.md.
Command Line Usage
The tool provides three main commands through the bcm CLI. Use bcm --help or bcm <command> --help to see detailed documentation for each command.
1. Map USDM File Biomedical Concepts
Map all biomedical concepts in a USDM file to CDISC standards:
bcm usdm path/to/your/usdm_file.json --config config.yaml
With custom output file:
bcm usdm path/to/your/usdm_file.json --output mapped_results.json --config config.yaml
2. Find Individual Biomedical Concept
Find CDISC match for a specific biomedical concept using LLM (provides exact matching):
bcm find-bc-cdisc "diabetes mellitus" --config config.yaml
3. Search CDISC Biomedical Concepts
Search the local CDISC index for matching concepts (searches local index without LLM):
bcm search-bc-cdisc "blood pressure" --config config.yaml
Search with custom number of results:
bcm search-bc-cdisc "blood pressure" --k 20 --config config.yaml
Note: The main difference between find-bc-cdisc and search-bc-cdisc is that find-bc-cdisc uses an LLM to find exact matches, while search-bc-cdisc looks for matches in the local index.
Advanced Usage
Enable Debug Logging
Add the --show-logs flag to any command to see detailed processing information:
bcm usdm path/to/file.json --config config.yaml --show-logs
Output Examples
USDM Mapping Output
When using bcm usdm, the tool outputs the original USDM data with mapped CDISC biomedical concepts, including confidence scores and reasoning in structured JSON format.
Individual Concept Search Output
When using bcm find-bc-cdisc or bcm search-bc-cdisc, the tool returns matched CDISC concept details with relevance scores.
Development
Development Setup
Clone the project:
git clone https://github.com/AI-LENS/usdm-bc-mapper.git
Go to the project directory:
cd usdm-bc-mapper
Install dependencies:
uv sync --group dev
Running Tests
pytest
Pre-commit Hooks
Install pre-commit hooks for code quality:
pre-commit install
pre-commit run --all-files
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License.
Support
For questions or issues, please open an issue on the GitHub repository.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file usdm_bc_mapper-0.3.2.tar.gz.
File metadata
- Download URL: usdm_bc_mapper-0.3.2.tar.gz
- Upload date:
- Size: 18.8 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
0dbe9fbc6090a5e2c4b36ac1ed3610c7f12dda992c110edaf113f41c811e1f1e
|
|
| MD5 |
6f7860cc8976dd9f18185791f2b85c0a
|
|
| BLAKE2b-256 |
c74d528732303a3f29c70a430cf3aff527b78b11da05243cada2c36fd3fd31af
|
Provenance
The following attestation bundles were made for usdm_bc_mapper-0.3.2.tar.gz:
Publisher:
pypi_release.yml on AI-LENS/usdm-bc-mapper
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
usdm_bc_mapper-0.3.2.tar.gz -
Subject digest:
0dbe9fbc6090a5e2c4b36ac1ed3610c7f12dda992c110edaf113f41c811e1f1e - Sigstore transparency entry: 541220946
- Sigstore integration time:
-
Permalink:
AI-LENS/usdm-bc-mapper@1a511e652b5b144b302bf04c2dc074a950d6385f -
Branch / Tag:
refs/heads/main - Owner: https://github.com/AI-LENS
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
pypi_release.yml@1a511e652b5b144b302bf04c2dc074a950d6385f -
Trigger Event:
workflow_dispatch
-
Statement type:
File details
Details for the file usdm_bc_mapper-0.3.2-py3-none-any.whl.
File metadata
- Download URL: usdm_bc_mapper-0.3.2-py3-none-any.whl
- Upload date:
- Size: 18.8 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
8f73276a845d4b3ae8363afcf87604515dfb281f3351fc2fd58ef9559cff47b7
|
|
| MD5 |
cab7d6b6a687892012b1376e987e18e9
|
|
| BLAKE2b-256 |
4f77c896f4e32dedbb861fc7c634e19b39ca01d11c68f9473d57e6171a3ac2a9
|
Provenance
The following attestation bundles were made for usdm_bc_mapper-0.3.2-py3-none-any.whl:
Publisher:
pypi_release.yml on AI-LENS/usdm-bc-mapper
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
usdm_bc_mapper-0.3.2-py3-none-any.whl -
Subject digest:
8f73276a845d4b3ae8363afcf87604515dfb281f3351fc2fd58ef9559cff47b7 - Sigstore transparency entry: 541220948
- Sigstore integration time:
-
Permalink:
AI-LENS/usdm-bc-mapper@1a511e652b5b144b302bf04c2dc074a950d6385f -
Branch / Tag:
refs/heads/main - Owner: https://github.com/AI-LENS
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
pypi_release.yml@1a511e652b5b144b302bf04c2dc074a950d6385f -
Trigger Event:
workflow_dispatch
-
Statement type: