Embedding-based linker for MedCAT

These details have been verified by PyPI

Project links

Source

Owner

CogStack

GitHub Statistics

These details have not been verified by PyPI

Project links

Project description

MedCAT Embedding Linker

A MedCAT plugin that provides an embedding-based entity linking component using transformer models from HuggingFace.

Overview

This plugin replaces MedCAT's default linking component with a transformer-based approach that uses semantic similarity between entity contexts and concept embeddings to perform entity disambiguation.

Key features:

Semantic similarity-based linking using transformer embeddings
Support for any HuggingFace sentence-transformer model
Efficient batch processing with GPU acceleration
Configurable similarity thresholds and context windows
CUI-based filtering (include/exclude lists)

Requirements

MedCAT 2.0+
Python 3.10+
PyTorch
Transformers

Installation

pip install medcat-embedding-linker

Quick Start

from medcat.cat import CAT
from medcat.config import Config
from medcat.components.types import CoreComponentType

from medcat_embedding_linker import EmbeddingLinking

# Load your MedCAT model
cat = CAT.load_model_pack("path/to/model_pack")

# Configure the embedding linker
cat.config.components.linking = EmbeddingLinking()
cat.config.components.linking.embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"

# Recreate the pipeline to register the new linker
cat._recreate_pipe()

# Generate embeddings for your concept database
linker = cat.get_component(CoreComponentType.linking)
linker.create_embeddings()

# Use as normal
entities = cat.get_entities("Patient presents with chest pain and dyspnea.")

How It Works

Component Registration

The embedding linker automatically registers itself as embedding_linker when an EmbeddingLinking config is detected. It implements MedCAT's AbstractEntityProvidingComponent interface and is lazily loaded when the pipeline is created.

Embedding Generation

The linker operates on two types of embeddings:

1. Concept Embeddings (pre-computed)

Each CUI is represented by the embedding of its longest name
Stored in cdb.addl_info["cui_embeddings"]
Used for final disambiguation between candidate CUIs

2. Name Embeddings (pre-computed)

Each concept name in the CDB gets its own embedding
Stored in cdb.addl_info["name_embeddings"]
Used for initial candidate retrieval

Both are generated via linker.create_embeddings() and cached for inference.
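The relationship between the two stores can be pictured with a toy sketch. Everything here is illustrative: `embed` is a hypothetical stand-in for the transformer encoder (it returns a deterministic pseudo-random unit vector), and `cui2names` is made-up CDB content. The plugin's actual internals and storage layout may differ.

```python
import hashlib

import numpy as np


def embed(text: str, dim: int = 8) -> np.ndarray:
    """Hypothetical stand-in for a transformer encoder: a deterministic unit vector."""
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:4], "big")
    vec = np.random.default_rng(seed).normal(size=dim)
    return vec / np.linalg.norm(vec)


# Illustrative CDB content (CUI -> names), not taken from a real model pack.
cui2names = {
    "C0008031": ["chest pain", "thoracic pain", "pain in chest"],
    "C0013404": ["dyspnea", "shortness of breath"],
}

# Name embeddings: one vector per concept name.
name_embeddings = {
    name: embed(name) for names in cui2names.values() for name in names
}

# Concept embeddings: each CUI is represented by its longest name
# (ties are resolved by taking the first longest name in the list).
longest_name = {cui: max(names, key=len) for cui, names in cui2names.items()}
cui_embeddings = {cui: name_embeddings[name] for cui, name in longest_name.items()}
```

In this sketch the concept embedding is simply a lookup into the name embeddings, so a real implementation would only need to encode each name once.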

Inference Process

For each detected entity:

1. Context Vector Calculation: Extract a text snippet around the entity (size controlled by context_window_size) and embed it.
2. Candidate Retrieval: Compare the context embedding against all name embeddings to find the top matches above short_similarity_threshold.
3. Disambiguation: If multiple CUIs are associated with the best-matching name, compare the context embedding against their CUI embeddings to select the final concept.
4. Filtering: Apply the CUI include/exclude filters and check the result against long_similarity_threshold.
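The retrieval and disambiguation steps above can be sketched with plain cosine similarity. This is a simplified illustration under stated assumptions, not the plugin's implementation: the function name `link`, the toy vectors, and the placeholder CUIs are hypothetical, only the single best-matching name is considered, and the CUI similarity is always computed (the text above says the plugin only disambiguates when a name maps to several CUIs). CUI filtering is left out.

```python
import numpy as np


def normalize(m: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    return m / np.linalg.norm(m, axis=-1, keepdims=True)


def link(context_vec, names, name_vecs, name2cuis, cui_vecs,
         short_threshold=0.3, long_threshold=0.5):
    """Return (cui, similarity) for the best concept, or None if below a threshold."""
    context_vec = normalize(context_vec)

    # Candidate retrieval: find the name most similar to the context.
    name_sims = normalize(name_vecs) @ context_vec
    best = int(np.argmax(name_sims))
    if name_sims[best] < short_threshold:
        return None

    # Disambiguation: score every CUI that shares the best-matching name.
    candidates = name2cuis[names[best]]
    scores = {cui: float(normalize(cui_vecs[cui]) @ context_vec) for cui in candidates}
    cui = max(scores, key=scores.get)

    # Final check against the stricter threshold.
    return (cui, scores[cui]) if scores[cui] >= long_threshold else None


# Toy data: the name "cold" is ambiguous between two placeholder concepts.
names = ["cold", "fever"]
name_vecs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
name2cuis = {"cold": ["CUI_A", "CUI_B"], "fever": ["CUI_C"]}
cui_vecs = {
    "CUI_A": np.array([0.9, 0.1, 0.4]),
    "CUI_B": np.array([0.9, 0.1, -0.4]),
    "CUI_C": np.array([0.0, 1.0, 0.0]),
}

result = link(np.array([1.0, 0.0, 0.5]), names, name_vecs, name2cuis, cui_vecs)
no_match = link(np.array([0.0, 0.0, 1.0]), names, name_vecs, name2cuis, cui_vecs)
```

Here the context points toward `CUI_A` along the third dimension, so it is preferred over `CUI_B`; the second context is orthogonal to every name and is rejected at the retrieval step.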

Configuration

Key Parameters

config.components.linking = EmbeddingLinking(
    # Model settings
    embedding_model_name="sentence-transformers/all-MiniLM-L6-v2",
    max_token_length=128,
    
    # Context settings
    context_window_size=10,  # tokens on each side of entity
    
    # Similarity thresholds
    short_similarity_threshold=0.3,  # for candidate retrieval
    long_similarity_threshold=0.5,   # for final linking
    
    # Batch sizes
    embedding_batch_size=4096,
    linking_batch_size=512,
    
    # Filtering
    filters=Filters(
        cuis={"C0018802", "C0011849"},  # include only these
        cuis_exclude={"C0000001"}        # or exclude these
    ),
    
    # Advanced options
    use_ner_link_candidates=True,
    always_calculate_similarity=False,
    filter_before_disamb=True,
    gpu_device="cuda:0"  # or None for auto-detect
)
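As a rough picture of what context_window_size controls, the sketch below takes up to `window` tokens on each side of an entity. The helper `context_window` and the whitespace tokenization are assumptions made for illustration; the plugin works with its own tokenizer and may count tokens differently.

```python
def context_window(tokens: list[str], start: int, end: int, window: int = 10) -> list[str]:
    """Return the entity tokens plus up to `window` tokens on each side.

    `start` is the index of the first entity token, `end` is one past the last.
    """
    left = max(0, start - window)
    right = min(len(tokens), end + window)
    return tokens[left:right]


tokens = "Patient presents with chest pain and dyspnea .".split()
# The entity "chest pain" covers tokens 3 and 4.
snippet = context_window(tokens, start=3, end=5, window=2)
print(" ".join(snippet))
```

A larger window gives the encoder more surrounding text to work with, at the cost of longer inputs (bounded by max_token_length).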

Embedding Models

Any HuggingFace model compatible with sentence-transformers should work. Popular options:

sentence-transformers/all-MiniLM-L6-v2 (default, fast and lightweight)
sentence-transformers/all-mpnet-base-v2 (higher quality)
UFNLP/gatortron-medium (biomedical domain)
microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext

Advanced Usage

Re-generating Embeddings

If you modify your CDB or want to try a different model:

linker = cat.get_component("embedding_linker")
linker.create_embeddings(
    embedding_model_name="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
    max_length=256
)

GPU Configuration

# Use specific GPU
cat.config.components.linking.gpu_device = "cuda:1"

# Force CPU
cat.config.components.linking.gpu_device = "cpu"

Filtering

# Include only specific CUIs
cat.config.components.linking.filters.cuis = {"C0011849", "C0018802"}

# Exclude specific CUIs
cat.config.components.linking.filters.cuis_exclude = {"C0000001"}

# Note: If both are set, only include filters are applied
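The precedence described in the note above can be expressed as a small predicate. This is only a sketch of the stated behaviour; `passes_filters` is a hypothetical helper, not a function exported by the plugin.

```python
def passes_filters(cui: str, include: set[str], exclude: set[str]) -> bool:
    """Include list wins when it is non-empty; otherwise the exclude list applies."""
    if include:
        return cui in include
    return cui not in exclude


include = {"C0011849", "C0018802"}
exclude = {"C0000001"}

# With both sets given, only the include list is consulted.
kept = [c for c in ["C0011849", "C0000001", "C0999999"] if passes_filters(c, include, exclude)]

# With an empty include list, everything except the excluded CUIs passes.
kept_exclude_only = [c for c in ["C0011849", "C0000001"] if passes_filters(c, set(), exclude)]
```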

Performance Considerations

First-time embedding generation: Can take several minutes for large CDBs (millions of concepts)
GPU recommended: 10-50x faster inference with CUDA
Batch sizes: Increase if you have GPU memory available
Model selection: Smaller models (e.g., MiniLM) are faster but may be less accurate than larger domain-specific models
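To illustrate why batch size trades memory for throughput, the sketch below computes nearest-name lookups in chunks, so that only a `batch_size x n_keys` similarity matrix is held in memory at a time. It uses NumPy on random data; `batched_top1` is a hypothetical helper, and how the plugin applies linking_batch_size internally is an assumption.

```python
import numpy as np


def batched_top1(queries: np.ndarray, keys: np.ndarray, batch_size: int = 512) -> np.ndarray:
    """Index of the most similar key for every query, computed batch by batch."""
    best = np.empty(len(queries), dtype=np.int64)
    for start in range(0, len(queries), batch_size):
        chunk = queries[start:start + batch_size]
        sims = chunk @ keys.T  # shape: (len(chunk), len(keys))
        best[start:start + batch_size] = sims.argmax(axis=1)
    return best


rng = np.random.default_rng(0)
keys = rng.normal(size=(100, 16))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)  # unit-length rows

# Query with three of the keys themselves; each should retrieve its own index.
queries = keys[[5, 17, 42]]
indices = batched_top1(queries, keys, batch_size=2)
```

The result is independent of `batch_size`; only peak memory use and the number of matrix multiplications change.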

Limitations

Does not support prefer_frequent_concepts or prefer_primary_name from the default linker (logs warnings if set)
Training mode is not applicable (logs warning if enabled)
Requires pre-computed embeddings before inference

Citation

If you use this plugin, please cite MedCAT:

@article{medcat2021,
    title={Medical Concept Annotation Tool (MedCAT)},
    author={Kraljevic, Zeljko and others},
    journal={arXiv preprint arXiv:2010.01165},
    year={2021}
}

Release history

0.3.0

Apr 13, 2026

This version

0.2.0

Feb 13, 2026

Download files

Source Distribution

medcat_embedding_linker-0.2.0.tar.gz (33.5 MB view details)

Uploaded Feb 13, 2026 Source

Built Distribution

medcat_embedding_linker-0.2.0-py3-none-any.whl (12.8 kB view details)

Uploaded Feb 13, 2026 Python 3

File details

Details for the file medcat_embedding_linker-0.2.0.tar.gz.

File metadata

Download URL: medcat_embedding_linker-0.2.0.tar.gz
Upload date: Feb 13, 2026
Size: 33.5 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for medcat_embedding_linker-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`fea159c275ab8e1f3a23ea795d5aa04c8f61b57ff0486a8153e24eda4ee30613`
MD5	`fd637f6580d29f25e04d4d2bda840dc7`
BLAKE2b-256	`22771fddd3e28bf586023d88a727495c34fc258d1a9923f30c3ef79243d9e6c5`
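A downloaded archive can be checked against the SHA256 digest listed above using only the standard library. This is a minimal sketch: `sha256_of` is a hypothetical helper, and the file is assumed to sit in the current working directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA256 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


# Digest as listed for medcat_embedding_linker-0.2.0.tar.gz.
EXPECTED = "fea159c275ab8e1f3a23ea795d5aa04c8f61b57ff0486a8153e24eda4ee30613"

archive = Path("medcat_embedding_linker-0.2.0.tar.gz")  # assumed download location
if archive.exists():
    print("OK" if sha256_of(archive) == EXPECTED else "MISMATCH")
```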

Provenance

The following attestation bundles were made for medcat_embedding_linker-0.2.0.tar.gz:

Publisher: medcat-embedding-linker_ci.yml on CogStack/cogstack-nlp

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: medcat_embedding_linker-0.2.0.tar.gz
- Subject digest: fea159c275ab8e1f3a23ea795d5aa04c8f61b57ff0486a8153e24eda4ee30613
- Sigstore transparency entry: 947473532
- Sigstore integration time: Feb 13, 2026
Source repository:
- Permalink: CogStack/cogstack-nlp@0f5c76d0edef97f2ffb3bbce15da7fa50cbebcf2
- Branch / Tag: refs/tags/medcat-embedding-linker/v0.2.0
- Owner: https://github.com/CogStack
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: medcat-embedding-linker_ci.yml@0f5c76d0edef97f2ffb3bbce15da7fa50cbebcf2
- Trigger Event: push

File details

Details for the file medcat_embedding_linker-0.2.0-py3-none-any.whl.

File metadata

Download URL: medcat_embedding_linker-0.2.0-py3-none-any.whl
Upload date: Feb 13, 2026
Size: 12.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for medcat_embedding_linker-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`360ccfb8619b60588549ba04fdb0bc2799fcf5bfaa5d3d93b06a628fa9f6325b`
MD5	`936f8312d854a2c328d2c171bf7ce1db`
BLAKE2b-256	`d955545e9c25832f6c1650953e915164de487764e289f2722bf161c1c71eab38`

Provenance

The following attestation bundles were made for medcat_embedding_linker-0.2.0-py3-none-any.whl:

Publisher: medcat-embedding-linker_ci.yml on CogStack/cogstack-nlp

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: medcat_embedding_linker-0.2.0-py3-none-any.whl
- Subject digest: 360ccfb8619b60588549ba04fdb0bc2799fcf5bfaa5d3d93b06a628fa9f6325b
- Sigstore transparency entry: 947473535
- Sigstore integration time: Feb 13, 2026
Source repository:
- Permalink: CogStack/cogstack-nlp@0f5c76d0edef97f2ffb3bbce15da7fa50cbebcf2
- Branch / Tag: refs/tags/medcat-embedding-linker/v0.2.0
- Owner: https://github.com/CogStack
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: medcat-embedding-linker_ci.yml@0f5c76d0edef97f2ffb3bbce15da7fa50cbebcf2
- Trigger Event: push

medcat-embedding-linker 0.2.0
