Ontology mapping for Open Targets

OnToma

Introduction

OnToma is a Python package for mapping entities to identifiers using lookup tables. It is optimised for large-scale entity mapping, and is designed to work with PySpark DataFrames.

OnToma supports the mapping of two kinds of entities: labels (e.g. brachydactyly) and ids (e.g. OMIM:112500).

OnToma includes an NER (Named Entity Recognition) module for extracting clean entity names from raw text labels. This is useful when your data contains labels that need preprocessing before they can be mapped. Currently, this feature is available for drugs and diseases. To use the NER features, see NER Module Documentation.

OnToma currently has modules to generate lookup tables from the following datasources:

Open Targets disease, target, and drug indices
Disease curation tables with the SEMANTIC_TAG and PROPERTY_VALUE fields (e.g. the Open Targets disease curation table). You can also provide your own curation tables, as long as they are compatible with the defined schema.

OnToma normalises entities using Spark NLP: entities in both the lookup table and the input DataFrame are normalised before matching, which improves entity matching.

A successfully mapped entity may be mapped to more than one identifier.
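The lookup-based approach described above can be pictured with a small pure-Python sketch. It is not OnToma's implementation (which runs on PySpark and normalises with Spark NLP); the `normalise` function, the toy lookup rows, and the `EX:` identifiers are all made up for illustration.

```python
import re
from collections import defaultdict


def normalise(label: str) -> str:
    """Toy normalisation: lowercase, turn punctuation into spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", label.lower())
    return cleaned.strip()


def build_lookup(rows):
    """Build a mapping from normalised label to a sorted list of identifiers."""
    lookup = defaultdict(set)
    for label, identifier in rows:
        lookup[normalise(label)].add(identifier)
    return {label: sorted(ids) for label, ids in lookup.items()}


def map_label(lookup, label):
    """Return every identifier whose normalised label matches, or an empty list."""
    return lookup.get(normalise(label), [])


# Hypothetical lookup rows (label, identifier); the EX: identifiers are placeholders.
lookup_rows = [
    ("Brachydactyly", "EX:0001"),
    ("brachydactyly", "EX:0002"),
    ("Type-2 diabetes", "EX:0003"),
]
lookup = build_lookup(lookup_rows)
```

Because both sides are normalised, `map_label(lookup, "BRACHYDACTYLY")` matches two lookup rows and returns two identifiers, which is why a single entity can end up with several mappings.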

Prerequisites

Java Runtime Environment

OnToma requires OpenJDK 8 or 11 to be installed on your system, as it is a prerequisite for PySpark and Spark NLP.

macOS Installation

Install OpenJDK 8 or 11 using Homebrew (OpenJDK 11 is shown here):

brew install openjdk@11

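After installation, set the JAVA_HOME environment variable by adding the following to your shell configuration file (e.g., ~/.zshrc or ~/.bash_profile). The path assumes Homebrew's default prefix on Apple Silicon (/opt/homebrew); on Intel Macs the default prefix is /usr/local.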

export JAVA_HOME="/opt/homebrew/opt/openjdk@11/libexec/openjdk.jdk/Contents/Home"
export PATH="$JAVA_HOME/bin:$PATH"

Reload your shell configuration:

source ~/.zshrc

Verify the installation:

java -version
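If you prefer to check the Java version from Python (for example at the top of a script), something like the following sketch could be used. The helper names `parse_java_major` and `installed_java_major` are invented here and are not part of OnToma; the parsing assumes the usual `version "..."` line printed by `java -version`.

```python
import re
import shutil
import subprocess


def parse_java_major(version_output: str):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and earlier report themselves as 1.x (e.g. "1.8.0_392").
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major():
    """Run `java -version` and parse its output; None if java is not on PATH."""
    if shutil.which("java") is None:
        return None
    # `java -version` writes its report to stderr rather than stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr or result.stdout)
```

Calling `installed_java_major()` and checking that the result is 8 or 11 gives an early, readable failure instead of a Spark startup error.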

Installation

pip install ontoma

Spark session configuration

OnToma requires a Spark session configured to include the Spark NLP library.

from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

# add Spark NLP library to Spark configuration
config = (
    SparkConf()
    .set("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:6.1.3")
)

# create Spark session
spark = SparkSession.builder.config(conf=config).getOrCreate()

Usage example

The following example shows how to map diseases with OnToma.

First, load data to generate a disease label lookup table:

from ontoma import OnToma, OpenTargetsDisease

disease_index = spark.read.parquet("path/to/disease/index")
disease_label_lut = OpenTargetsDisease.as_label_lut(disease_index)

Then, create the OnToma object to be used for mapping entities:

ont = OnToma(
    spark=spark,
    entity_lut_list=[disease_label_lut],
)

Given an input PySpark DataFrame disease_df containing the diseases to be mapped in the column disease_name:

import pyspark.sql.functions as f

mapped_disease_df = ont.map_entities(
    df=disease_df,
    result_col_name="mapped_ids",
    entity_col_name="disease_name",
    entity_kind="label",
    type_col=f.lit("DS"),
)

The mapping results are in the mapped_ids column. For each entity, this column holds the list of identifiers that the entity was successfully mapped to.
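Because each entity can map to zero, one, or several identifiers, it can be useful to split results by outcome before using them. The sketch below works on plain Python dictionaries, as one might obtain by collecting a small sample of `mapped_disease_df`. Whether unmapped entities carry an empty list or a null value is not stated here, so both are treated as unmapped. The helper names, sample rows, and `EX:` identifiers are made up.

```python
def classify_mapping(mapped_ids):
    """Classify a mapping result as 'unmapped', 'unique', or 'ambiguous'."""
    if not mapped_ids:  # covers both None and an empty list
        return "unmapped"
    return "unique" if len(set(mapped_ids)) == 1 else "ambiguous"


def summarise(rows, result_col="mapped_ids"):
    """Count how many rows fall into each mapping outcome."""
    counts = {"unmapped": 0, "unique": 0, "ambiguous": 0}
    for row in rows:
        counts[classify_mapping(row.get(result_col))] += 1
    return counts


# Hypothetical collected rows; the EX: identifiers are placeholders.
sample_rows = [
    {"disease_name": "brachydactyly", "mapped_ids": ["EX:0001", "EX:0002"]},
    {"disease_name": "type 2 diabetes", "mapped_ids": ["EX:0003"]},
    {"disease_name": "not a disease", "mapped_ids": []},
    {"disease_name": "missing", "mapped_ids": None},
]
summary = summarise(sample_rows)
```

For large DataFrames the same split would be done with Spark column expressions rather than by collecting rows to the driver.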

Using NER for preprocessing (drugs)

When your drug labels contain dosages, forms, or brand names, use the NER module to extract clean entity names before mapping:

from ontoma.ner.drug import extract_drug_entities
import pyspark.sql.functions as f

# Extract clean drug entities from raw labels
df_extracted = extract_drug_entities(
    spark=spark,
    df=raw_drug_df,
    input_col="raw_drug_label",
    output_col="extracted_drugs"
)

# Explode arrays for mapping
df_exploded = df_extracted.select("*", f.explode("extracted_drugs").alias("clean_drug"))

# Map with OnToma (ont must have been created with a drug label lookup table)
mapped_df = ont.map_entities(
    df=df_exploded,
    entity_col_name="clean_drug",
    entity_kind="label",
    type_col=f.lit("drug")
)

See NER Module Documentation for more details.
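One detail of the explode step is worth keeping in mind: PySpark's `explode` emits one row per array element and drops rows whose array is empty or null, so labels for which no entity was extracted do not appear in `df_exploded` (`explode_outer` keeps them with a null value instead). The pure-Python sketch below mimics that behaviour on made-up rows; it illustrates the semantics only and is not how Spark implements them.

```python
def explode(rows, array_col, out_col, keep_empty=False):
    """Mimic explode (keep_empty=False) or explode_outer (keep_empty=True)."""
    exploded = []
    for row in rows:
        values = row.get(array_col) or []
        if not values and keep_empty:
            # explode_outer keeps the row and fills the new column with null.
            exploded.append({**row, out_col: None})
        for value in values:
            exploded.append({**row, out_col: value})
    return exploded


# Hypothetical rows standing in for the output of the NER step.
rows = [
    {"raw_drug_label": "drug A 50 mg + drug B", "extracted_drugs": ["drug A", "drug B"]},
    {"raw_drug_label": "unrecognised text", "extracted_drugs": []},
]
inner = explode(rows, "extracted_drugs", "clean_drug")
outer = explode(rows, "extracted_drugs", "clean_drug", keep_empty=True)
```

Here `inner` has two rows and silently loses the unrecognised label, while `outer` keeps it as a third row with a null entity, which makes unextracted labels easier to audit.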

Speeding up subsequent OnToma usage

PySpark uses lazy evaluation, meaning transformations are not executed until an action is triggered. As a result, the lookup table processing logic is re-run each time the OnToma object is used.

When using the same OnToma object multiple times, it is recommended to specify a cache directory with the cache_dir parameter when creating the object, so that the lookup table processing is not repeated on each use.

ont = OnToma(
    spark=spark,
    entity_lut_list=[disease_label_lut],
    cache_dir="path/to/cache/dir",
)
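Why a cache helps can be seen in a pure-Python analogy: a lazy pipeline is re-run every time a result is requested, whereas a result persisted to disk can simply be read back. The names `process_lookup_table` and `load_or_build` are invented for this sketch, and the JSON file is only a stand-in; how OnToma lays out its cache directory is not described here.

```python
import json
import tempfile
from pathlib import Path

calls = {"count": 0}


def process_lookup_table():
    """Stand-in for expensive lookup table processing; counts how often it runs."""
    calls["count"] += 1
    return {"brachydactyly": ["EX:0001"]}  # placeholder identifier


def load_or_build(cache_dir=None):
    """Recompute every time without a cache; otherwise reuse the stored result."""
    if cache_dir is None:
        return process_lookup_table()
    cache_file = Path(cache_dir) / "lookup.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    table = process_lookup_table()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(table))
    return table


# Without a cache, two uses trigger two computations.
load_or_build()
load_or_build()
uncached_calls = calls["count"]

# With a cache directory, the second use reads the stored result.
calls["count"] = 0
with tempfile.TemporaryDirectory() as tmp:
    first = load_or_build(tmp)
    second = load_or_build(tmp)
cached_calls = calls["count"]
```

In this toy run the processing function executes twice without a cache and once with it, which mirrors the saving that cache_dir is meant to provide.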

Development

Running Tests

Install development dependencies:

uv sync --dev

Run all tests:

uv run pytest

Skip slow tests (e.g., NER tests that download large models):

uv run pytest -m "not slow"

Release history

2.4.1 (May 22, 2026)
2.4.0 (May 14, 2026)
2.3.1 (Feb 27, 2026)
2.3.0 (Feb 26, 2026)
2.1.1 (Nov 13, 2025)
2.1.0 (Nov 5, 2025)
2.0.0 (Oct 20, 2025)
1.1.2 (Aug 8, 2024)
1.1.0 (Jan 31, 2023)
1.0.3 (Nov 21, 2022)
1.0.2 (Jan 31, 2022)
1.0.1 (Nov 2, 2021)
1.0.0 (Jul 29, 2021)
0.0.18 (May 26, 2021)
0.0.17 (Oct 26, 2020)
0.0.16 (Aug 19, 2020)
0.0.15 (Jul 17, 2020)
0.0.14 (Jun 19, 2020)
0.0.13 (Apr 25, 2018)
0.0.11 (Apr 25, 2018)
0.0.6 (Mar 30, 2018)
0.0.5 (Mar 26, 2018)
0.0.2 (Mar 21, 2018)
0.0.1 (Mar 15, 2018)

