
AI-backed harmonization framework


AI Harmonization

This package contains the code and related artifacts powering an AI-assisted data model harmonization tool, along with infrastructure for abstracting harmonization approaches, benchmarking them, and generating training data.

Usage

AI-Assisted Data Curation Toolkit

Harmonization Benchmarking and AI Training

Details

Reference harmonization.harmonization_approaches.similarity_inmem.SimilaritySearchInMemoryVectorDb to understand how to build new harmonization approaches.

Abstractions have been built to allow evaluation of different overall approaches. To add a new approach, implement the HarmonizationApproach base class.

There are two Pydantic data types defined to standardize the interface. HarmonizationSuggestions is:

class HarmonizationSuggestions(BaseModel):
    suggestions: List[SingleHarmonizationSuggestion]

and SingleHarmonizationSuggestion is:

class SingleHarmonizationSuggestion(BaseModel):
    source_node: str
    source_property: str
    source_description: str
    target_node: str
    target_property: str
    target_description: str
    similarity: Optional[float] = None
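For illustration, here is how the two models fit together. This is a minimal sketch assuming Pydantic v2; the field values are drawn from the example mapping later in this document, and the similarity score is made up:

```python
from typing import List, Optional

from pydantic import BaseModel


# Local mirrors of the two data types described above.
class SingleHarmonizationSuggestion(BaseModel):
    source_node: str
    source_property: str
    source_description: str
    target_node: str
    target_property: str
    target_description: str
    similarity: Optional[float] = None


class HarmonizationSuggestions(BaseModel):
    suggestions: List[SingleHarmonizationSuggestion]


# One suggestion: a source field mapped onto a target field.
suggestion = SingleHarmonizationSuggestion(
    source_node="PatientData",
    source_property="AgeAtDiagnosis",
    source_description="The age of the patient at the time of diagnosis.",
    target_node="standard_patient_record",
    target_property="age_at_diagnosis",
    target_description="Age in years when diagnosed with the condition.",
    similarity=0.92,  # illustrative value
)
result = HarmonizationSuggestions(suggestions=[suggestion])
print(result.suggestions[0].target_property)
```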

This format is convertible to SSSOM (A Simple Standard for Sharing Ontological Mappings); the conversion happens by default when calling get_metrics_for_approach.
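As a rough illustration of what that conversion looks like, the sketch below maps a single suggestion onto core SSSOM columns. The "Node.property" identifier scheme and the skos:relatedMatch predicate are assumptions made for this example, not necessarily what the library emits:

```python
# Hypothetical sketch: one suggestion (as a plain dict) onto core SSSOM
# columns. Column names follow the SSSOM spec; identifier formatting is
# an assumption for illustration only.
def suggestion_to_sssom_row(s: dict) -> dict:
    return {
        "subject_id": f"{s['source_node']}.{s['source_property']}",
        "subject_label": s["source_description"],
        "predicate_id": "skos:relatedMatch",
        "object_id": f"{s['target_node']}.{s['target_property']}",
        "object_label": s["target_description"],
        "confidence": s.get("similarity"),
    }


row = suggestion_to_sssom_row({
    "source_node": "PatientData",
    "source_property": "AgeAtDiagnosis",
    "source_description": "The age of the patient at the time of diagnosis.",
    "target_node": "standard_patient_record",
    "target_property": "age_at_diagnosis",
    "target_description": "Age in years when diagnosed with the condition.",
    "similarity": 0.92,
})
print(row["subject_id"])
```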

Here is an example of how to implement a new approach. First, create a new file: harmonization/harmonization_approaches/new_approach.py

from harmonization.harmonization_approaches.harmonization import (
    HarmonizationApproach,
    HarmonizationSuggestions,
    SingleHarmonizationSuggestion,
)

class NewApproachExample(HarmonizationApproach):

    def __init__(
        self,
    ):
        super().__init__()
        # TODO


    def get_harmonization_suggestions(
        self, input_source_model, input_target_model, **kwargs
    ):
        # TODO: populate with SingleHarmonizationSuggestion objects
        suggestions = []
        return HarmonizationSuggestions(suggestions=suggestions)
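To make the stub concrete, here is a toy, dependency-free stand-in for what get_harmonization_suggestions might compute: for each source property, pick the target property with the most similar description. It uses difflib string similarity rather than embeddings, and the node -> {property: description} model format is an assumption made for this sketch:

```python
from difflib import SequenceMatcher


def naive_suggestions(source_model, target_model):
    """Toy matcher: pair each source property with the target property
    whose description is most similar by SequenceMatcher ratio."""
    targets = [
        (t_node, t_prop, t_desc)
        for t_node, props in target_model.items()
        for t_prop, t_desc in props.items()
    ]
    suggestions = []
    for s_node, props in source_model.items():
        for s_prop, s_desc in props.items():
            best = max(
                targets,
                key=lambda t: SequenceMatcher(None, s_desc, t[2]).ratio(),
            )
            suggestions.append({
                "source_node": s_node,
                "source_property": s_prop,
                "source_description": s_desc,
                "target_node": best[0],
                "target_property": best[1],
                "target_description": best[2],
                "similarity": SequenceMatcher(None, s_desc, best[2]).ratio(),
            })
    return suggestions


src = {"PatientData": {"AgeAtDiagnosis": "Age of patient at diagnosis."}}
tgt = {
    "standard_patient_record": {
        "age_at_diagnosis": "Age in years when diagnosed.",
        "last_visit_date": "Date of most recent appointment.",
    }
}
out = naive_suggestions(src, tgt)
print(out[0]["target_property"])
```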

Note that the current SimilaritySearchInMemoryVectorDb already supports providing a new embedding algorithm.

And now, if you have a benchmark and want to evaluate the new approach:

from harmonization.harmonization_benchmark import get_metrics_for_approach
from harmonization.harmonization_approaches.new_approach import (
    NewApproachExample,
)

new_approach = NewApproachExample()

output_filename = get_metrics_for_approach(
    benchmark_filepath="path/to/benchmark.jsonl",
    harmonization_approach=new_approach,
    output_filename="output.tsv",
    metrics_column_name="custom_metrics",
)

Benchmark Details

The benchmark is a JSONL file in which each row is a separate test.

Each test has three keys: input_source_model (JSON), the source model to harmonize; input_target_model (JSON), the model to harmonize to; and harmonized_mapping, the expected known mapping (a TSV represented as a string, with two columns: ai_model_node_prop_desc and harmonized_model_node_prop_desc).

Example harmonized mapping:

ai_model_node_prop_desc harmonized_model_node_prop_desc
PatientData.AgeAtDiagnosis: The age of the patient at the time of diagnosis. standard_patient_record.age_at_diagnosis: Age in years when diagnosed with the condition.
PatientData.LastVisitDate: The date of the last visit to a healthcare provider. standard_patient_record.last_visit_date: Date of most recent medical appointment or consultation.
OutpatSkinCheck.AnchorDateOffset: Days between the specified anchor date and the patient's last completed skin test date. outpat_v_skin_test.DaysFromAnchorDateToEventDate:
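Assembling a benchmark row programmatically can be sketched as follows. The model schemas here are simplified placeholders, not necessarily the exact format the tool expects:

```python
import json

# Simplified node -> {property: description} placeholder models.
source_model = {
    "PatientData": {
        "AgeAtDiagnosis": "The age of the patient at the time of diagnosis."
    }
}
target_model = {
    "standard_patient_record": {
        "age_at_diagnosis": "Age in years when diagnosed with the condition."
    }
}

# The expected mapping is a TSV embedded as a string.
mapping_tsv = (
    "ai_model_node_prop_desc\tharmonized_model_node_prop_desc\n"
    "PatientData.AgeAtDiagnosis: The age of the patient at the time of "
    "diagnosis.\tstandard_patient_record.age_at_diagnosis: Age in years "
    "when diagnosed with the condition."
)

test_case = {
    "input_source_model": source_model,
    "input_target_model": target_model,
    "harmonized_mapping": mapping_tsv,
}

# One JSON object per line makes a valid JSONL benchmark row.
with open("benchmark.jsonl", "w") as f:
    f.write(json.dumps(test_case) + "\n")
```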

Calling get_metrics_for_approach generates a new file with the metrics for each test case in the benchmark appended.
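Reading the metrics back out of the generated TSV can be sketched with the standard csv module. The columns below are stand-ins (only the metrics column name comes from the earlier example); the actual layout of the generated file may differ:

```python
import csv

# Write a stand-in output TSV with the metrics column named as in the
# example above, purely so this sketch is self-contained.
with open("output.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["test_id", "custom_metrics"])
    writer.writerow(["test_1", "0.85"])

# Read the per-test metrics column back out.
with open("output.tsv", newline="") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))
print(rows[0]["custom_metrics"])
```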

Local Setup

Python 3.12 is recommended; newer versions may also work.

This project uses uv (faster than pip and poetry, and easier to work with).

uv sync
uv pip install -e .

Tests

uv run pytest tests/

Note: Tests are fairly minimal at the moment

Other

If for some reason you need to run detect-secrets separately from pre-commit and you have a lot of data or datasets, you can explicitly exclude those directories (they should already be ignored by git):

detect-secrets scan --exclude-files '.*/data/.*|.*/datasets/.*|.*/output/.*'

