Intelligent data steward toolbox using Large Language Model embeddings for automated Data-Harmonization.

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: Apache Software License

Project description

datastew

Datastew is a Python library for intelligent data harmonization using Large Language Model (LLM) vector embeddings.

Installation

pip install datastew

Usage

Harmonizing Excel/CSV resources

You can import common data models, terminology sources, or data dictionaries for harmonization directly from a CSV, TSV, or Excel file. An example of how to match two separate variable descriptions is shown in datastew/scripts/mapping_excel_example.py:

from datastew.process.parsing import DataDictionarySource
from datastew.process.mapping import map_dictionary_to_dictionary

# variable_field and description_field name the corresponding columns in your Excel sheet
source = DataDictionarySource("source.xlsx", variable_field="var", description_field="desc")
target = DataDictionarySource("target.xlsx", variable_field="var", description_field="desc")

# Compute the mapping and write it to an Excel file
df = map_dictionary_to_dictionary(source, target)
df.to_excel("result.xlsx")

The resulting file contains the pairwise variable mapping based on the closest similarity for all possible matches as well as a similarity measure per row.
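
To make the idea of "closest similarity" concrete, here is a minimal sketch of pairwise matching over embedding vectors. It is not datastew's implementation: the function names cosine_similarity and closest_matches are hypothetical, the toy vectors are made up, and the use of cosine similarity is an assumption (it is a common choice for comparing embeddings).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest_matches(source_vecs, target_vecs):
    """For each source vector, return (index of best target, similarity)."""
    matches = []
    for vec in source_vecs:
        sims = [cosine_similarity(vec, t) for t in target_vecs]
        best = max(range(len(sims)), key=sims.__getitem__)
        matches.append((best, sims[best]))
    return matches


# Toy 2-d "embeddings", invented purely for illustration
source = [[1.0, 0.0], [0.0, 1.0]]
target = [[0.0, 2.0], [3.0, 0.1]]

matches = closest_matches(source, target)
for src_idx, (tgt_idx, sim) in enumerate(matches):
    print(f"source {src_idx} -> target {tgt_idx} (similarity {sim:.3f})")
```

Each source row ends up with exactly one best target and a score, which mirrors the "mapping plus similarity per row" shape of the result described above.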

By default, this uses the local MPNet model, which may not yield optimal performance. If you have an OpenAI API key, you can use their embedding API instead. To use your key, create a GPT4Adapter model and pass it to the function:

from datastew.embedding import GPT4Adapter

embedding_model = GPT4Adapter(key="your_api_key")
df = map_dictionary_to_dictionary(source, target, embedding_model=embedding_model)

Creating and using stored mappings

A simple example of how to initialize an in-memory database and compute a similarity mapping is shown in datastew/scripts/mapping_db_example.py:

from datastew.repository.sqllite import SQLLiteRepository
from datastew.repository.model import Terminology, Concept, Mapping
from datastew.embedding import MPNetAdapter

# omit mode to create a permanent db file instead
repository = SQLLiteRepository(mode="memory")
embedding_model = MPNetAdapter()

terminology = Terminology("snomed CT", "SNOMED")

text1 = "Diabetes mellitus (disorder)"
concept1 = Concept(terminology, text1, "Concept ID: 11893007")
mapping1 = Mapping(concept1, text1, embedding_model.get_embedding(text1))

text2 = "Hypertension (disorder)"
concept2 = Concept(terminology, text2, "Concept ID: 73211009")
mapping2 = Mapping(concept2, text2, embedding_model.get_embedding(text2))

repository.store_all([terminology, concept1, mapping1, concept2, mapping2])

text_to_map = "Sugar sickness"
embedding = embedding_model.get_embedding(text_to_map)
mappings, similarities = repository.get_closest_mappings(embedding, limit=2)
for mapping, similarity in zip(mappings, similarities):
    print(f"Similarity: {similarity} -> {mapping}")

Output:

Similarity: 0.47353370635583486 -> Concept ID: 11893007 : Diabetes mellitus (disorder) | Diabetes mellitus (disorder)
Similarity: 0.20031612264852067 -> Concept ID: 73211009 : Hypertension (disorder) | Hypertension (disorder)
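
The lookup step above can be pictured as a nearest-neighbour search over stored vectors. The following is a rough sketch under stated assumptions, not the repository's actual code: get_closest is a hypothetical helper, the three-dimensional vectors are invented stand-ins for real embeddings, and cosine similarity is assumed as the ranking measure.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def get_closest(stored, query, limit=2):
    """Return the `limit` most similar labels and their similarities, best first.

    `stored` is a list of (label, vector) pairs.
    """
    scored = [(cosine_similarity(vec, query), label) for label, vec in stored]
    top = heapq.nlargest(limit, scored)
    return [label for _, label in top], [sim for sim, _ in top]


# Made-up vectors; real embeddings have hundreds of dimensions
stored = [
    ("Diabetes mellitus (disorder)", [0.9, 0.1, 0.0]),
    ("Hypertension (disorder)", [0.1, 0.9, 0.0]),
    ("Asthma (disorder)", [0.0, 0.1, 0.9]),
]

labels, sims = get_closest(stored, [1.0, 0.2, 0.0], limit=2)
for label, sim in zip(labels, sims):
    print(f"Similarity: {sim:.3f} -> {label}")
```

A linear scan like this is fine for a handful of concepts; for a full terminology, an indexed vector store would usually be the better fit.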

You can also import data from file sources (CSV, TSV, XLSX) or from a public API such as OLS. An example script that downloads SNOMED from the EBI OLS and computes embeddings for it can be found in datastew/scripts/ols_snomed_retrieval.py.
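
As a rough illustration of what reading a file source involves, the sketch below parses a CSV data dictionary into (variable, description) pairs using only the standard library. It is not part of datastew: read_data_dictionary is a hypothetical helper, the sample rows are invented, and the column names var and desc simply follow the earlier Excel example.

```python
import csv
import io


def read_data_dictionary(handle, variable_field, description_field, delimiter=","):
    """Read (variable, description) pairs, skipping rows with a missing value."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    entries = []
    for row in reader:
        variable = (row.get(variable_field) or "").strip()
        description = (row.get(description_field) or "").strip()
        if variable and description:
            entries.append((variable, description))
    return entries


# In-memory stand-in for a CSV file; the "bmi" row has no description
sample = io.StringIO(
    "var,desc\n"
    "age,Age of the participant in years\n"
    "bmi,\n"
    "sex,Biological sex\n"
)

entries = read_data_dictionary(sample, "var", "desc")
for variable, description in entries:
    print(f"{variable}: {description}")
```

Passing delimiter="\t" would handle a TSV file the same way. Rows without a description are dropped here because there would be nothing to embed for them.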

Release history

- 0.3.3 (Jul 15, 2024)
- 0.3.2 (Jul 15, 2024)
- 0.3.1 (Jul 15, 2024), this version
- 0.3.0 (Jul 9, 2024)
- 0.2.0 (Jul 1, 2024)
- 0.1.13 (Jun 26, 2024)
- 0.1.12 (Jun 13, 2024)
- 0.1.11 (Jun 13, 2024)
- 0.1.10 (Jun 10, 2024)
- 0.1.9 (Jun 3, 2024)
- 0.1.7 (Jun 3, 2024)
- 0.1.6 (Jun 1, 2024)
- 0.1.5 (May 29, 2024)
- 0.1.4 (May 29, 2024)
- 0.1.3 (May 29, 2024)
- 0.1.2 (May 28, 2024)
- 0.1.1 (May 28, 2024)
- 0.1.0 (May 28, 2024)

Download files

Source Distribution

datastew-0.3.1.tar.gz (27.3 kB)

Uploaded Jul 15, 2024 Source

Built Distribution

datastew-0.3.1-py3-none-any.whl (32.1 kB)

Uploaded Jul 15, 2024 Python 3

Hashes for datastew-0.3.1.tar.gz
Algorithm	Hash digest
SHA256	`3faed1a98f088c08a402ded4902f02c01e2b2f759fd44829b195580350b3cca3`
MD5	`3468b9394d4e25005a102ada038e9df2`
BLAKE2b-256	`6b648d98976af5e58979e3fdc369a9c4d03746a7fae7ada2418dba2d539be621`

Hashes for datastew-0.3.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`68edf829e1f2f2398715715f4ec527d13db2d97680054e282e04b13b39f03df3`
MD5	`1e217a519331b64c277cd33d62d05d74`
BLAKE2b-256	`1486d9f885df307fc90619b483708b9f6708c767da966116b302ca36d37d2212`