Datastew is a python library for intelligent data harmonization using Large Language Model (LLM) vector embeddings.

These details have not been verified by PyPI

Project description

datastew

tests codecov version

Datastew is a python library for intelligent data harmonization using Large Language Model (LLM) vector embeddings.

Installation

pip install datastew

Usage

Harmonizing excel/csv resources

You can directly import common data models, terminology sources or data dictionaries for harmonization directly from a csv, tsv or excel file. An example how to match two separate variable descriptions is shown in datastew/scripts/mapping_excel_example.py:

from datastew.io.source import DataDictionarySource
from datastew.harmonization import map_dictionary_to_dictionary

# Variable and description refer to the corresponding column names in your excel sheet
source = DataDictionarySource("source.xlxs", variable_field="var", description_field="desc")
target = DataDictionarySource("target.xlxs", variable_field="var", description_field="desc")

df = map_dictionary_to_dictionary(source, target)
df.to_excel("result.xlxs")

The resulting file contains the pairwise variable mapping based on the closest similarity for all possible matches as well as a similarity measure per row.

Per default this will use the local MPNet model, which may not yield the optimal performance. If you got an OpenAI API key it is possible to use their embedding API instead. To use your key, create a Vectorizer model and pass it to the function:

from datastew.embedding import Vectorizer
from datastew.harmonization import map_dictionary_to_dictionary

vectorizer = Vectorizer("text-embedding-ada-002", key="your_api_key")
df = map_dictionary_to_dictionary(source, target, vectorizer=vectorizer)

Creating and using stored mappings

Datastew uses a PostgreSQL backend with the pgvector extension to store and query embeddings. This allows for persistent terminology management and high-performance semantic search.

Initialize the repository and embedding model:

First, set up your database engine and ensure the schema is initialized. Datastew uses the psycopg (v3) driver for synchronous communication.

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from datastew.embedding import Vectorizer
from datastew.repository import PostgreSQLRepository

# 1. Define connection string (note the +psycopg driver)
DB_URL = "postgresql+psycopg://user:password@localhost:5432/testdb"
engine = create_engine(DB_URL)

# 2. Initialize the database schema and pgvector extension (Run once)
PostgreSQLRepository.setup_database(engine)

# 3. Create a session factory
SessionLocal = sessionmaker(bind=engine, autoflush=False)

Populate the Repository

Use a session context to add terminologies, concepts, and mappings. Note: You must call session.commit() to persist changes before they become searchable.

vectorizer = Vectorizer()

with SessionLocal() as session: # Inject the session into the repository
    repository = PostgreSQLRepository(session=session, vectorizer=vectorizer)

    # Add a terminology
    terminology = repository.add_terminology(name="snomed CT", short_name="SNOMED")

    # Create a concept
    text1 = "Diabetes mellitus (disorder)"
    concept1 = repository.add_concept(
        terminology_id=terminology.id,
        pref_label=text1,
        concept_identifier="SNOMED:11893007"
    )

    # Add a mapping (this generates the embedding and stores it)
    repository.add_mapping(concept_id=concept1.id, text=text1)

    # Persist the data
    session.commit()

Retrieve Closest Mappings

Query the database by generating an embedding for a new phrase and comparing it against stored records.

with SessionLocal() as session:
    repository = PostgreSQLRepository(session=session, vectorizer=vectorizer)

    query_text = "Sugar sickness"
    embedding = vectorizer.get_embedding(query_text)

    # Retrieve top 2 matches with similarity scores
    results = repository.get_closest_mappings(embedding, similarities=True, limit=2)

    for r in results:
        # Returns a MappingResult object containing the Mapping and a similarity float
        print(f"{r.mapping.concept.pref_label} | Similarity: {r.similarity:.4f}")

output:

snomed CT > Concept ID: 11893007 : Diabetes mellitus (disorder) | Diabetes mellitus (disorder) | Similarity: 0.4735338091850281
snomed CT > Concept ID: 73211009 : Hypertension (disorder) | Hypertension (disorder) | Similarity: 0.2003161907196045

You can also import data from file sources (csv, tsv, xlsx) or from a public API like OLS. An example script to download & compute embeddings for SNOMED from ebi OLS can be found in datastew/scripts/ols_snomed_retrieval.py.

Embedding visualization

You can visualize the embedding space of multiple data dictionary sources with t-SNE plots utilizing different language models. An example how to generate a t-sne plot is shown in datastew/scripts/tsne_visualization.py:

from datastew.embedding import Vectorizer
from datastew.io.source import DataDictionarySource
from datastew.visualisation import plot_embeddings

# Variable and description refer to the corresponding column names in your excel sheet
data_dictionary_source_1 = DataDictionarySource("source1.xlsx", variable_field="var", description_field="desc")
data_dictionary_source_2 = DataDictionarySource("source2.xlsx", variable_field="var", description_field="desc")

vectorizer = Vectorizer()
plot_embeddings([data_dictionary_source_1, data_dictionary_source_2], vectorizer=vectorizer)

t-SNE plot

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

This version

1.0.3

Apr 2, 2026

1.0.2

Mar 30, 2026

1.0.1

Mar 27, 2026

1.0.0

Mar 26, 2026

0.6.1

Nov 10, 2025

0.6.0

Aug 14, 2025

0.5.8

May 30, 2025

0.5.7

May 5, 2025

0.5.6

Apr 23, 2025

0.5.5

Apr 22, 2025

0.5.4

Apr 1, 2025

0.5.3

Mar 31, 2025

0.5.2

Mar 17, 2025

0.5.1

Mar 13, 2025

0.5.0

Mar 10, 2025

0.4.2

Jan 6, 2025

0.4.1

Nov 13, 2024

0.4.0

Oct 21, 2024

0.3.5

Aug 26, 2024

0.3.4

Aug 21, 2024

0.3.3

Jul 15, 2024

0.3.2

Jul 15, 2024

0.3.1

Jul 15, 2024

0.3.0

Jul 9, 2024

0.2.0

Jul 1, 2024

0.1.13

Jun 26, 2024

0.1.12

Jun 13, 2024

0.1.11

Jun 13, 2024

0.1.10

Jun 10, 2024

0.1.9

Jun 3, 2024

0.1.7

Jun 3, 2024

0.1.6

Jun 1, 2024

0.1.5

May 29, 2024

0.1.4

May 29, 2024

0.1.3

May 29, 2024

0.1.2

May 28, 2024

0.1.1

May 28, 2024

0.1.0

May 28, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datastew-1.0.3.tar.gz (32.2 kB view details)

Uploaded Apr 2, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

datastew-1.0.3-py3-none-any.whl (43.9 kB view details)

Uploaded Apr 2, 2026 Python 3

File details

Details for the file datastew-1.0.3.tar.gz.

File metadata

Download URL: datastew-1.0.3.tar.gz
Upload date: Apr 2, 2026
Size: 32.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.11.3 {"installer":{"name":"uv","version":"0.11.3","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for datastew-1.0.3.tar.gz
Algorithm	Hash digest
SHA256	`f680d47cabb9dd3fdbb0c891ce36d47bfe9b871cbd827dc1fb0a5afc4cdc86aa`
MD5	`1f63a1de5fda452295e19add6abb51bb`
BLAKE2b-256	`91ee6a600ef15d9c8750f3b130bd1586f359693b58ba53312eb51ad8b5fa42c2`

See more details on using hashes here.

File details

Details for the file datastew-1.0.3-py3-none-any.whl.

File metadata

Download URL: datastew-1.0.3-py3-none-any.whl
Upload date: Apr 2, 2026
Size: 43.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.11.3 {"installer":{"name":"uv","version":"0.11.3","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for datastew-1.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`73fc0ecfb99095febc9b0ba185dc0b2838917b405005d58d144793341b70915a`
MD5	`31fc5cd48a32d3a3567fceb5e3b89566`
BLAKE2b-256	`4b1c99fb792c7b84d82b5261d62552144a28c2f77dcc70bbd002603371dc1577`

See more details on using hashes here.

datastew 1.0.3

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

datastew

Installation

Usage

Harmonizing excel/csv resources

Creating and using stored mappings

Embedding visualization

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes