Skip to main content

Datastew is a python library for intelligent data harmonization using Large Language Model (LLM) vector embeddings.

Project description

datastew

DOI tests codecov version

Datastew is a python library for intelligent data harmonization using Large Language Model (LLM) vector embeddings.

Installation

pip install datastew

Usage

Harmonizing excel/csv resources

You can directly import common data models, terminology sources or data dictionaries for harmonization directly from a csv, tsv or excel file. An example how to match two separate variable descriptions is shown in datastew/scripts/mapping_excel_example.py:

from datastew.io.source import DataDictionarySource
from datastew.harmonization import map_dictionary_to_dictionary

# Variable and description refer to the corresponding column names in your excel sheet
source = DataDictionarySource("source.xlxs", variable_field="var", description_field="desc")
target = DataDictionarySource("target.xlxs", variable_field="var", description_field="desc")

df = map_dictionary_to_dictionary(source, target)
df.to_excel("result.xlxs")

The resulting file contains the pairwise variable mapping based on the closest similarity for all possible matches as well as a similarity measure per row.

Per default this will use the local MPNet model, which may not yield the optimal performance. If you got an OpenAI API key it is possible to use their embedding API instead. To use your key, create a Vectorizer model and pass it to the function:

from datastew.embedding import Vectorizer
from datastew.harmonization import map_dictionary_to_dictionary

vectorizer = Vectorizer("text-embedding-ada-002", key="your_api_key")
df = map_dictionary_to_dictionary(source, target, vectorizer=vectorizer)

Creating and using stored mappings

Datastew uses a PostgreSQL backend with the pgvector extension to store and query embeddings. This allows for persistent terminology management and high-performance semantic search.

  1. Initialize the repository and embedding model:

    First, set up your database engine and ensure the schema is initialized. Datastew uses the psycopg (v3) driver for synchronous communication.

    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker
    from datastew.embedding import Vectorizer
    from datastew.repository import PostgreSQLRepository
    
    # 1. Define connection string (note the +psycopg driver)
    DB_URL = "postgresql+psycopg://user:password@localhost:5432/testdb"
    engine = create_engine(DB_URL)
    
    # 2. Initialize the database schema and pgvector extension (Run once)
    PostgreSQLRepository.setup_database(engine)
    
    # 3. Create a session factory
    SessionLocal = sessionmaker(bind=engine, autoflush=False)
    
  2. Populate the Repository

    Use a session context to add terminologies, concepts, and mappings. Note: You must call session.commit() to persist changes before they become searchable.

    vectorizer = Vectorizer()
    
    with SessionLocal() as session: # Inject the session into the repository
        repository = PostgreSQLRepository(session=session, vectorizer=vectorizer)
    
        # Add a terminology
        terminology = repository.add_terminology(name="snomed CT", short_name="SNOMED")
    
        # Create a concept
        text1 = "Diabetes mellitus (disorder)"
        concept1 = repository.add_concept(
            terminology_id=terminology.id,
            pref_label=text1,
            concept_identifier="SNOMED:11893007"
        )
    
        # Add a mapping (this generates the embedding and stores it)
        repository.add_mapping(concept_id=concept1.id, text=text1)
    
        # Persist the data
        session.commit()
    
  3. Retrieve Closest Mappings

    Query the database by generating an embedding for a new phrase and comparing it against stored records.

    with SessionLocal() as session:
        repository = PostgreSQLRepository(session=session, vectorizer=vectorizer)
    
        query_text = "Sugar sickness"
        embedding = vectorizer.get_embedding(query_text)
    
        # Retrieve top 2 matches with similarity scores
        results = repository.get_closest_mappings(embedding, similarities=True, limit=2)
    
        for r in results:
            # Returns a MappingResult object containing the Mapping and a similarity float
            print(f"{r.mapping.concept.pref_label} | Similarity: {r.similarity:.4f}")
    

output:

snomed CT > Concept ID: 11893007 : Diabetes mellitus (disorder) | Diabetes mellitus (disorder) | Similarity: 0.4735338091850281
snomed CT > Concept ID: 73211009 : Hypertension (disorder) | Hypertension (disorder) | Similarity: 0.2003161907196045

You can also import data from file sources (csv, tsv, xlsx) or from a public API like OLS. An example script to download & compute embeddings for SNOMED from ebi OLS can be found in datastew/scripts/ols_snomed_retrieval.py.


Embedding visualization

You can visualize the embedding space of multiple data dictionary sources with t-SNE plots utilizing different language models. An example how to generate a t-sne plot is shown in datastew/scripts/tsne_visualization.py:

from datastew.embedding import Vectorizer
from datastew.io.source import DataDictionarySource
from datastew.visualisation import plot_embeddings

# Variable and description refer to the corresponding column names in your excel sheet
data_dictionary_source_1 = DataDictionarySource("source1.xlsx", variable_field="var", description_field="desc")
data_dictionary_source_2 = DataDictionarySource("source2.xlsx", variable_field="var", description_field="desc")

vectorizer = Vectorizer()
plot_embeddings([data_dictionary_source_1, data_dictionary_source_2], vectorizer=vectorizer)

t-SNE plot

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datastew-1.0.3.tar.gz (32.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

datastew-1.0.3-py3-none-any.whl (43.9 kB view details)

Uploaded Python 3

File details

Details for the file datastew-1.0.3.tar.gz.

File metadata

  • Download URL: datastew-1.0.3.tar.gz
  • Upload date:
  • Size: 32.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.11.3 {"installer":{"name":"uv","version":"0.11.3","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for datastew-1.0.3.tar.gz
Algorithm Hash digest
SHA256 f680d47cabb9dd3fdbb0c891ce36d47bfe9b871cbd827dc1fb0a5afc4cdc86aa
MD5 1f63a1de5fda452295e19add6abb51bb
BLAKE2b-256 91ee6a600ef15d9c8750f3b130bd1586f359693b58ba53312eb51ad8b5fa42c2

See more details on using hashes here.

File details

Details for the file datastew-1.0.3-py3-none-any.whl.

File metadata

  • Download URL: datastew-1.0.3-py3-none-any.whl
  • Upload date:
  • Size: 43.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.11.3 {"installer":{"name":"uv","version":"0.11.3","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for datastew-1.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 73fc0ecfb99095febc9b0ba185dc0b2838917b405005d58d144793341b70915a
MD5 31fc5cd48a32d3a3567fceb5e3b89566
BLAKE2b-256 4b1c99fb792c7b84d82b5261d62552144a28c2f77dcc70bbd002603371dc1577

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page