A friendly way to link, aggregate, cluster, and de-duplicate dataframes using large language models.

LinkTransformer

LinkTransformer is a Python package for semantic record linkage, candidate retrieval, row transformation, clustering, and text classification over tabular data.

Tutorials

More tutorials are coming soon.

Installation

pip install linktransformer

Quick Start

import os
import pandas as pd
import linktransformer as lt

left_df = pd.DataFrame({"CompanyName": ["Tech Corporation"], "Country": ["USA"]})
right_df = pd.DataFrame({"CompanyName": ["Tech Corp"], "Country": ["USA"]})

out = lt.merge(
    left_df,
    right_df,
    on=["CompanyName", "Country"],
    model="sentence-transformers/all-MiniLM-L6-v2",
)
print(out[["CompanyName_x", "CompanyName_y", "score"]])

New release: merge_k_judge (End-to-End Record Linkage)

merge_k_judge is the recommended end-to-end linkage API when you want both embedding-based retrieval and LLM adjudication with confidence scores. It works in three steps:

  1. Retrieve top-k candidates with embeddings (merge_knn)
  2. Judge each candidate pair with an LLM
  3. Return match decisions and confidence scores

judged = lt.merge_k_judge(
    df1=left_df,
    df2=right_df,
    on=["CompanyName", "Country"],
    k=5,
    knn_sbert_model="sentence-transformers/all-MiniLM-L6-v2",
    judge_llm_model="gpt-4o-mini",
    llm_provider="openai",
    openai_key=os.getenv("OPENAI_API_KEY"),
)

# key output columns:
# - score (retrieval similarity)
# - is_match (bool)
# - confidence (float in [0, 1] when available)

You can also combine providers (for example OpenAI embeddings retrieval + Gemini judge) by setting knn_api_model, judge_llm_model, and llm_provider explicitly.
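
The three-step workflow above can be sketched without any model or API call. This is a minimal illustration, not LinkTransformer's implementation: difflib string similarity stands in for embedding similarity, and stub_judge is a hypothetical placeholder for the LLM call. The function names and the 0.6 cutoff are assumptions made for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Toy stand-in for embedding similarity (character-level ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def retrieve_top_k(query: str, candidates: list[str], k: int) -> list[tuple[str, float]]:
    """Step 1: rank all candidates by similarity and keep the best k."""
    scored = [(cand, similarity(query, cand)) for cand in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def stub_judge(left: str, right: str) -> tuple[bool, float]:
    """Step 2: hypothetical judge. A real workflow would ask an LLM here."""
    confidence = similarity(left, right)
    return confidence >= 0.6, confidence  # 0.6 is an arbitrary cutoff


def retrieve_and_judge(left: list[str], right: list[str], k: int) -> list[dict]:
    """Step 3: return one row per retrieved pair with decision and confidence."""
    rows = []
    for query in left:
        for cand, score in retrieve_top_k(query, right, k):
            is_match, confidence = stub_judge(query, cand)
            rows.append(
                {
                    "left": query,
                    "right": cand,
                    "score": score,
                    "is_match": is_match,
                    "confidence": confidence,
                }
            )
    return rows


results = retrieve_and_judge(
    ["Tech Corporation"], ["Tech Corp", "Acme Ltd", "Globex Inc"], k=2
)
for row in results:
    print(row)
```

Keeping retrieval and judging as separate steps means the expensive judge only sees k candidates per row rather than every possible pair.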

Core APIs

1) Link two dataframes

  • lt.merge(...): semantic 1:1 / 1:m / m:1 linkage.
  • lt.merge_knn(...): top-k candidate retrieval.
  • lt.merge_blocking(...): fuzzy merge restricted to blocks of rows that match exactly on the blocking columns.
  • lt.aggregate_rows(...): map fine-grained rows to coarser labels.

matches = lt.merge_knn(
    left_df,
    right_df,
    on=["CompanyName", "Country"],
    model="sentence-transformers/all-MiniLM-L6-v2",
    k=3,
)
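
To illustrate the idea behind blocking, the sketch below only compares rows that share an exact value in a blocking column, then picks the most similar candidate inside that block. It is a standard-library approximation under stated assumptions: blocked_best_match is a hypothetical helper, difflib similarity replaces embeddings, and none of this reflects the actual signature of lt.merge_blocking.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Toy stand-in for embedding similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def blocked_best_match(left_rows, right_rows, text_key, block_key):
    """For each left row, find the most similar right row in the same block."""
    # Group right-hand rows by their exact blocking value.
    blocks = defaultdict(list)
    for row in right_rows:
        blocks[row[block_key]].append(row)

    matches = []
    for lrow in left_rows:
        candidates = blocks.get(lrow[block_key], [])
        if not candidates:
            continue  # no right-hand row shares this block value
        best = max(candidates, key=lambda r: similarity(lrow[text_key], r[text_key]))
        score = similarity(lrow[text_key], best[text_key])
        matches.append((lrow[text_key], best[text_key], score))
    return matches


left = [
    {"CompanyName": "Tech Corporation", "Country": "USA"},
    {"CompanyName": "Tech Corporation", "Country": "Canada"},
]
right = [
    {"CompanyName": "Tech Corp", "Country": "USA"},
    {"CompanyName": "Tech Corporation Ltd", "Country": "UK"},
]

block_matches = blocked_best_match(left, right, "CompanyName", "Country")
print(block_matches)
```

Here the USA row is matched to "Tech Corp" even though "Tech Corporation Ltd" is textually closer, because the latter sits in a different block. The Canada row has no block partner and is skipped.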

2) Transform rows with LLM prompts

Use lt.transform_rows(...) to normalize, rewrite, or standardize values in one or more columns. For example, fix OCR errors in a column or standardize names.

cleaned = lt.transform_rows(
    left_df,
    on=["CompanyName", "Country"],
    model="gpt-4o-mini",
    openai_key=os.getenv("OPENAI_API_KEY"),
    openai_prompt=(
        "Standardize organization names and country strings for record linkage. "
        "Return a JSON list in the same order."
    ),
)
# adds: transformed_CompanyName-Country
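
Because the prompt asks for "a JSON list in the same order", it can be useful to check a reply before relying on positions. The sketch below shows one way to do that with a mocked reply. apply_json_list_response is a hypothetical helper, and how LinkTransformer itself parses or validates model output is not described here.

```python
import json


def apply_json_list_response(values: list[str], response_text: str) -> list[str]:
    """Parse a JSON list reply and check it lines up with the inputs."""
    parsed = json.loads(response_text)
    if not isinstance(parsed, list):
        raise ValueError("expected a JSON list")
    if len(parsed) != len(values):
        # A length mismatch means positions can no longer be trusted.
        raise ValueError(f"expected {len(values)} items, got {len(parsed)}")
    return [str(item) for item in parsed]


originals = ["TECH C0RPORATION", "acme ltd."]
mock_reply = '["Tech Corporation", "Acme Ltd"]'  # stands in for an LLM response

transformed = apply_json_list_response(originals, mock_reply)
for before, after in zip(originals, transformed):
    print(f"{before!r} -> {after!r}")
```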

3) Cluster and deduplicate

  • lt.cluster_rows(...): cluster semantically similar rows.
  • lt.dedup_rows(...): cluster + keep representative rows.

deduped = lt.dedup_rows(
    left_df,
    on="CompanyName",
    model="sentence-transformers/all-MiniLM-L6-v2",
    cluster_type="agglomerative",
    cluster_params={"threshold": 0.7},
)
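
The "cluster + keep representative rows" idea can be sketched with the standard library alone. This example groups names with a single-linkage style union-find over difflib similarity and keeps the first row of each group. That is a simplification chosen for brevity, not necessarily what cluster_type="agglomerative" does, and the cutoff here is a similarity floor, which may differ from how the library interprets threshold in cluster_params.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Toy stand-in for embedding similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_by_threshold(names: list[str], min_similarity: float) -> list[int]:
    """Link any two names at or above the cutoff; return a cluster label per name."""
    parent = list(range(len(names)))

    def find(i: int) -> int:
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if similarity(names[i], names[j]) >= min_similarity:
                parent[find(j)] = find(i)  # merge j's cluster into i's
    return [find(i) for i in range(len(names))]


def dedup(names: list[str], min_similarity: float) -> list[str]:
    """Keep the first name seen in each cluster as its representative."""
    labels = cluster_by_threshold(names, min_similarity)
    seen, kept = set(), []
    for name, label in zip(names, labels):
        if label not in seen:
            seen.add(label)
            kept.append(name)
    return kept


names = ["Tech Corporation", "Tech Corp", "Acme Ltd", "ACME Ltd."]
deduped_names = dedup(names, min_similarity=0.7)
print(deduped_names)
```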

4) Evaluate matched pairs

  • lt.evaluate_pairs(...): similarity over known pairs.
  • lt.all_pair_combos_evaluate(...): dense pairwise scoring.
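
As a rough picture of "similarity over known pairs", the sketch below scores a list of labeled pairs with cosine similarity. The two-dimensional vectors are made up for the example (a real model would produce the embeddings), and the helper names are hypothetical rather than part of the LinkTransformer API.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy embeddings, invented for illustration only.
toy_embeddings = {
    "Tech Corporation": [1.0, 0.0],
    "Tech Corp": [0.8, 0.6],
    "Acme Ltd": [0.0, 1.0],
}

known_pairs = [
    ("Tech Corporation", "Tech Corp"),
    ("Tech Corporation", "Acme Ltd"),
]

pair_scores = [
    (left, right, cosine(toy_embeddings[left], toy_embeddings[right]))
    for left, right in known_pairs
]
for left, right, score in pair_scores:
    print(f"{left} / {right}: {score:.2f}")
```

Scoring every combination instead of a fixed list of pairs would replace known_pairs with the full cross product of the two sides.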

5) Classification

  • lt.classify_rows(...): classify rows with Hugging Face or OpenAI chat models.
  • lt.train_clf_model(...): train a custom row classifier.

6) Train linkage models

  • lt.train_model(...): train a linkage model from paired or clustered data.

Provider Notes

  • OpenAI key: set OPENAI_API_KEY or pass openai_key.
  • Gemini key: set GEMINI_API_KEY or pass gemini_key.
  • API embedding models and local SBERT models are both supported.
  • For multi-column API retrieval, LinkTransformer serializes columns safely using <SEP>.
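
To make the last note concrete, here is a minimal sketch of joining several column values into one string with a <SEP> delimiter before sending it to an embedding API. serialize_row is a hypothetical helper, and the handling of missing values is an assumption. The exact serialization LinkTransformer performs may differ.

```python
SEP = "<SEP>"


def serialize_row(row: dict, columns: list[str]) -> str:
    """Join the selected column values into a single delimited string."""
    # Assumption: missing or None values become empty strings.
    parts = ["" if row.get(col) is None else str(row[col]) for col in columns]
    return SEP.join(parts)


row = {"CompanyName": "Tech Corp", "Country": "USA", "Employees": 120}
serialized = serialize_row(row, ["CompanyName", "Country"])
print(serialized)  # Tech Corp<SEP>USA
```

An explicit delimiter keeps column boundaries visible, so "Tech" + "Corp USA" and "Tech Corp" + "USA" do not collapse into the same string.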

Test Naming Convention

Tests use test_lt_* naming to mirror the package API surface and make workflows discoverable.

Contributing

Issues and pull requests are welcome.

License

This project is licensed under the MIT License. See LICENSE.

Maintainers

  • Sam Jones (samuelcaronnajones)
  • Abhishek Arora (econabhishek)
  • Yiyang Chen (oooyiyangc)
