A friendly way to link, aggregate, cluster, and de-duplicate dataframes using large language models.

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

LinkTransformer

LinkTransformer is a Python package for semantic record linkage, candidate retrieval, row transformation, clustering, and text classification over tabular data.

Paper: https://arxiv.org/abs/2309.00789
Website: https://linktransformer.github.io/
Demo video: https://www.youtube.com/watch?v=Sn47nmCvV9M

Tutorials

Link records with LinkTransformer: https://colab.research.google.com/drive/1OqUB8sqpUvrnC8oa_1RoOUzV6DaAKL4N?usp=sharing
Train your own LinkTransformer model: https://colab.research.google.com/drive/1tHitPGjMMI2Nvh4wwA8rdcbYfbLaJDvg?usp=sharing
Classify text with LinkTransformer: https://colab.research.google.com/drive/1hSh_p8j7LP2RfdtxrPslOfnogC_CbYw5?usp=sharing
Demo app (Hugging Face Space): https://huggingface.co/spaces/96abhishekarora/linktransformer_merge
Feature deck: https://www.dropbox.com/scl/fi/dquxru8bndlyf9na14cw6/A-python-package-to-do-easy-record-linkage-using-Transformer-models.pdf?rlkey=fiv7j6c0vgl901y940054eptk&dl=0

Installation

pip install linktransformer

Quick Start

import os
import pandas as pd
import linktransformer as lt

left_df = pd.DataFrame({"CompanyName": ["Tech Corporation"], "Country": ["USA"]})
right_df = pd.DataFrame({"CompanyName": ["Tech Corp"], "Country": ["USA"]})

out = lt.merge(
    left_df,
    right_df,
    on=["CompanyName", "Country"],
    model="sentence-transformers/all-MiniLM-L6-v2",
)
print(out[["CompanyName_x", "CompanyName_y", "score"]])

New Release: End-to-End Record Linkage with `merge_k_judge`

merge_k_judge is the recommended end-to-end linkage API when you want embedding-based retrieval followed by LLM adjudication with confidence scores. It runs three steps:

Retrieve top-k candidates with embeddings (merge_knn)
Judge each candidate pair with an LLM
Return match decisions and confidence scores

judged = lt.merge_k_judge(
    df1=left_df,
    df2=right_df,
    on=["CompanyName", "Country"],
    k=5,
    knn_sbert_model="sentence-transformers/all-MiniLM-L6-v2",
    judge_llm_model="gpt-4o-mini",
    llm_provider="openai",
    openai_key=os.getenv("OPENAI_API_KEY"),
)

# key output columns:
# - score (retrieval similarity)
# - is_match (bool)
# - confidence (float in [0, 1] when available)

You can also combine providers (for example, OpenAI embeddings for retrieval and a Gemini model as judge) by setting knn_api_model, judge_llm_model, and llm_provider explicitly.
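To make the retrieve-then-judge pattern concrete, here is a minimal offline sketch of the three steps. It is not LinkTransformer's implementation: `similarity` uses `difflib` as a stand-in for embedding similarity, `toy_judge` is a threshold rule standing in for the LLM judge, and all function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Toy stand-in for embedding similarity (character overlap ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def retrieve_top_k(left, right, k):
    """Step 1: for every left record, keep the k most similar right records."""
    candidates = []
    for left_text in left:
        scored = sorted(((similarity(left_text, r), r) for r in right), reverse=True)
        candidates.extend((left_text, r, score) for score, r in scored[:k])
    return candidates


def toy_judge(left_text, right_text, score):
    """Step 2: placeholder judge. A real judge would prompt an LLM with both texts."""
    is_match = score >= 0.7  # assumed cut-off, for illustration only
    confidence = score if is_match else 1.0 - score
    return is_match, round(confidence, 2)


def retrieve_then_judge(left, right, k=2):
    """Step 3: one output row per candidate pair, with decision and confidence."""
    rows = []
    for left_text, right_text, score in retrieve_top_k(left, right, k):
        is_match, confidence = toy_judge(left_text, right_text, score)
        rows.append({
            "left": left_text,
            "right": right_text,
            "score": round(score, 2),
            "is_match": is_match,
            "confidence": confidence,
        })
    return rows


left = ["Tech Corporation, USA"]
right = ["Tech Corp, USA", "Acme Widgets, UK", "Globex GmbH, Germany"]
judged_rows = retrieve_then_judge(left, right, k=2)
for row in judged_rows:
    print(row)
```

Retrieval keeps the judge's workload at k pairs per left row instead of every possible pair, which is the main reason to split the workflow this way.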

Core APIs

1) Link two dataframes

lt.merge(...): semantic 1:1 / 1:m / m:1 linkage.
lt.merge_knn(...): top-k candidate retrieval.
lt.merge_blocking(...): run the merge within blocks, so fuzzy matching only happens between rows that match exactly on the blocking columns.
lt.aggregate_rows(...): map fine-grained rows to coarser labels.

matches = lt.merge_knn(
    left_df,
    right_df,
    on=["CompanyName", "Country"],
    model="sentence-transformers/all-MiniLM-L6-v2",
    k=3,
)
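The idea behind blocking can be illustrated without the library. The sketch below is an assumption-laden stand-in, not `lt.merge_blocking` itself: it groups right-hand rows by an exact blocking value and only fuzzy-matches within the same group, using `difflib` in place of embeddings. The helper name `blocked_best_match` is hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def blocked_best_match(left_rows, right_rows, block_key, text_key):
    """For each left row, find the most similar right row within the same block."""
    # Index right rows by the exact value of the blocking column.
    blocks = defaultdict(list)
    for row in right_rows:
        blocks[row[block_key]].append(row)

    matches = []
    for left in left_rows:
        best, best_score = None, 0.0
        # Only rows that share the block value are ever compared.
        for right in blocks.get(left[block_key], []):
            score = SequenceMatcher(
                None, left[text_key].lower(), right[text_key].lower()
            ).ratio()
            if score > best_score:
                best, best_score = right, score
        matched_text = best[text_key] if best else None
        matches.append((left[text_key], matched_text, round(best_score, 2)))
    return matches


left_rows = [
    {"CompanyName": "Tech Corporation", "Country": "USA"},
    {"CompanyName": "Acme Ltd", "Country": "UK"},
]
right_rows = [
    {"CompanyName": "Tech Corp", "Country": "USA"},
    {"CompanyName": "Acme Limited", "Country": "France"},
]
blocked = blocked_best_match(left_rows, right_rows, "Country", "CompanyName")
for item in blocked:
    print(item)
```

Here "Acme Ltd" gets no match even though "Acme Limited" is textually close, because the two rows sit in different country blocks.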

2) Transform rows with LLM prompts

Use lt.transform_rows(...) to normalize, rewrite, or standardize values in one or more columns. For example, fix OCR errors in a column or standardize names.

cleaned = lt.transform_rows(
    left_df,
    on=["CompanyName", "Country"],
    model="gpt-4o-mini",
    openai_key=os.getenv("OPENAI_API_KEY"),
    openai_prompt=(
        "Standardize organization names and country strings for record linkage. "
        "Return a JSON list in the same order."
    ),
)
# adds: transformed_CompanyName-Country
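Because the prompt asks for "a JSON list in the same order", it can help to check a reply before trusting it. The following is a small sketch of such a check, assuming the reply arrives as a JSON string; `apply_json_list` is a hypothetical helper and the reply is hard-coded so the example runs offline. It does not describe how LinkTransformer parses responses internally.

```python
import json


def apply_json_list(values, response_text):
    """Parse a JSON list reply and pair it with the input values by position."""
    parsed = json.loads(response_text)
    if not isinstance(parsed, list):
        raise ValueError("expected a JSON list")
    if len(parsed) != len(values):
        raise ValueError(f"expected {len(values)} items, got {len(parsed)}")
    return list(zip(values, parsed))


inputs = ["Tech Corporation USA", "Tech Corp. U.S.A."]
# Hypothetical model reply, not real model output.
reply = '["Tech Corporation, United States", "Tech Corporation, United States"]'
pairs = apply_json_list(inputs, reply)
for original, transformed in pairs:
    print(f"{original!r} -> {transformed!r}")
```

A length mismatch raises instead of silently misaligning rows, which matters when the transformed values are written back into a dataframe column.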

3) Cluster and deduplicate

lt.cluster_rows(...): cluster semantically similar rows.
lt.dedup_rows(...): cluster + keep representative rows.

deduped = lt.dedup_rows(
    left_df,
    on="CompanyName",
    model="sentence-transformers/all-MiniLM-L6-v2",
    cluster_type="agglomerative",
    cluster_params={"threshold": 0.7},
)
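As a rough illustration of "cluster + keep representative rows", the sketch below groups strings whose similarity reaches a threshold and keeps the first member of each group. It is a simplified single-linkage stand-in built on `difflib` and union-find, not the agglomerative clustering the library runs, and the threshold here is a similarity cut-off chosen for the toy data; the function names are hypothetical.

```python
from difflib import SequenceMatcher


def cluster_by_threshold(texts, threshold):
    """Label each text with the index of the first member of its cluster."""
    parent = list(range(len(texts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            score = SequenceMatcher(None, texts[i].lower(), texts[j].lower()).ratio()
            if score >= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    # Attach the larger root to the smaller one so the root
                    # is always the earliest row in the cluster.
                    parent[max(root_i, root_j)] = min(root_i, root_j)
    return [find(i) for i in range(len(texts))]


def dedup(texts, threshold=0.7):
    """Keep one representative (the earliest row) per cluster."""
    labels = cluster_by_threshold(texts, threshold)
    return [text for i, text in enumerate(texts) if labels[i] == i]


names = ["Tech Corporation", "Tech Corp", "Acme Widgets", "Tech Corporation Inc"]
labels = cluster_by_threshold(names, 0.7)
representatives = dedup(names, 0.7)
print(labels)
print(representatives)
```

Keeping the earliest row is only one possible rule for choosing a representative; the longest or most frequent value are equally plausible choices.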

4) Evaluate matched pairs

lt.evaluate_pairs(...): similarity over known pairs.
lt.all_pair_combos_evaluate(...): dense pairwise scoring.
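The difference between the two evaluation modes is the shape of the result: one score per known pair versus a full left-by-right matrix. A minimal sketch with toy 2-d vectors in place of real embeddings (the function names are hypothetical and do not mirror the library's signatures):

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_known_pairs(left_vecs, right_vecs):
    """One score per aligned pair: row i on the left vs row i on the right."""
    return [cosine(u, v) for u, v in zip(left_vecs, right_vecs)]


def score_all_pairs(left_vecs, right_vecs):
    """Dense scoring: a len(left) x len(right) matrix of similarities."""
    return [[cosine(u, v) for v in right_vecs] for u in left_vecs]


left_vecs = [[1.0, 0.0], [0.0, 1.0]]
right_vecs = [[1.0, 0.0], [1.0, 1.0]]

pair_scores = score_known_pairs(left_vecs, right_vecs)
dense_scores = score_all_pairs(left_vecs, right_vecs)
print(pair_scores)
print(dense_scores)
```

The pairwise scores are the diagonal of the dense matrix, so dense scoring costs len(left) × len(right) comparisons where paired scoring costs only one per row.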

5) Classification

lt.classify_rows(...): classify rows with HF or OpenAI chat models.
lt.train_clf_model(...): train a custom row classifier.

6) Train linkage models

lt.train_model(...): train a linkage model from paired or clustered data.

Provider Notes

OpenAI key: set OPENAI_API_KEY or pass openai_key.
Gemini key: set GEMINI_API_KEY or pass gemini_key.
API embedding models and local SBERT models are both supported.
For multi-column API retrieval, LinkTransformer serializes columns safely using <SEP>.
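Two of the notes above can be sketched in a few lines: falling back from an explicit key to an environment variable, and joining several columns with a separator token. Both helpers are hypothetical illustrations, not the package's internal code; the only detail taken from the notes is the `<SEP>` token and the environment variable names.

```python
import os

SEP = "<SEP>"  # separator token named in the notes above


def serialize_row(row, columns, sep=SEP):
    """Join the selected column values into one string for an embedding API."""
    return sep.join(str(row[col]) for col in columns)


def resolve_api_key(explicit_key, env_var):
    """Prefer an explicitly passed key, else fall back to the environment."""
    return explicit_key if explicit_key else os.getenv(env_var)


columns = ["CompanyName", "Country"]
row_a = {"CompanyName": "TechCorp", "Country": "USA"}
row_b = {"CompanyName": "Tech", "Country": "CorpUSA"}

# Plain concatenation cannot tell these two rows apart.
plain_a = "".join(row_a[c] for c in columns)
plain_b = "".join(row_b[c] for c in columns)
print(plain_a == plain_b)

# With a separator the column boundary is preserved.
print(serialize_row(row_a, columns))
print(serialize_row(row_b, columns))
```

The separator keeps column boundaries visible in the serialized text, which is one plausible reading of "safely" in the note.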

Test Naming Convention

Tests use test_lt_* naming to mirror the package API surface and make workflows discoverable.

Contributing

Issues and pull requests are welcome.

License

This project is licensed under the MIT License. See LICENSE.

Maintainers

Sam Jones (samuelcaronnajones)
Abhishek Arora (econabhishek)
Yiyang Chen (oooyiyangc)


Release history

0.1.18 (this version): Feb 15, 2026
0.1.17: Jan 24, 2025
0.1.16: Jan 22, 2025
0.1.15: May 29, 2024
0.1.14: Apr 5, 2024
0.1.13: Jan 22, 2024
0.1.12: Nov 28, 2023
0.1.11: Oct 21, 2023
0.1.10: Sep 26, 2023
0.1.9: Sep 25, 2023
0.1.8: Sep 22, 2023
0.1.7: Aug 30, 2023
0.1.6: Aug 10, 2023
0.1.5: Aug 10, 2023
0.1.3: Aug 5, 2023
0.1.2: Aug 4, 2023
0.1.1: Aug 4, 2023
0.1.0: Aug 1, 2023

Download files

Source Distribution

linktransformer-0.1.18.tar.gz (2.8 MB)

Uploaded Feb 15, 2026 Source

Built Distribution

linktransformer-0.1.18-py3-none-any.whl (2.8 MB)

Uploaded Feb 15, 2026 Python 3

File details

Details for the file linktransformer-0.1.18.tar.gz.

File metadata

Download URL: linktransformer-0.1.18.tar.gz
Upload date: Feb 15, 2026
Size: 2.8 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.7

File hashes

Hashes for linktransformer-0.1.18.tar.gz
Algorithm	Hash digest
SHA256	`71eee6bad83ff840db15634e7a9e7470bd6d6eb12d7f26711fb0b2f5dc32bc2f`
MD5	`21216b650ca678c46bc67bb7d22f3ddd`
BLAKE2b-256	`dfd19e1be1a47463ecc5c3261cde77fb03f38b0ea2c9691dfef7ded1b9c55875`

File details

Details for the file linktransformer-0.1.18-py3-none-any.whl.

File metadata

Download URL: linktransformer-0.1.18-py3-none-any.whl
Upload date: Feb 15, 2026
Size: 2.8 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.7

File hashes

Hashes for linktransformer-0.1.18-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1fc70cad55b87c6b9072cbd38fe6d6124f2bf268fd6569561a2205f1be2caaaf`
MD5	`c7691dc77284b57ce472788e7a5be82a`
BLAKE2b-256	`54a9b0eeccb4059ee717884dd019317eb9a6de7ec171f690310ef7ded5374e33`
