A friendly way to link, aggregate, cluster, and de-duplicate dataframes using large language models.
LinkTransformer
LinkTransformer is a Python package for semantic record linkage, candidate retrieval, row transformation, clustering, and text classification over tabular data.
- Paper: https://arxiv.org/abs/2309.00789
- Website: https://linktransformer.github.io/
- Demo video: https://www.youtube.com/watch?v=Sn47nmCvV9M
Tutorials
- Link records with LinkTransformer: https://colab.research.google.com/drive/1OqUB8sqpUvrnC8oa_1RoOUzV6DaAKL4N?usp=sharing
- Train your own LinkTransformer model: https://colab.research.google.com/drive/1tHitPGjMMI2Nvh4wwA8rdcbYfbLaJDvg?usp=sharing
- Classify text with LinkTransformer: https://colab.research.google.com/drive/1hSh_p8j7LP2RfdtxrPslOfnogC_CbYw5?usp=sharing
- Demo app (Hugging Face Space): https://huggingface.co/spaces/96abhishekarora/linktransformer_merge
- Feature deck: https://www.dropbox.com/scl/fi/dquxru8bndlyf9na14cw6/A-python-package-to-do-easy-record-linkage-using-Transformer-models.pdf?rlkey=fiv7j6c0vgl901y940054eptk&dl=0
More tutorials are coming soon.
Installation
pip install linktransformer
Quick Start
import os
import pandas as pd
import linktransformer as lt
left_df = pd.DataFrame({"CompanyName": ["Tech Corporation"], "Country": ["USA"]})
right_df = pd.DataFrame({"CompanyName": ["Tech Corp"], "Country": ["USA"]})
out = lt.merge(
    left_df,
    right_df,
    on=["CompanyName", "Country"],
    model="sentence-transformers/all-MiniLM-L6-v2",
)
print(out[["CompanyName_x", "CompanyName_y", "score"]])
New release: merge_k_judge (End-to-End Record Linkage)
merge_k_judge is the recommended end-to-end linkage API when you want both retrieval and LLM adjudication with confidence.
- Retrieve top-k candidates with embeddings (merge_knn)
- Judge each candidate pair with an LLM
- Return match decisions and confidence scores
judged = lt.merge_k_judge(
    df1=left_df,
    df2=right_df,
    on=["CompanyName", "Country"],
    k=5,
    knn_sbert_model="sentence-transformers/all-MiniLM-L6-v2",
    judge_llm_model="gpt-4o-mini",
    llm_provider="openai",
    openai_key=os.getenv("OPENAI_API_KEY"),
)
# key output columns:
# - score (retrieval similarity)
# - is_match (bool)
# - confidence (float in [0, 1] when available)
You can also combine providers (for example OpenAI embeddings retrieval + Gemini judge) by setting knn_api_model, judge_llm_model, and llm_provider explicitly.
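To make the judging step concrete, here is a minimal sketch of attaching a match decision and a confidence value to already-retrieved candidate pairs. It is not LinkTransformer's implementation: `stub_judge`, `judge_candidates`, the 0.6 cutoff, and the toy candidate triples are all hypothetical, and plain string similarity from the standard library stands in for the LLM call.

```python
from difflib import SequenceMatcher


def stub_judge(left: str, right: str) -> tuple[bool, float]:
    """Hypothetical stand-in for an LLM judge.

    Uses character-level similarity as the confidence value; a real judge
    would prompt a chat model instead. The 0.6 cutoff is an assumption.
    """
    confidence = SequenceMatcher(None, left.lower(), right.lower()).ratio()
    return confidence >= 0.6, confidence


def judge_candidates(candidates, judge=stub_judge):
    """Attach is_match and confidence to each (left, right, score) triple."""
    judged = []
    for left, right, score in candidates:
        is_match, confidence = judge(left, right)
        judged.append(
            {
                "left": left,
                "right": right,
                "score": score,  # retrieval similarity, passed through unchanged
                "is_match": is_match,
                "confidence": confidence,
            }
        )
    return judged


# Toy candidates; the retrieval scores are made up for illustration.
candidates = [
    ("Tech Corporation", "Tech Corp", 0.91),
    ("Tech Corporation", "Green Farms Ltd", 0.22),
]
judged_rows = judge_candidates(candidates)
matches = [row for row in judged_rows if row["is_match"]]
print(matches)
```

Keeping the retrieval score alongside the judge's output mirrors the output columns listed above, so both signals stay available for downstream filtering.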
Core APIs
1) Link two dataframes
- lt.merge(...): semantic 1:1 / 1:m / m:1 linkage.
- lt.merge_knn(...): top-k candidate retrieval.
- lt.merge_blocking(...): run the semantic merge within blocks, so that fuzzy matching happens only among rows that match exactly on the blocking columns.
- lt.aggregate_rows(...): map fine-grained rows to coarser labels.
matches = lt.merge_knn(
    left_df,
    right_df,
    on=["CompanyName", "Country"],
    model="sentence-transformers/all-MiniLM-L6-v2",
    k=3,
)
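For intuition about top-k candidate retrieval, the following sketch ranks right-hand vectors by cosine similarity to each left-hand vector and keeps the best k. It is a conceptual illustration only, not how merge_knn is implemented: the helper names `cosine` and `top_k` and the three-dimensional toy vectors are assumptions, and real embeddings would come from the chosen model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vecs, corpus_vecs, k):
    """For each query vector, return the k (corpus_index, similarity) pairs with highest similarity."""
    results = []
    for q in query_vecs:
        scored = [(idx, cosine(q, c)) for idx, c in enumerate(corpus_vecs)]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        results.append(scored[:k])
    return results


# Toy vectors standing in for sentence embeddings.
left_vecs = [[1.0, 0.0, 0.0]]
right_vecs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]

neighbours = top_k(left_vecs, right_vecs, k=2)
print(neighbours[0])
```

This brute-force version compares every pair, which is fine for small tables; larger inputs usually call for an approximate nearest-neighbour index.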
2) Transform rows with LLM prompts
Use lt.transform_rows(...) to normalize, rewrite, or standardize values in one or more columns, for example to fix OCR errors in a column or to standardize names.
cleaned = lt.transform_rows(
    left_df,
    on=["CompanyName", "Country"],
    model="gpt-4o-mini",
    openai_key=os.getenv("OPENAI_API_KEY"),
    openai_prompt=(
        "Standardize organization names and country strings for record linkage. "
        "Return a JSON list in the same order."
    ),
)
# adds: transformed_CompanyName-Country
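The prompt above asks the model to return a JSON list in the same order as the input. If you write your own prompts, it can help to see why that constraint matters: the reply can only be re-attached to the rows if it parses and has one item per row. The sketch below shows one way such a check could look. `align_llm_reply` and the sample reply are hypothetical and do not describe what transform_rows does internally.

```python
import json


def align_llm_reply(values, reply_text):
    """Pair each input value with the item at the same position in a JSON-list reply.

    Raises ValueError if the reply is not a list or its length differs from the input,
    since a positional alignment would then be unreliable.
    """
    parsed = json.loads(reply_text)
    if not isinstance(parsed, list):
        raise ValueError("expected a JSON list from the model")
    if len(parsed) != len(values):
        raise ValueError(
            f"expected {len(values)} items, got {len(parsed)}; cannot align by position"
        )
    return list(zip(values, parsed))


# Made-up inputs and a made-up model reply, for illustration only.
values = ["TECH C0RP0RATION", "tech corp."]
reply = '["Tech Corporation", "Tech Corp"]'

aligned = align_llm_reply(values, reply)
print(aligned)
```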
3) Cluster and deduplicate
- lt.cluster_rows(...): cluster semantically similar rows.
- lt.dedup_rows(...): cluster + keep representative rows.
deduped = lt.dedup_rows(
    left_df,
    on="CompanyName",
    model="sentence-transformers/all-MiniLM-L6-v2",
    cluster_type="agglomerative",
    cluster_params={"threshold": 0.7},
)
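As a rough illustration of "cluster, then keep one representative per cluster", the sketch below performs single-linkage agglomerative clustering with a distance cutoff, which is equivalent to taking connected components among pairs closer than the cutoff. Everything here is an assumption made for illustration: the library clusters on embeddings and its linkage criterion and the exact meaning of its `threshold` parameter are not stated above, whereas this sketch uses string distance from the standard library, treats the threshold as a maximum distance, and keeps the first row of each cluster.

```python
from difflib import SequenceMatcher


def distance(a: str, b: str) -> float:
    """String distance in [0, 1]; a stand-in for embedding distance."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_single_linkage(items, threshold):
    """Label each item with the smallest index in its cluster.

    Two items end up in the same cluster if a chain of pairs, each within
    `threshold` distance, connects them (single linkage cut at `threshold`).
    """
    parent = list(range(len(items)))

    def find(i):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if distance(items[i], items[j]) <= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Attach the larger root to the smaller so labels are stable.
                    parent[max(ri, rj)] = min(ri, rj)
    return [find(i) for i in range(len(items))]


names = ["Tech Corporation", "Tech Corp", "Green Farms Ltd", "Green Farms Limited"]
labels = cluster_single_linkage(names, threshold=0.4)

# Keep the first row of each cluster as its representative.
representatives = [name for i, name in enumerate(names) if labels[i] == i]
print(labels, representatives)
```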
4) Evaluate matched pairs
- lt.evaluate_pairs(...): similarity over known pairs.
- lt.all_pair_combos_evaluate(...): dense pairwise scoring.
5) Classification
- lt.classify_rows(...): classify rows with HF or OpenAI chat models.
- lt.train_clf_model(...): train a custom row classifier.
6) Train linkage models
lt.train_model(...): train a linkage model from paired or clustered data.
Provider Notes
- OpenAI key: set OPENAI_API_KEY or pass openai_key.
- Gemini key: set GEMINI_API_KEY or pass gemini_key.
- API embedding models and local SBERT models are both supported.
- For multi-column API retrieval, LinkTransformer serializes columns safely using <SEP>.
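To show what joining several columns into one string with a `<SEP>` delimiter can look like, here is a minimal sketch. The helper `serialize_row` and its handling of missing values and whitespace are assumptions for illustration; the library's actual serialization may differ in those details.

```python
SEP = "<SEP>"


def serialize_row(row, columns, sep=SEP):
    """Join the selected column values of one row into a single string.

    Missing values become empty strings and surrounding whitespace is stripped;
    both choices are assumptions of this sketch.
    """
    parts = []
    for col in columns:
        value = row.get(col)
        parts.append("" if value is None else str(value).strip())
    return sep.join(parts)


row = {"CompanyName": "Tech Corp ", "Country": "USA", "Notes": None}
text = serialize_row(row, ["CompanyName", "Country"])
print(text)
```

An explicit delimiter keeps column boundaries visible to the embedding model, so that "Tech Corp" plus "USA" is not confused with a single value such as "Tech Corp USA".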
Test Naming Convention
Tests use test_lt_* naming to mirror the package API surface and make workflows discoverable.
Contributing
Issues and pull requests are welcome.
License
This project is licensed under the MIT License. See LICENSE.
Maintainers
- Sam Jones (samuelcaronnajones)
- Abhishek Arora (econabhishek)
- Yiyang Chen (oooyiyangc)
File details
Details for the file linktransformer-0.1.18.tar.gz.
File metadata
- Download URL: linktransformer-0.1.18.tar.gz
- Upload date:
- Size: 2.8 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 71eee6bad83ff840db15634e7a9e7470bd6d6eb12d7f26711fb0b2f5dc32bc2f |
| MD5 | 21216b650ca678c46bc67bb7d22f3ddd |
| BLAKE2b-256 | dfd19e1be1a47463ecc5c3261cde77fb03f38b0ea2c9691dfef7ded1b9c55875 |
File details
Details for the file linktransformer-0.1.18-py3-none-any.whl.
File metadata
- Download URL: linktransformer-0.1.18-py3-none-any.whl
- Upload date:
- Size: 2.8 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1fc70cad55b87c6b9072cbd38fe6d6124f2bf268fd6569561a2205f1be2caaaf |
| MD5 | c7691dc77284b57ce472788e7a5be82a |
| BLAKE2b-256 | 54a9b0eeccb4059ee717884dd019317eb9a6de7ec171f690310ef7ded5374e33 |