Skip to main content

Fuzzy join DataFrames using LLM scoring and embedding retrieval

Project description

llm-join

README is a work in progress — more examples and documentation coming soon.

The pandas join that understands what your data means.

pd.merge joins on exact values. llm-join joins on meaning — using embeddings to find candidates and an LLM you already have to decide if they match.

from llm_join import fuzzy_join

result = fuzzy_join(
    df1, df2,
    left_on="vendor", right_on="supplier_name",
    llm=my_llm, embed_fn=my_embed,
    context="company names",
    llm_concurrency=10,
)

Table of Contents


The Problem

You have two DataFrames. Same data, different text:

Your system Their system
Goldman Sachs & Co. The Goldman Sachs Group Inc
Sony WH-1000XM5 SONY-WH1000XM5-BLK
MSFT Q4 license renewal Microsoft Enterprise Agreement Q4-2024
Python programming Python (language)

pd.merge returns nothing. Fuzzy string matching gets the wrong answer. You end up writing custom logic — or doing it by hand.

llm-join solves this in one function call.


Why llm-join

vs. pd.merge

Exact string match only. Fails on any variation in naming.

vs. fuzzy string matching (fuzzywuzzy, rapidfuzz)

Character similarity, not meaning. "iPhone 14 Pro" vs "iPhone 14 Pro Max" scores high — but they are different products. "CABLE-USBC-200CM-BLK" vs "USB-C charging cable 2m black" scores near zero — even though they are the same item.

vs. embedding similarity alone

Fast and cheap, but no reasoning. Can't explain why two values match or catch false positives confidently.

llm-join

Embeddings narrow down candidates (fast, cheap). LLM makes the final call with your context (accurate). You get the best of both.


Real-World Use Cases

Domain Left table Right table Problem
Supply chain Buyer product description Supplier SKU Match products across 50+ vendor catalogs
Finance Expense report payee GL account / vendor master Reconcile transactions automatically
Legal / M&A Contract party name Corporate registry Identify true legal entity
Compliance Customer name OFAC sanctions list Sanctions screening at scale
Retail / e-commerce Marketplace product listing Master product catalog Deduplicate listings across 50+ sellers
Logistics Shipment description Harmonized tariff code Auto-classify goods at customs
Research Author name Citation database Disambiguate authors
Government Vendor name Tax registry Consolidate procurement spend
Real estate Raw address input Property records DB Standardize and match addresses
Pharma HR Medical Science Liaison roster (Veeva CRM) Employee master (Workday) Reconcile field medical staff assignments across CRM and HR for compliance audits — same person recorded as "Dr. Sarah J. Connor, MSL" in one system and "Connor, S." in the other

Install

pip install llm-join

Or from source:

git clone https://github.com/adityabalki/llm-join.git
cd llm-join
pip install -e .

Quick Start

import pandas as pd
import numpy as np
import openai
from llm_join import fuzzy_join

df1 = pd.DataFrame({
    "vendor": ["Goldman Sachs & Co.", "Amazon Web Services", "Microsoft Corp"],
    "spend": [1_200_000, 890_000, 340_000]
})

df2 = pd.DataFrame({
    "supplier_name": ["The Goldman Sachs Group Inc", "Amazon.com Inc.", "Microsoft Corporation"],
    "category": ["Finance", "Cloud", "Software"]
})

client = openai.OpenAI()

def my_llm(prompt):
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content

def my_embed(texts):
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in response.data], dtype="float32")

result = fuzzy_join(
    df1, df2,
    left_on="vendor",
    right_on="supplier_name",
    llm=my_llm,
    embed_fn=my_embed,
    context="company names — match legal entity variants and abbreviations",
    llm_concurrency=10,   # how many LLM calls to run in parallel
    how="inner",
)

print(result)
vendor spend supplier_name category
Goldman Sachs & Co. 1,200,000 The Goldman Sachs Group Inc Finance
Amazon Web Services 890,000 Amazon.com Inc. Cloud
Microsoft Corp 340,000 Microsoft Corporation Software

How It Works

Example: A retailer's purchase orders use plain English product names. Their supplier sends a catalog with SKU codes. pd.merge matches nothing. llm-join bridges the gap.

orders_df = pd.DataFrame({"product_name": [
    "USB-C charging cable 2m black",
    "ergonomic mesh office chair",
    "27-inch 4K monitor",
], "qty": [500, 30, 12]})

catalog_df = pd.DataFrame({"sku": [
    "CABLE-USBC-200CM-BLK",
    "CABLE-USBA-200CM-BLK",
    "CHAIR-MESH-ERG-ADJUSTABLE",
    "CHAIR-TASK-FIXED-BLK",
    "MON-27-4K-IPS-HDMI2",
], "unit_price": [8.99, 6.49, 349.00, 189.00, 429.00]})

result = fuzzy_join(
    orders_df, catalog_df,
    left_on="product_name", right_on="sku",
    llm=my_llm, embed_fn=my_embed,
    context="procurement — match buyer product descriptions to supplier SKU codes",
    top_k=3, llm_threshold=0.7,
    llm_concurrency=10,
)

Step 1 — Embed both columns

Every value gets converted to a vector. No API call — pure math, runs in milliseconds.

Value Meaning captured
"USB-C charging cable 2m black" cable / USB-C / length / color
"ergonomic mesh office chair" seating / ergonomic / mesh
"27-inch 4K monitor" display / size / resolution
"CABLE-USBC-200CM-BLK" cable / USB-C / 200cm / black
"CHAIR-MESH-ERG-ADJUSTABLE" seating / mesh / ergonomic
"MON-27-4K-IPS-HDMI2" display / 27in / 4K

Step 2 — FAISS retrieves top-K candidates per row (no LLM)

For each left row, FAISS finds the top_k closest right vectors. Everything else is eliminated — no LLM call needed.

Query: "USB-C charging cable 2m black" → top_k=3

Rank Candidate Embed Score Reaches LLM?
0 CABLE-USBC-200CM-BLK 0.89 yes
1 CABLE-USBA-200CM-BLK 0.71 yes
2 CHAIR-TASK-FIXED-BLK 0.34 yes (shares "BLK")
CHAIR-MESH-ERG-ADJUSTABLE 0.11 eliminated
MON-27-4K-IPS-HDMI2 0.08 eliminated

Result: 3 LLM calls instead of 3 × 5 = 15 pair-by-pair calls.

Step 3 — One LLM call per left row scores all candidates

All top-K candidates go into a single prompt. The LLM scores each one and returns JSON — one API call per row, all candidates scored together.

Prompt sent for "USB-C charging cable 2m black":

Context: procurement — match buyer product descriptions to supplier SKU codes

LEFT: "USB-C charging cable 2m black"

Score each candidate (0.0–1.0):
0. CABLE-USBC-200CM-BLK
1. CABLE-USBA-200CM-BLK
2. CHAIR-TASK-FIXED-BLK

LLM response:

[
  {"index": 0, "score": 0.97, "reasoning": "USBC = USB-C, 200CM = 2m, BLK = black. Exact match on all three specs."},
  {"index": 1, "score": 0.38, "reasoning": "Correct length and color but USBA is USB-A, not USB-C. Wrong connector type."},
  {"index": 2, "score": 0.04, "reasoning": "This is a chair SKU. No relation to a cable."}
]

Apply llm_threshold=0.7:

Candidate LLM Score Decision
CABLE-USBC-200CM-BLK 0.97 best match — joined
CABLE-USBA-200CM-BLK 0.38 below llm_threshold
CHAIR-TASK-FIXED-BLK 0.04 below llm_threshold

Step 4 — Merge matched rows

print(result)
product_name qty sku unit_price
USB-C charging cable 2m black 500 CABLE-USBC-200CM-BLK 8.99
ergonomic mesh office chair 30 CHAIR-MESH-ERG-ADJUSTABLE 349.00
27-inch 4K monitor 12 MON-27-4K-IPS-HDMI2 429.00

The LLM correctly rejected CABLE-USBA-200CM-BLK (wrong connector) even though the embedding scored it 0.71. That's where the LLM earns its cost.


Cost & Scale

If you sent every possible pair to the LLM:

Left rows Right rows Pairs to score Cost (gpt-4o-mini)
1,000 10,000 10,000,000 ~$150
10,000 100,000 1,000,000,000 ~$15,000
100,000 1,000,000 100,000,000,000 not feasible

llm-join avoids this with a two-stage pipeline.

Stage 1 — Embeddings narrow the search (cheap)

FAISS finds the top-K most similar candidates per row. Pure math, no API calls, runs in milliseconds.

Input pairs 10,000 × 100,000 = 1,000,000,000
After FAISS (top_k=5) 50,000 candidate pairs
Eliminated for free 99.995% of all pairs

Stage 2 — LLM scores only the shortlist (accurate)

Your LLM sees a small batch of plausible candidates — not the full cross product. One call per left row, all candidates scored in that call.

Setup LLM calls Estimated cost
10k × 100k, top_k=5 50,000 ~$0.75
10k × 100k, top_k=3 30,000 ~$0.45
10k × 100k, embed_skip_threshold=0.95 ~5,000 ~$0.08

Cost controls

result = fuzzy_join(
    df1, df2,
    left_on="vendor", right_on="supplier",
    llm=my_llm, embed_fn=my_embed,
    context="...",
    llm_concurrency=10,
    top_k=3,                    # fewer candidates = fewer LLM tokens per row
    embed_skip_threshold=0.95,  # skip LLM entirely if embed score is high enough
    max_llm_calls=1000,         # hard cap — warns and returns partial result if hit
)

Performance

llm_concurrency is a required parameter. You choose how many LLM calls run in parallel based on your API rate limit.

# Sequential — use for debugging or if you have strict rate limits
fuzzy_join(..., llm_concurrency=1)

# Good starting point for most APIs
fuzzy_join(..., llm_concurrency=10)

# High throughput — check your rate limits first
fuzzy_join(..., llm_concurrency=50)
API Suggested concurrency
OpenAI (tier 2+) 20–50
Azure OpenAI 10–30
Anthropic 5–20
Ollama (local) as high as your machine allows

How it works under the hood:

  • Sync LLM function (def my_llm) → uses ThreadPoolExecutor
  • Async LLM function (async def my_llm) → uses asyncio with a semaphore

Both paths have automatic embed fallback — if an LLM call fails after all retries, the top embedding match is used instead.


Usage

Basic join

result = fuzzy_join(
    orders_df, catalog_df,
    left_on="product_name",
    right_on="sku",
    llm=my_llm,
    embed_fn=my_embed,
    context="procurement — match buyer product descriptions to supplier SKU codes",
    llm_concurrency=10,
)

With domain context

Give the LLM more detail about each column to improve accuracy.

result = fuzzy_join(
    orders_df, catalog_df,
    left_on="product_name",
    right_on="sku",
    llm=my_llm,
    embed_fn=my_embed,
    context="procurement — match buyer product descriptions to supplier SKU codes",
    column_context={
        "product_name": "plain English product description written by a buyer",
        "sku": "supplier stock-keeping unit code, typically uppercase with hyphens",
    },
    llm_concurrency=10,
)

See why matches were made

result = fuzzy_join(
    df1, df2,
    left_on="vendor",
    right_on="supplier_name",
    llm=my_llm,
    embed_fn=my_embed,
    context="company names — match legal entity variants",
    llm_concurrency=10,
    return_reasoning=True,
)

print(result[["vendor", "supplier_name", "_llm_score", "_llm_reasoning", "_match_method", "_llm_candidates"]])
vendor supplier_name _llm_score _llm_reasoning _match_method _llm_candidates
Goldman Sachs & Co. The Goldman Sachs Group Inc 0.97 same firm, legal name variant llm [{"candidate": "The Goldman Sachs...", "embed_score": 0.94}, ...]

_llm_candidates shows exactly which candidates were sent to the LLM and their embedding scores — useful for tuning top_k and threshold.

Control cost

result = fuzzy_join(
    df1, df2,
    left_on="vendor",
    right_on="supplier_name",
    llm=my_llm,
    embed_fn=my_embed,
    context="company names — match legal entity variants",
    llm_concurrency=10,
    embed_skip_threshold=0.95,  # skip LLM if embedding match is strong enough
    max_llm_calls=500,          # hard cap — warns and returns partial result if hit
    top_k=3,               # fewer candidates = fewer LLM tokens
)

Left join (audit unmatched rows)

how="left" keeps all left rows. Unmatched ones get NaN in the right columns — useful for finding what failed to match.

result = fuzzy_join(
    df1, df2,
    left_on="vendor", right_on="supplier_name",
    llm=my_llm, embed_fn=my_embed,
    context="company names — match legal entity variants",
    llm_concurrency=10,
    how="left",
)

unmatched = result[result["supplier_name"].isna()]

how="outer" is useful for reconciliation — unmatched left rows have no match above threshold; unmatched right rows were never selected as a best match.

Multi-column join key

Pass a list to left_on or right_on — values are concatenated automatically. Works for any number of columns.

# Match on product name + category combined
result = fuzzy_join(
    orders_df, catalog_df,
    left_on=["product_name", "category"],
    right_on="sku_description",
    llm=my_llm,
    embed_fn=my_embed,
    context="procurement — match buyer product + category to supplier SKU description",
    llm_concurrency=10,
)

Match all results

By default, only the best-scoring match above threshold is returned. Set match_all=True when one left value legitimately maps to multiple right values.

# "aspirin" should match all its known names
result = fuzzy_join(
    drugs_df, synonyms_df,
    left_on="generic_name", right_on="name",
    llm=my_llm, embed_fn=my_embed,
    context="pharmaceutical drug names — match generics to all known synonyms",
    llm_concurrency=10,
    match_all=True,
    top_k=10,
)
generic_name name _llm_score
aspirin acetylsalicylic acid 0.98
aspirin Bayer Aspirin Tablet 0.95
aspirin aspirin 100mg tablet 0.91

Chaining multiple joins

Each fuzzy_join returns a regular DataFrame — pipe them like pd.merge.

# Step 1 — match vendors
step1 = fuzzy_join(
    df1, df2,
    left_on="vendor", right_on="supplier_name",
    llm=my_llm, embed_fn=my_embed,
    context="company names — match legal entity variants",
    llm_concurrency=10,
    how="left", return_reasoning=True,
)

# Step 2 — match products on result of step 1
result = fuzzy_join(
    step1, df3,
    left_on="product", right_on="catalog_item",
    llm=my_llm, embed_fn=my_embed,
    context="product names — match internal descriptions to catalog items",
    llm_concurrency=10,
    how="left", return_reasoning=True,
)

Works with Any LLM

Pass any callable that takes a prompt string and returns a string.

# OpenAI
import openai
client = openai.OpenAI()
def my_llm(prompt):
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content

# Anthropic
import anthropic
client = anthropic.Anthropic()
def my_llm(prompt):
    return client.messages.create(
        model="claude-opus-4-5", max_tokens=512,
        messages=[{"role": "user", "content": prompt}]
    ).content[0].text

# Google Gemini
import google.generativeai as genai
model = genai.GenerativeModel("gemini-2.0-flash")
def my_llm(prompt):
    return model.generate_content(prompt).text

# Ollama (local, free)
import ollama
def my_llm(prompt):
    return ollama.chat(
        model="llama3.2", messages=[{"role": "user", "content": prompt}]
    )["message"]["content"]

# Async LLM (uses asyncio path automatically)
async def my_llm(prompt):
    response = await async_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

Works with Any Embedding Function

Pass any callable (list[str]) -> np.ndarray (shape [n, dim], dtype float32).

import numpy as np

# OpenAI
import openai
client = openai.OpenAI()
def my_embed(texts):
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in response.data], dtype="float32")

# Cohere
import cohere
co = cohere.Client("YOUR_KEY")
def my_embed(texts):
    response = co.embed(texts=texts, model="embed-english-v3.0", input_type="search_document")
    return np.array(response.embeddings, dtype="float32")

# sentence-transformers (local, no API key needed)
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
def my_embed(texts):
    return model.encode(texts, convert_to_numpy=True).astype("float32")

Features

  • Semantic matching — joins on meaning, not character similarity
  • Two-stage pipeline — embeddings cut 99%+ of candidates; LLM scores only the plausible ones
  • Domain contextcontext and column_context tell the LLM what the columns mean, improving accuracy
  • Bring your own LLM — any callable (str) -> str; sync or async
  • Bring your own embeddings — any callable (list[str]) -> np.ndarray
  • Full join semanticsinner, left, right, outer — same as pd.merge
  • Multi-column join keys — pass a list to left_on / right_on; any number of columns
  • Tie handling — when candidates score equally, all are returned (correct SQL join behavior)
  • Reasoning outputreturn_reasoning=True adds _llm_score, _llm_reasoning, _embed_rank, _match_method, _llm_candidates
  • Debug candidates_llm_candidates shows exactly what was sent to the LLM with embed scores, for tuning
  • Embed skip shortcutembed_skip_threshold skips LLM when embed similarity is high enough (default 1.0 = only exact matches skip)
  • Embed fallback — if LLM fails all retries, top embed candidate is used automatically
  • Cost controlstop_k, embed_skip_threshold, max_llm_calls
  • Multi-match modematch_all=True returns all candidates above threshold
  • Parallel LLM callsllm_concurrency runs multiple LLM calls at once
  • Retry with backoff — failed LLM calls retry with exponential backoff (1s, 2s, 4s…)
  • MIT license

Parameters

Parameter Default Description
left_on required Column name(s) in df1. Pass a list for multi-column keys — any number of columns supported.
right_on required Column name(s) in df2. Pass a list for multi-column keys — any number of columns supported.
llm required Your LLM function. Sync: (str) -> str. Async: async (str) -> str.
embed_fn required Your embedding function. (list[str]) -> np.ndarray of shape [n, dim], dtype float32.
context required Describe what the columns represent and what kind of match to make. Injected into every LLM prompt.
llm_concurrency required How many LLM calls to run in parallel. 1 = sequential. Start with 10 and adjust based on your API rate limit.
column_context {} Per-column descriptions {"col": "description"} — adds extra detail to the prompt beyond context.
top_k 5 How many embedding candidates to retrieve per left row before LLM scoring.
llm_threshold 0.7 Minimum LLM score (0–1) to accept a match. Rows where LLM scores below this are not joined.
how "inner" Join type: inner / left / right / outer — same as pd.merge.
embed_skip_threshold 1.0 Skip LLM when top embed similarity is at or above this value. Default 1.0 means only identical vectors (exact same text) skip LLM. Lower it (e.g. 0.92) to skip LLM for near-identical matches and save cost.
max_llm_calls None Hard cap on total LLM calls. Emits a warning and returns a partial result if hit.
max_retries 3 How many times to retry a failed LLM call (exponential backoff). Set 0 to disable. Falls back to top embed candidate on total failure.
batch_size 32 Batch size for embed_fn calls.
match_all False Return all candidates above threshold, not just the best. Use when one left value maps to multiple right values.
return_reasoning False Add debug columns: _llm_score, _llm_reasoning, _embed_rank, _match_method, _llm_candidates.

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llm_join-0.4.0.tar.gz (15.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

llm_join-0.4.0-py3-none-any.whl (16.9 kB view details)

Uploaded Python 3

File details

Details for the file llm_join-0.4.0.tar.gz.

File metadata

  • Download URL: llm_join-0.4.0.tar.gz
  • Upload date:
  • Size: 15.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llm_join-0.4.0.tar.gz
Algorithm Hash digest
SHA256 9aff8078f80da67daaee86c15d3c0c1c40054ecb26c4b558134e8df007aac576
MD5 00aec5c4f44ea9cbb41817b7e5d789ff
BLAKE2b-256 6cfd44b3225ceb0f5c44cc2fd67c63e84ad237db62784711e195ccb69f3e8d1b

See more details on using hashes here.

Provenance

The following attestation bundles were made for llm_join-0.4.0.tar.gz:

Publisher: ci.yml on adityabalki/llm-join

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file llm_join-0.4.0-py3-none-any.whl.

File metadata

  • Download URL: llm_join-0.4.0-py3-none-any.whl
  • Upload date:
  • Size: 16.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llm_join-0.4.0-py3-none-any.whl
Algorithm Hash digest
SHA256 18bca272991d99855ade0d20bc1cc5314af249fa3fa5693e242ecfdfcb9f35b8
MD5 eadeb8bd3a2299e15d4027430524dff9
BLAKE2b-256 20199744f950870a4d2f2b68509d3a881b54a211bd158269073a2dc0f9b96b7d

See more details on using hashes here.

Provenance

The following attestation bundles were made for llm_join-0.4.0-py3-none-any.whl:

Publisher: ci.yml on adityabalki/llm-join

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page