Skip to main content

Fuzzy join DataFrames using LLM scoring and embedding retrieval

Project description

llm-join

The pandas join that understands what your data means.

pd.merge joins on exact values. llm-join joins on meaning — using embeddings to find candidates and an LLM you already have to decide if they match.

from llm_join import fuzzy_join

result = fuzzy_join(df1, df2, left_on="vendor", right_on="supplier_name", llm=my_llm, embed_fn=my_embed, context="...")

Table of Contents


The Problem

You have two DataFrames. Same data, different text:

Your system Their system
Goldman Sachs & Co. The Goldman Sachs Group Inc
Sony WH-1000XM5 SONY-WH1000XM5-BLK
MSFT Q4 license renewal Microsoft Enterprise Agreement Q4-2024
Python programming Python (language)

pd.merge returns nothing. Fuzzy string matching gets the wrong answer. You end up writing custom logic — or doing it by hand.

llm-join solves this in one line.


Why llm-join

vs. pd.merge

Exact string match only. Fails on any variation in naming.

vs. fuzzy string matching (fuzzywuzzy, rapidfuzz)

Character similarity, not semantic meaning. "iPhone 14 Pro" vs "iPhone 14 Pro Max" scores high — but they are different products. "CABLE-USBC-200CM-BLK" vs "USB-C charging cable 2m black" scores near zero — even though they are the same item.

vs. embedding similarity alone

Fast and cheap, but no reasoning. Can't explain why two values match or catch false positives confidently.

llm-join

Embeddings narrow down candidates (fast, cheap). LLM makes the final call with context (accurate). You get the best of both.


Real-World Use Cases

Domain Left table Right table Problem
Supply chain Buyer product description Supplier SKU Match products across 50+ vendor catalogs
Finance Expense report payee GL account / vendor master Reconcile transactions automatically
Legal / M&A Contract party name Corporate registry Identify true legal entity
Compliance Customer name OFAC sanctions list Sanctions screening at scale
Retail / e-commerce Marketplace product listing Master product catalog Deduplicate listings across 50+ sellers
Logistics Shipment description Harmonized tariff code Auto-classify goods at customs
Research Author name Citation database Disambiguate authors
Government Vendor name Tax registry Consolidate procurement spend
Real estate Raw address input Property records DB Standardize and match addresses

Install

pip install llm-join
# Or install from source
git clone https://github.com/adityabalki/llm-join.git
cd llm-join
pip install -e .

# Or install from wheel (air-gapped machines)
pip install llm_join-0.2.0-py3-none-any.whl

Quick Start

import pandas as pd
import numpy as np
import openai
from llm_join import fuzzy_join

df1 = pd.DataFrame({
    "vendor": ["Goldman Sachs & Co.", "Amazon Web Services", "Microsoft Corp"],
    "spend": [1_200_000, 890_000, 340_000]
})

df2 = pd.DataFrame({
    "supplier_name": ["The Goldman Sachs Group Inc", "Amazon.com Inc.", "Microsoft Corporation"],
    "category": ["Finance", "Cloud", "Software"]
})

client = openai.OpenAI()

def my_llm(prompt):
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content

def my_embed(texts):
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in response.data], dtype="float32")

result = fuzzy_join(
    df1, df2,
    left_on="vendor",
    right_on="supplier_name",
    llm=my_llm,
    embed_fn=my_embed,
    context="company names — match legal entity variants and abbreviations",
    how="inner",   # "inner" | "left" | "right" | "outer"
)

print(result)
vendor spend supplier_name category
Goldman Sachs & Co. 1,200,000 The Goldman Sachs Group Inc Finance
Amazon Web Services 890,000 Amazon.com Inc. Cloud
Microsoft Corp 340,000 Microsoft Corporation Software

How It Works

Example: A retailer's internal purchase orders use plain English product names. Their supplier sends a catalog with SKU codes. pd.merge matches nothing. llm-join bridges the gap.

orders_df = pd.DataFrame({"product_name": [
    "USB-C charging cable 2m black",
    "ergonomic mesh office chair",
    "27-inch 4K monitor",
], "qty": [500, 30, 12]})

catalog_df = pd.DataFrame({"sku": [
    "CABLE-USBC-200CM-BLK",
    "CABLE-USBA-200CM-BLK",
    "CHAIR-MESH-ERG-ADJUSTABLE",
    "CHAIR-TASK-FIXED-BLK",
    "MON-27-4K-IPS-HDMI2",
], "unit_price": [8.99, 6.49, 349.00, 189.00, 429.00]})

result = fuzzy_join(
    orders_df, catalog_df,
    left_on="product_name", right_on="sku",
    llm=my_llm, embed_fn=my_embed,
    context="procurement — match buyer product descriptions to supplier SKU codes",
    top_k=3, threshold=0.7,
)

Step 1 — Embed both columns

Every value is converted to a vector. No API call — pure math, milliseconds.

Value Meaning captured in vector
"USB-C charging cable 2m black" cable / USB-C / length / color
"ergonomic mesh office chair" seating / ergonomic / mesh
"27-inch 4K monitor" display / size / resolution
"CABLE-USBC-200CM-BLK" cable / USB-C / 200cm / black
"CHAIR-MESH-ERG-ADJUSTABLE" seating / mesh / ergonomic
"MON-27-4K-IPS-HDMI2" display / 27in / 4K

Step 2 — FAISS retrieves top-K candidates (no LLM)

For each left row, faiss finds the top_k closest right vectors by cosine similarity. Everything else is eliminated — no LLM call needed.

Query: "USB-C charging cable 2m black" → top_k=3

Rank Candidate Embed Score Reaches LLM?
0 CABLE-USBC-200CM-BLK 0.89 ✓ yes
1 CABLE-USBA-200CM-BLK 0.71 ✓ yes
2 CHAIR-TASK-FIXED-BLK 0.34 ✓ yes (shares "BLK")
CHAIR-MESH-ERG-ADJUSTABLE 0.11 ✗ eliminated
MON-27-4K-IPS-HDMI2 0.08 ✗ eliminated

Query: "ergonomic mesh office chair" → top_k=3

Rank Candidate Embed Score Reaches LLM?
0 CHAIR-MESH-ERG-ADJUSTABLE 0.91 ✓ yes
1 CHAIR-TASK-FIXED-BLK 0.74 ✓ yes
2 CABLE-USBC-200CM-BLK 0.19 ✓ yes
MON-27-4K-IPS-HDMI2 0.06 ✗ eliminated
CABLE-USBA-200CM-BLK 0.14 ✗ eliminated

Query: "27-inch 4K monitor" → top_k=3

Rank Candidate Embed Score Reaches LLM?
0 MON-27-4K-IPS-HDMI2 0.94 ✓ yes
1 CABLE-USBC-200CM-BLK 0.22 ✓ yes
2 CHAIR-TASK-FIXED-BLK 0.17 ✓ yes
CHAIR-MESH-ERG-ADJUSTABLE 0.09 ✗ eliminated
CABLE-USBA-200CM-BLK 0.12 ✗ eliminated

Result: 3 LLM calls instead of 3 × 5 = 15 pair-by-pair calls.

Step 3 — One LLM call per left row scores all candidates

All top-K candidates go into a single prompt. LLM returns a JSON array — one API call, all candidates scored.

Prompt sent for "USB-C charging cable 2m black":

Context: procurement — match buyer product descriptions to supplier SKU codes

LEFT: "USB-C charging cable 2m black"

Score each candidate (0.0–1.0):
0. CABLE-USBC-200CM-BLK
1. CABLE-USBA-200CM-BLK
2. CHAIR-TASK-FIXED-BLK

LLM response:

[
  {"index": 0, "score": 0.97, "reasoning": "USBC = USB-C, 200CM = 2m, BLK = black. Exact match on all three specs."},
  {"index": 1, "score": 0.38, "reasoning": "Correct length and color but USBA is USB-A, not USB-C. Wrong connector type."},
  {"index": 2, "score": 0.04, "reasoning": "This is a chair SKU. No relation to a cable."}
]

Apply threshold=0.7:

Candidate LLM Score Decision
CABLE-USBC-200CM-BLK 0.97 best match — joined
CABLE-USBA-200CM-BLK 0.38 ✗ below threshold
CHAIR-TASK-FIXED-BLK 0.04 ✗ below threshold

Step 4 — Merge matched rows

print(result)
product_name qty sku unit_price
USB-C charging cable 2m black 500 CABLE-USBC-200CM-BLK 8.99
ergonomic mesh office chair 30 CHAIR-MESH-ERG-ADJUSTABLE 349.00
27-inch 4K monitor 12 MON-27-4K-IPS-HDMI2 429.00

The LLM correctly rejected CABLE-USBA-200CM-BLK (wrong connector) even though embedding scored it 0.71 — this is the case where LLM reasoning earns its cost.


Cost & Scale

The problem with naive LLM joins

If you sent every possible pair to the LLM:

Left rows Right rows Pairs to score Cost (gpt-4o-mini ~$0.30/1M tokens)
1,000 10,000 10,000,000 ~$150
10,000 100,000 1,000,000,000 ~$15,000
100,000 1,000,000 100,000,000,000 impossible

llm-join solves this with a two-stage pipeline.

Stage 1: Embeddings narrow the search (cheap)

Convert every value to a vector. Use faiss to find the top-K most similar candidates per row. This is pure math — no API calls, runs in milliseconds.

Input pairs 10,000 × 100,000 = 1,000,000,000
After embed + faiss (top_k=5) 50,000 candidate pairs
Eliminated for free 99.995% of all pairs

Stage 2: LLM scores only the hard cases (accurate)

Your LLM sees a small batch of plausible candidates per row — not the full cross product. It scores each candidate; the highest score above threshold wins.

Query: "USB-C charging cable 2m black"

Rank Candidate LLM Score Decision
0 CABLE-USBC-200CM-BLK 0.97 best match — joined
1 CABLE-USBC-100CM-BLK 0.71 scored, not selected (wrong length)
2 CABLE-USBA-200CM-BLK 0.38 scored, not selected (wrong connector)
3 CHAIR-TASK-FIXED-BLK 0.04 ✗ below threshold
4 MON-27-4K-IPS-HDMI2 0.01 ✗ below threshold

Real cost example

Setup LLM calls Estimated cost
10k × 100k, top_k=5 50,000 ~$0.75
10k × 100k, top_k=3 30,000 ~$0.45
10k × 100k, embed_threshold=0.95 ~5,000 (obvious matches skip LLM) ~$0.08

Further cost controls

result = fuzzy_join(
    df1, df2,
    left_on="vendor", right_on="supplier",
    llm=my_llm,
    embed_fn=my_embed,
    context="...",
    top_k=3,                # fewer candidates = fewer LLM tokens per row
    embed_threshold=0.95,   # skip LLM entirely if embedding match score > 0.95
    max_llm_calls=1000,     # hard cap — warns and returns partial result if hit
)
Parameter Effect
top_k=3 (default 5) 40% fewer LLM tokens
embed_threshold=0.95 Skip LLM for obvious matches — typically saves 30–60%
max_llm_calls=N Budget guard — never exceeds N LLM calls

Usage

Basic join

result = fuzzy_join(
    orders_df, catalog_df,
    left_on="product_name",
    right_on="sku",
    llm=my_llm,
    embed_fn=my_embed,
    context="procurement — match buyer product descriptions to supplier SKU codes",
)

With domain context

result = fuzzy_join(
    orders_df, catalog_df,
    left_on="product_name",
    right_on="sku",
    llm=my_llm,
    embed_fn=my_embed,
    context="procurement — match buyer product descriptions to supplier SKU codes",
    column_context={
        "product_name": "plain English product description written by a buyer",
        "sku": "supplier stock-keeping unit code, typically uppercase with hyphens",
    },
)

See why matches were made

result = fuzzy_join(
    df1, df2,
    left_on="vendor",
    right_on="supplier_name",
    llm=my_llm,
    embed_fn=my_embed,
    context="company names — match legal entity variants",
    return_reasoning=True,
)

print(result[["vendor", "supplier_name", "_llm_score", "_llm_reasoning", "_embed_rank", "_match_method"]])
vendor supplier_name _llm_score _llm_reasoning _embed_rank _match_method
Goldman Sachs & Co. The Goldman Sachs Group Inc 0.97 same firm, legal name variant 0 llm

Control cost

result = fuzzy_join(
    df1, df2,
    left_on="vendor",
    right_on="supplier_name",
    llm=my_llm,
    embed_fn=my_embed,
    context="company names — match legal entity variants",
    embed_threshold=0.95,   # skip LLM if embedding match is obvious
    max_llm_calls=500,      # hard cap — warns and returns partial result if hit
    top_k=3,                # fewer candidates = fewer LLM tokens
)

Left join (audit unmatched rows)

how="left" keeps all left rows — unmatched ones get NaN right columns. Useful to see what failed to match.

result = fuzzy_join(
    df1, df2,
    left_on="vendor", right_on="supplier_name",
    llm=my_llm, embed_fn=my_embed,
    context="company names — match legal entity variants",
    how="left",
)

unmatched = result[result["supplier_name"].isna()]

Note: how="outer" is useful for reconciliation — unmatched left rows are values with no match above threshold; unmatched right rows are values never selected as a best match. cross join is not supported (it would be the naive O(n×m) approach llm-join is designed to avoid).

Multi-column join key

# orders_df has separate "product_name" and "category" columns
# catalog_df has a single "sku_description" column

result = fuzzy_join(
    orders_df, catalog_df,
    left_on=["product_name", "category"],   # concatenated into one key
    right_on="sku_description",
    llm=my_llm,
    embed_fn=my_embed,
    context="procurement — match buyer product + category to supplier SKU description",
)

Chaining multiple joins

Each fuzzy_join returns a regular DataFrame — pipe them like pd.merge.

# Step 1 — match vendors
step1 = fuzzy_join(
    df1, df2,
    left_on="vendor", right_on="supplier_name",
    llm=my_llm, embed_fn=my_embed,
    context="company names — match legal entity variants",
    how="left", return_reasoning=True,
)
step1 = step1.rename(columns={
    "_llm_score": "_vendor_score",
    "_llm_reasoning": "_vendor_reasoning",
    "_match_method": "_vendor_method",
})

# Step 2 — match products on result of step 1
result = fuzzy_join(
    step1, df3,
    left_on="product", right_on="catalog_item",
    llm=my_llm, embed_fn=my_embed,
    context="product names — match internal descriptions to catalog items",
    how="left", return_reasoning=True,
)

Works with Any LLM

Pass any callable that takes a prompt string and returns a string.

# OpenAI
import openai
client = openai.OpenAI()
llm = lambda p: client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": p}]
).choices[0].message.content

# Anthropic
import anthropic
client = anthropic.Anthropic()
llm = lambda p: client.messages.create(
    model="claude-opus-4-5", max_tokens=512,
    messages=[{"role": "user", "content": p}]
).content[0].text

# Google Gemini
import google.generativeai as genai
model = genai.GenerativeModel("gemini-2.0-flash")
llm = lambda p: model.generate_content(p).text

# Ollama (local, free)
import ollama
llm = lambda p: ollama.chat(
    model="llama3.2", messages=[{"role": "user", "content": p}]
)["message"]["content"]

# Any custom endpoint
import requests
llm = lambda p: requests.post(
    "https://your-llm-api.com/chat",
    json={"prompt": p}
).json()["response"]

Works with Any Embedding Function

Pass any callable (list[str]) -> np.ndarray (shape [n, dim], dtype float32).

import numpy as np

# OpenAI
import openai
client = openai.OpenAI()
def my_embed(texts):
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in response.data], dtype="float32")

# Cohere
import cohere
co = cohere.Client("YOUR_KEY")
def my_embed(texts):
    response = co.embed(texts=texts, model="embed-english-v3.0", input_type="search_document")
    return np.array(response.embeddings, dtype="float32")

# sentence-transformers (local, no API key)
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
def my_embed(texts):
    return model.encode(texts, convert_to_numpy=True).astype("float32")

# Any custom endpoint
import requests
def my_embed(texts):
    response = requests.post(
        "https://your-embed-api.com/embed",
        json={"texts": texts}
    ).json()
    return np.array(response["embeddings"], dtype="float32")

Features

  • Semantic matching — joins on meaning, not character similarity
  • Two-stage pipeline — embeddings eliminate 99%+ of candidates cheaply; LLM scores only the hard cases
  • Domain context injectioncontext and column_context tell the LLM what the columns represent, improving accuracy
  • Bring-your-own LLM — any callable (str) -> str; works with OpenAI, Anthropic, Gemini, Ollama, or any custom endpoint
  • Bring-your-own embeddings — any callable (list[str]) -> np.ndarray; works with any embedding model or API
  • Full join semanticsinner, left, right, outer — same API as pd.merge
  • Multi-column join keys — pass a list to left_on / right_on; values are concatenated automatically
  • Tie handling — when multiple candidates score equally, all are returned (correct join semantics)
  • Reasoning outputreturn_reasoning=True adds _llm_score, _llm_reasoning, _embed_rank, _match_method columns
  • Embed threshold shortcutembed_threshold skips LLM entirely for obvious matches, saving cost
  • Embed fallback — if LLM fails all retries, falls back to top embed candidate automatically (_match_method="embed_fallback")
  • Cost controlstop_k, embed_threshold, max_llm_calls give full budget control
  • Retry with backoffmax_retries retries failed LLM calls with exponential backoff (1s, 2s, 4s…)
  • MIT license — use in commercial projects without restriction

Parameters

Parameter Default Description
left_on required Column name(s) in df1. Pass a list for multi-column keys.
right_on required Column name(s) in df2. Pass a list for multi-column keys.
llm required Callable (prompt: str) -> str — your LLM function
embed_fn required Callable (list[str]) -> np.ndarray — your embedding function
context required Domain context injected into LLM prompt — describe what the columns represent and what kind of match to make
column_context {} Per-column context dict {"col": "description"} — adds column-level detail to the prompt
top_k 5 Number of embedding candidates retrieved per left row before LLM scoring
threshold 0.7 Minimum LLM score (0–1) to accept a match. Scores below this are rejected.
how "inner" Join type: inner (matched pairs only) / left (all left rows) / right (all right rows) / outer (all rows both sides)
embed_threshold None If set, skip LLM when top embed score ≥ this value — match accepted as-is
max_llm_calls None Hard cap on total LLM calls. Emits a warning and returns partial result if hit.
max_retries 3 Retry failed LLM calls with exponential backoff (1s, 2s, 4s…). Set 0 to disable. On full failure, falls back to top embed candidate.
batch_size 32 Embedding batch size for embed_fn calls
return_reasoning False Append debug columns: _llm_score, _llm_reasoning, _embed_rank, _match_method ("llm" / "embed_threshold" / "embed_fallback")

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llm_join-0.2.2.tar.gz (15.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

llm_join-0.2.2-py3-none-any.whl (14.5 kB view details)

Uploaded Python 3

File details

Details for the file llm_join-0.2.2.tar.gz.

File metadata

  • Download URL: llm_join-0.2.2.tar.gz
  • Upload date:
  • Size: 15.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llm_join-0.2.2.tar.gz
Algorithm Hash digest
SHA256 d510025a64683bc132f0d0923f9dcd2dcbcc948afc2cae191a9b1855b07d550c
MD5 7ec499faf4bb50cadc102a19ff2bf10b
BLAKE2b-256 09f006c24a3806b8e43a0b6b08603a1711097ac9e12f1c1ef942b03787fe1550

See more details on using hashes here.

Provenance

The following attestation bundles were made for llm_join-0.2.2.tar.gz:

Publisher: ci.yml on adityabalki/llm-join

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file llm_join-0.2.2-py3-none-any.whl.

File metadata

  • Download URL: llm_join-0.2.2-py3-none-any.whl
  • Upload date:
  • Size: 14.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llm_join-0.2.2-py3-none-any.whl
Algorithm Hash digest
SHA256 08caaf600d0105bf76cfc6f6055802ed0aee1ed9d91bfd8ab1f27950b6430108
MD5 52264222942548a7e7e8d608655014eb
BLAKE2b-256 e1f6df52020619d09a3c6f1516207f65f14c331d51a39e6b6a6d3bb86b2772ac

See more details on using hashes here.

Provenance

The following attestation bundles were made for llm_join-0.2.2-py3-none-any.whl:

Publisher: ci.yml on adityabalki/llm-join

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page