
infermap

Inference-driven schema mapping engine.
Map messy source columns to a known target schema: accurately, explainably, with zero config.


Python 3.11+ · Node 20+ · TypeScript · Edge runtime · Parity · License: MIT

📖 Wiki · 🌐 Docs · 🧪 Examples · 💬 Discussions · 🐛 Issues


infermap is a schema-mapping engine. Give it any two field collections (CSVs, DataFrames, database tables, in-memory records) and it figures out which source field corresponds to which target field, with confidence scores and human-readable reasoning. Available as a Python package on PyPI and a TypeScript package on npm, with mapping decisions verified bit-for-bit by a shared golden-test parity suite.


Install

Python

pip install infermap

Optional database extras:

pip install infermap[postgres]   # psycopg2-binary
pip install infermap[mysql]      # mysql-connector-python
pip install infermap[duckdb]     # duckdb
pip install infermap[all]        # all extras

TypeScript / Next.js

npm install infermap

Zero runtime dependencies in the core entrypoint. Compatible with Next.js Server Components, Route Handlers, Server Actions, and the Edge Runtime out of the box. See the package README for the full reference.

Quick start

Python

import infermap

# Map a CRM export CSV to a canonical customer schema
result = infermap.map("crm_export.csv", "canonical_customers.csv")

for m in result.mappings:
    print(f"{m.source} -> {m.target}  ({m.confidence:.0%})")
# fname -> first_name  (97%)
# lname -> last_name   (95%)
# email_addr -> email  (91%)

# Apply mappings to rename DataFrame columns
import polars as pl
df = pl.read_csv("crm_export.csv")
renamed = result.apply(df)

# Save mappings to a reusable config file
result.to_config("my_mapping.yaml")

# Reload later (no re-inference needed)
saved = infermap.from_config("my_mapping.yaml")

TypeScript

import { map } from "infermap";

const crm = [
  { fname: "John", lname: "Doe", email_addr: "j@d.co" },
  { fname: "Jane", lname: "Smith", email_addr: "j@s.co" },
];

const canonical = [
  { first_name: "", last_name: "", email: "" },
];

const result = map({ records: crm }, { records: canonical });

for (const m of result.mappings) {
  console.log(`${m.source} → ${m.target}  (${m.confidence.toFixed(2)})`);
}
// fname       → first_name  (0.44)
// lname       → last_name   (0.48)
// email_addr  → email       (0.69)

For Next.js, drop it directly into a Route Handler; it works on the Edge Runtime with zero config:

// app/api/infer/route.ts
import { map } from "infermap";
export const runtime = "edge";

export async function POST(req: Request) {
  const { sourceCsv, targetCsv } = await req.json();
  const result = map({ csvText: sourceCsv }, { csvText: targetCsv });
  return Response.json(result);
}

How it works

Each field pair runs through a pipeline of 7 scorers. Each scorer returns a score in [0.0, 1.0] or abstains (None/null). The engine combines scores via weighted average (requiring at least 2 contributors), then uses the Hungarian algorithm for optimal one-to-one assignment.
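The combine-and-assign step can be sketched as follows. Names and shapes here are illustrative, not infermap's internals, and the brute-force search over permutations stands in for the Hungarian algorithm (which finds the same optimal assignment efficiently):

```python
from itertools import permutations

def combine(scores, weights):
    """Weighted average over non-abstaining scorers; abstain below 2 contributors."""
    pairs = [(s, w) for s, w in zip(scores, weights) if s is not None]
    if len(pairs) < 2:
        return None
    return sum(s * w for s, w in pairs) / sum(w for _, w in pairs)

def assign(matrix):
    """Best one-to-one source-to-target assignment (brute force, for illustration)."""
    n = len(matrix)
    best = max(permutations(range(n)),
               key=lambda perm: sum(matrix[i][perm[i]] for i in range(n)))
    return list(best)

# One scorer abstains (None); the other two contribute to the weighted average.
score = combine([None, 0.95, 0.8], [1.0, 0.95, 0.4])
matrix = [[score, 0.1], [0.2, 0.8]]
print(assign(matrix))  # → [0, 1]: source 0 maps to target 0, source 1 to target 1
```

With only one contributing scorer, `combine` returns None and the pair is treated as unscored rather than trusted on a single signal.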

Scorer Weight What it detects
ExactScorer 1.0 Case-insensitive exact name match
AliasScorer 0.95 Known field aliases (fname ↔ first_name, tel ↔ phone) + domain dictionaries
InitialismScorer 0.75 Abbreviation-style names (assay_id ↔ ASSI, confidence_score ↔ CONSC)
PatternTypeScorer 0.7 Semantic type from sample values: email, date_iso, phone, uuid, url, zip, currency
ProfileScorer 0.5 Statistical profile similarity: dtype, null rate, unique rate, length, cardinality
FuzzyNameScorer 0.4 Jaro-Winkler similarity on normalized field names (with common-prefix canonicalization)
LLMScorer 0.8 Pluggable LLM-backed scorer (stubbed by default)

The engine also applies common-prefix canonicalization, automatically stripping schema-wide prefixes like prospect_ so that City vs prospect_City is compared as City vs City. Optional confidence calibration transforms raw scores into calibrated probabilities post-assignment (ECE from 0.46 to 0.005 on real-world data).
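A minimal sketch of that canonicalization; the "strip only up to the last separator" rule is an assumption for illustration, not infermap's exact logic:

```python
# Illustrative common-prefix canonicalization: find the prefix shared by
# every name in the schema, then strip it only up to the last "_" so we
# never cut a token in half.
from os.path import commonprefix

def canonicalize(names: list[str]) -> list[str]:
    prefix = commonprefix(names)
    cut = prefix.rfind("_") + 1  # strip whole tokens ending in "_" only
    return [n[cut:] for n in names] if cut else list(names)

print(canonicalize(["prospect_City", "prospect_State", "prospect_Zip"]))
# → ['City', 'State', 'Zip']
```

Names that merely share a few leading characters (fname, first_name) are left untouched, since their common prefix contains no separator.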

Read the full architecture →

Features

Python TypeScript
7 built-in scorers ✅ ✅
Hungarian assignment ✅ (scipy) ✅ (vendored)
Custom scorers @infermap.scorer defineScorer()
Domain dictionaries ✅ (YAML) ✅ (inlined)
Confidence calibration ✅ (Identity/Isotonic/Platt) ✅
Score matrix inspection ✅ ✅
In-memory data Polars, Pandas, list[dict] Array<Record>
File providers CSV, Parquet, XLSX CSV, JSON
Schema definition files YAML + JSON JSON
Database providers SQLite, Postgres, DuckDB SQLite, Postgres, DuckDB
Engine config YAML JSON
Saved mapping format YAML JSON
CLI ✅ (Typer) ✅ (node:util)
Apply to DataFrame ✅ ❌ (CSV rewrite via CLI)
Edge-runtime compatible ❌ ✅
Zero runtime deps n/a ✅
Accuracy benchmark ✅ (162 cases, F1 0.84) ✅ (parity within 0.0005)

Full feature parity matrix →

Which package should I use?

If you are… Use
Building a Python data pipeline or notebook Python
Building a Next.js app, Node service, or browser tool TypeScript
Running mapping in a serverless edge function TypeScript (zero Node built-ins)
Doing ad-hoc CSV exploration on the command line Python CLI has more features; TS CLI is leaner
Both: Python backend + Next.js admin UI Both: outputs are interoperable via the JSON config format

What's new in v0.3

+18.3pp F1 on real-world data from three compounding improvements:

v0.2 baseline      F1 0.657
+ min_conf 0.2     F1 0.765  (+10.8pp, empirically tuned threshold)
+ prefix-strip     F1 0.821  (+5.6pp, City vs prospect_City now works)
+ InitialismScorer F1 0.840  (+1.9pp, ASSI, CONSC, RELATIT now work)

New features:

  • Domain dictionaries: MapEngine(domains=["healthcare"]) loads curated aliases for your domain. Ships: generic (default), healthcare, finance, ecommerce. See examples/09_domain_dictionaries.py.
  • Confidence calibration: MapEngine(calibrator=cal) transforms raw scores into calibrated probabilities. Ships: IsotonicCalibrator, PlattCalibrator. Valentine ECE: 0.46 → 0.005. See examples/10_calibration.py.
  • InitialismScorer: matches abbreviation-style column names (assay_id ↔ ASSI). ChEMBL F1: 0.524 → 0.819.
  • Common-prefix canonicalization: automatically strips prospect_, assays_, etc. before fuzzy matching.
  • Valentine corpus: 82 real-world schema-matching cases from the Valentine benchmark for accuracy testing.
  • Full TypeScript parity: all new features ported. 186 TS tests. Benchmark F1 within 0.0005 of Python.
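For intuition on the calibration feature, a Platt-style calibrator passes raw scores through a fitted sigmoid. The `platt` function and its coefficients below are invented for illustration; infermap's PlattCalibrator fits its parameters from labeled data:

```python
import math

# Toy Platt scaling: a and b are made-up coefficients, not values the
# library ships. A fitted calibrator learns them from match/non-match pairs.
def platt(score: float, a: float = 6.0, b: float = -3.0) -> float:
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

print(round(platt(0.9), 3))  # high raw scores map near 1.0
print(round(platt(0.2), 3))  # low raw scores are pushed toward 0.0
```

The monotone squashing preserves the ranking of candidate mappings while making the numbers behave like probabilities, which is what drives the ECE improvement quoted above.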

Custom scorers

Python

import infermap
from infermap.engine import MapEngine
from infermap.scorers import default_scorers
from infermap.types import FieldInfo, ScorerResult

@infermap.scorer("prefix_scorer", weight=0.8)
def prefix_scorer(source: FieldInfo, target: FieldInfo) -> ScorerResult | None:
    if source.name[:3].lower() != target.name[:3].lower():
        return None  # abstain
    return ScorerResult(score=0.85, reasoning=f"Shared prefix '{source.name[:3]}'")

engine = MapEngine(scorers=[*default_scorers(), prefix_scorer])

TypeScript

import { MapEngine, defaultScorers, defineScorer, makeScorerResult } from "infermap";

const prefixScorer = defineScorer(
  "prefix_scorer",
  (source, target) => {
    if (source.name.slice(0, 3).toLowerCase() !== target.name.slice(0, 3).toLowerCase()) {
      return null;
    }
    return makeScorerResult(0.85, `Shared prefix '${source.name.slice(0, 3)}'`);
  },
  0.8 // weight
);

const engine = new MapEngine({
  scorers: [...defaultScorers(), prefixScorer],
});

CLI examples

The CLI works the same way in both packages:

# Map two files and print a report
infermap map crm_export.csv canonical_customers.csv

# Map and save the config (Python: --save, TS: -o)
infermap map crm_export.csv canonical_customers.csv -o mapping.json

# Apply a saved mapping to rename columns
infermap apply crm_export.csv --config mapping.json --output renamed.csv

# Inspect the schema of a file or DB table
infermap inspect crm_export.csv
infermap inspect "sqlite:///mydb.db" --table customers

# Validate a saved config against a source
infermap validate crm_export.csv --config mapping.json --required email,id --strict

Config reference

Both packages accept an engine config (scorer weight overrides + alias extensions). Python uses YAML, TypeScript uses JSON; the shape is identical.

# Python: infermap.yaml
domains:
  - healthcare
  - finance
scorers:
  LLMScorer:
    enabled: false
  FuzzyNameScorer:
    weight: 0.3
aliases:
  order_id:
    - order_num
    - ord_no
// TypeScript: infermap.config.json
{
  "scorers": {
    "LLMScorer":       { "enabled": false },
    "FuzzyNameScorer": { "weight": 0.3 }
  },
  "aliases": {
    "order_id": ["order_num", "ord_no"]
  }
}

See infermap.yaml.example for a full annotated reference.


License

MIT
