Identity resolution for AI applications - resolve duplicates in 10 lines of Python

c1v-id

Identity resolution for AI applications

Requires Python 3.10+. License: MIT.

AI agents that interact with customers, CRMs, or any system of record face a critical decision point: is this person already in our system, or should we create a new record? Because both input and existing data are often messy, agents can confuse customer records, pollute data with duplicates, or deliver poor customer experiences.

c1v-id is an open-source identity resolution library that sits between the agent and the system of record, answering identity queries in milliseconds. It uses probabilistic record linkage with blocking strategies (~O(n) vs naive O(n²)), weighted multi-field scoring, and transitive clustering. Designed as a drop-in for LangChain agents, n8n workflows, and RAG pipelines. Zero ML dependencies. Configurable survivorship rules.

Installation

pip install c1v-id

Quick Start

Resolve duplicates in 10 lines of Python:

from c1v_id import IdentityResolver

resolver = IdentityResolver()

records = [
    {"email": "john@gmail.com", "name": "John Doe", "phone": "555-1234"},
    {"email": "john@gmail.com", "name": "J. Doe", "phone": "555-1234"},
    {"email": "jane@gmail.com", "name": "Jane Smith"},
]

golden = resolver.resolve(records)
print(f"Input: {len(records)} records → Output: {len(golden)} golden records")
# Input: 3 records → Output: 2 golden records

Match Two Records

result = resolver.match(
    {"email": "john@gmail.com", "name": "John"},
    {"email": "john@gmail.com", "name": "Johnny"}
)

print(result.score)       # 0.97
print(result.decision)    # 'auto_merge'
print(result.matched_on)  # ['email', 'name']

Find Matches in Existing Data

incoming = {"email": "john@gmail.com", "name": "John"}
existing = [
    {"id": "1", "email": "john@gmail.com", "name": "John Doe"},
    {"id": "2", "email": "jane@gmail.com", "name": "Jane Doe"},
]

matches = resolver.find_matches(incoming, existing)
# Returns best matches sorted by score

Custom Configuration

from c1v_id import IdentityResolver, ResolverConfig, Thresholds, Weights

config = ResolverConfig(
    thresholds=Thresholds(auto_merge=0.95, needs_review=0.8),
    weights=Weights(email=0.6, phone=0.3, name=0.1, address=0.0),
)

resolver = IdentityResolver(config=config)
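
To make the weights and thresholds above concrete, here is a minimal sketch of how a weighted multi-field score could be mapped to a decision. It is not c1v-id's actual implementation: `field_similarity`, `weighted_score`, and `decide` are hypothetical names, the standard library's `difflib` stands in for a real fuzzy matcher, renormalizing over the fields present in both records is an assumption, and the `no_match` label is invented for illustration.

```python
from difflib import SequenceMatcher

# Values copied from the configuration example above.
WEIGHTS = {"email": 0.6, "phone": 0.3, "name": 0.1, "address": 0.0}
AUTO_MERGE, NEEDS_REVIEW = 0.95, 0.8


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; difflib is a stand-in for a real fuzzy matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def weighted_score(left: dict, right: dict) -> float:
    """Weighted average over fields present in both records (assumed behavior)."""
    total = weight_sum = 0.0
    for field, weight in WEIGHTS.items():
        if weight > 0 and left.get(field) and right.get(field):
            total += weight * field_similarity(left[field], right[field])
            weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


def decide(score: float) -> str:
    """Map a score to a decision using the two thresholds."""
    if score >= AUTO_MERGE:
        return "auto_merge"
    if score >= NEEDS_REVIEW:
        return "needs_review"
    return "no_match"  # hypothetical label, not taken from the library


same = weighted_score(
    {"email": "john@gmail.com", "name": "John"},
    {"email": "john@gmail.com", "name": "Johnny"},
)
different = weighted_score(
    {"email": "john@gmail.com", "name": "John"},
    {"email": "jane@example.com", "name": "Jane Smith"},
)
print(decide(same), decide(different))
```

Because email carries most of the weight in this configuration, an exact email match with a slightly different name still clears the auto-merge threshold, while two records that differ on both fields fall below the review threshold.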

Why c1v-id?

vs. Splink

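|             | c1v-id             | Splink              |
|-------------|--------------------|---------------------|
| Hello World | 10 lines           | 50+ lines           |
| Target      | AI builders        | Data analysts       |
| Setup       | pip install        | Spark/DuckDB config |
| ML Required | No                 | Optional            |
| Use Case    | Real-time matching | Batch analytics     |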

Splink is powerful for large-scale data linkage projects with dedicated analysts. c1v-id is for developers who need identity resolution as a feature, not a project.

vs. dedupe

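|                | c1v-id                        | dedupe                 |
|----------------|-------------------------------|------------------------|
| Maintenance    | Active                        | Stale (2+ years)       |
| Dependencies   | 3 (pandas, rapidfuzz, pyyaml) | 10+                    |
| Learning Curve | Minimal                       | Requires training data |
| API Style      | resolve(records)              | Iterative labeling     |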

dedupe requires interactive labeling to train a model. c1v-id works out of the box with sensible defaults.

vs. Enterprise CDPs (Segment, mParticle)

|               | c1v-id              | Enterprise CDP |
|---------------|---------------------|----------------|
| Cost          | Free                | $100K+/year    |
| Data Location | Your infrastructure | Their cloud    |
| Customization | Full control        | Limited        |
| Integration   | Any Python app      | Vendor lock-in |

Enterprise CDPs solve identity as part of a larger platform. c1v-id gives you just the identity resolution piece to embed anywhere.

Core Concepts

| Concept        | What It Does                 | Why It Matters                                        |
|----------------|------------------------------|-------------------------------------------------------|
| Normalization  | Cleans emails, phones, names | John.Doe+tag@Gmail.com → johndoe@gmail.com            |
| Blocking       | Groups likely matches        | Reduces O(n²) to ~O(n)                                |
| Scoring        | Calculates similarity        | Weighted fuzzy matching across fields                 |
| Clustering     | Groups transitive matches    | If A≈B and B≈C, then A and C share a cluster          |
| Golden Records | Merges duplicates            | Best value wins per survivorship rules                |

Low-Level API

For custom pipelines, use the building blocks directly:

Normalization

from c1v_id import norm_email, norm_phone, norm_name

norm_email("John.Doe+tag@Gmail.com")  # 'johndoe@gmail.com'
norm_phone("(555) 123-4567")          # '5551234567'
norm_name("  JOHN   DOE  ")           # 'john doe'
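
For readers curious what such normalizers might do internally, the sketch below reproduces the three outputs shown above using only the standard library. The rules are inferred from those example outputs, not taken from c1v-id's source: the `sketch_*` function names are hypothetical, and restricting the dot and `+tag` stripping to Gmail addresses is an assumption.

```python
import re


def sketch_norm_email(email: str) -> str:
    """Lowercase, then drop '+tag' suffixes and dots in Gmail local parts."""
    local, _, domain = email.strip().lower().partition("@")
    if domain in {"gmail.com", "googlemail.com"}:  # assumption: Gmail-only rule
        local = local.split("+", 1)[0].replace(".", "")
    return f"{local}@{domain}"


def sketch_norm_phone(phone: str) -> str:
    """Keep digits only."""
    return re.sub(r"\D", "", phone)


def sketch_norm_name(name: str) -> str:
    """Lowercase and collapse runs of whitespace."""
    return " ".join(name.lower().split())


print(sketch_norm_email("John.Doe+tag@Gmail.com"))  # johndoe@gmail.com
print(sketch_norm_phone("(555) 123-4567"))          # 5551234567
print(sketch_norm_name("  JOHN   DOE  "))           # john doe
```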

Blocking

from c1v_id import email_domain_last4, phone_last7, make_blocks

email_domain_last4("john@gmail.com")  # 'gmail.com|john'
phone_last7("555-123-4567")           # '1234567'

# df is assumed to be a pandas DataFrame of normalized records
blocks = make_blocks(df, ["email_domain_last4", "phone_last7"])
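
The idea behind blocking can be illustrated without the library: records are grouped by a cheap key, and only records that share a key are compared. The sketch below is a standalone illustration, not c1v-id's code. `last7_digits` and `candidate_pairs` are hypothetical helpers, and the key mirrors the `phone_last7` output shown above.

```python
from collections import defaultdict
from itertools import combinations


def last7_digits(phone: str) -> str:
    """Blocking key: the last seven digits of a phone number."""
    digits = "".join(ch for ch in phone if ch.isdigit())
    return digits[-7:]


def candidate_pairs(records: list[dict]) -> list[tuple[int, int]]:
    """Index pairs that share a blocking key; other pairs are never compared."""
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        if record.get("phone"):
            blocks[last7_digits(record["phone"])].append(idx)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


records = [
    {"name": "John Doe", "phone": "555-123-4567"},
    {"name": "J. Doe", "phone": "(555) 123-4567"},
    {"name": "Jane Smith", "phone": "555-987-6543"},
    {"name": "Jane S.", "phone": "+1 555 987 6543"},
]
all_pairs = len(records) * (len(records) - 1) // 2
print(f"naive: {all_pairs} comparisons, blocked: {len(candidate_pairs(records))}")
# naive: 6 comparisons, blocked: 2
```

With small, evenly sized blocks the number of comparisons grows roughly linearly with the number of records, which is where the ~O(n) figure comes from. A single oversized block (for example, a shared switchboard number) would erode that benefit.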

Clustering

from c1v_id import UnionFind

uf = UnionFind([1, 2, 3, 4, 5])
uf.union(1, 2)
uf.union(2, 3)
uf.find(1) == uf.find(3)  # True (transitive)
uf.get_clusters()         # {1: [1, 2, 3], 4: [4], 5: [5]}
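
For reference, a union-find (disjoint-set) structure with this behavior fits in a few lines. The class below is a generic textbook sketch with a hypothetical name, `SketchUnionFind`. It is not the library's `UnionFind`, whose internals and choice of cluster keys may differ.

```python
class SketchUnionFind:
    """Minimal disjoint-set structure for transitive clustering."""

    def __init__(self, items):
        self.parent = {item: item for item in items}

    def find(self, item):
        # Walk up to the root, halving the path along the way.
        while self.parent[item] != item:
            self.parent[item] = self.parent[self.parent[item]]
            item = self.parent[item]
        return item

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            # Keep the smaller id as root so cluster keys are deterministic.
            keep, drop = sorted((root_a, root_b))
            self.parent[drop] = keep

    def clusters(self):
        grouped = {}
        for item in self.parent:
            grouped.setdefault(self.find(item), []).append(item)
        return grouped


uf = SketchUnionFind([1, 2, 3, 4, 5])
uf.union(1, 2)
uf.union(2, 3)
print(uf.find(1) == uf.find(3))  # True
print(uf.clusters())             # {1: [1, 2, 3], 4: [4], 5: [5]}
```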

Golden Records

from c1v_id import build_golden_records, SurvivorshipRule

rules = {
    "email": SurvivorshipRule.MOST_RECENT,
    "address": SurvivorshipRule.LONGEST,
    "first": SurvivorshipRule.FIRST_NON_NULL,
}

# df and clusters are assumed to come from the earlier blocking and clustering steps
golden = build_golden_records(df, clusters, rules, source_priority=["crm", "web"])
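
As a rough illustration of what the three survivorship rules named above could mean for a single cluster, here is a standalone sketch. It is an interpretation of the rule names, not the library's implementation: the helper functions and the `updated_at` column are hypothetical, and `source_priority` is not modeled.

```python
from datetime import date


def most_recent(rows, field):
    """Value from the newest row that has one ('updated_at' is a hypothetical column)."""
    filled = [r for r in rows if r.get(field)]
    return max(filled, key=lambda r: r["updated_at"])[field] if filled else None


def longest(rows, field):
    """Longest non-empty value, on the theory that it is the most complete."""
    values = [r[field] for r in rows if r.get(field)]
    return max(values, key=len) if values else None


def first_non_null(rows, field):
    """First value that is not None, in row order."""
    return next((r[field] for r in rows if r.get(field) is not None), None)


cluster = [
    {"email": "jdoe@old.example", "address": "1 Main St", "first": None,
     "updated_at": date(2023, 1, 5)},
    {"email": "john@gmail.com", "address": "1 Main Street, Apt 4", "first": "John",
     "updated_at": date(2024, 6, 1)},
]
rules = {"email": most_recent, "address": longest, "first": first_non_null}
golden = {field: rule(cluster, field) for field, rule in rules.items()}
print(golden)
# {'email': 'john@gmail.com', 'address': '1 Main Street, Apt 4', 'first': 'John'}
```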

Use Cases

  • AI Agents: Check if a customer exists before creating a new record
  • CRM Deduplication: Merge duplicate contacts from multiple sources
  • Lead Routing: Match incoming leads to existing opportunities
  • Customer Support: Find customer context across fragmented records
  • Data Migration: Deduplicate when merging systems

License

MIT
