Identity resolution for AI applications - resolve duplicates in 10 lines of Python
c1v-id
Identity resolution for AI applications
AI agents that interact with customers, CRMs, or any system of record face a critical decision point: is this person already in our system, or should we create a new record? Because both input and existing data are often messy, agents can confuse customer records, pollute data with duplicates, or deliver poor customer experiences.
c1v-id is an open-source identity resolution library that sits between the agent and the system of record and answers identity queries in milliseconds. It uses probabilistic record linkage with blocking strategies (roughly O(n) candidate comparisons instead of the naive O(n²)), weighted multi-field scoring, and transitive clustering. It is designed as a drop-in for LangChain agents, n8n workflows, and RAG pipelines, has zero ML dependencies, and supports configurable survivorship rules.
Installation
pip install c1v-id
Quick Start
Resolve duplicates in 10 lines of Python:
from c1v_id import IdentityResolver
resolver = IdentityResolver()
records = [
    {"email": "john@gmail.com", "name": "John Doe", "phone": "555-1234"},
    {"email": "john@gmail.com", "name": "J. Doe", "phone": "555-1234"},
    {"email": "jane@gmail.com", "name": "Jane Smith"},
]
golden = resolver.resolve(records)
print(f"Input: {len(records)} records → Output: {len(golden)} golden records")
# Input: 3 records → Output: 2 golden records
Match Two Records
result = resolver.match(
    {"email": "john@gmail.com", "name": "John"},
    {"email": "john@gmail.com", "name": "Johnny"},
)
print(result.score) # 0.97
print(result.decision) # 'auto_merge'
print(result.matched_on) # ['email', 'name']
Find Matches in Existing Data
incoming = {"email": "john@gmail.com", "name": "John"}
existing = [
    {"id": "1", "email": "john@gmail.com", "name": "John Doe"},
    {"id": "2", "email": "jane@gmail.com", "name": "Jane Doe"},
]
matches = resolver.find_matches(incoming, existing)
# Returns best matches sorted by score
Custom Configuration
from c1v_id import IdentityResolver, ResolverConfig, Thresholds, Weights
config = ResolverConfig(
    thresholds=Thresholds(auto_merge=0.95, needs_review=0.8),
    weights=Weights(email=0.6, phone=0.3, name=0.1, address=0.0),
)
resolver = IdentityResolver(config=config)
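To make the role of weights and thresholds concrete, here is a minimal standard-library sketch of weighted multi-field scoring. It is not c1v-id's implementation: `difflib` stands in for the fuzzy matcher, renormalizing over the fields both records share is an assumption, and the `no_match` label and all function names are hypothetical.

```python
from difflib import SequenceMatcher

# Values copied from the configuration above; address is omitted because its weight is 0.
WEIGHTS = {"email": 0.6, "phone": 0.3, "name": 0.1}
AUTO_MERGE = 0.95
NEEDS_REVIEW = 0.80


def field_similarity(a: str, b: str) -> float:
    """Fuzzy similarity in [0, 1]; difflib is a stand-in for a real fuzzy matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score(rec_a: dict, rec_b: dict) -> float:
    """Weighted average over the fields present in both records (an assumption)."""
    total = 0.0
    weight_sum = 0.0
    for field, weight in WEIGHTS.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a and b:
            total += weight * field_similarity(a, b)
            weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


def decide(s: float) -> str:
    """Map a score to a decision; 'no_match' is a hypothetical label."""
    if s >= AUTO_MERGE:
        return "auto_merge"
    if s >= NEEDS_REVIEW:
        return "needs_review"
    return "no_match"


a = {"email": "john@gmail.com", "name": "John"}
b = {"email": "john@gmail.com", "name": "Johnny"}
s = score(a, b)
print(round(s, 2), decide(s))  # 0.97 auto_merge
```

Raising the email weight makes an exact email match dominate the score, while the two thresholds split the score range into merge, review, and reject bands.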
Why c1v-id?
vs. Splink
| | c1v-id | Splink |
|---|---|---|
| Hello World | 10 lines | 50+ lines |
| Target | AI builders | Data analysts |
| Setup | pip install | Spark/DuckDB config |
| ML Required | No | Optional |
| Use Case | Real-time matching | Batch analytics |
Splink is powerful for large-scale data linkage projects with dedicated analysts. c1v-id is for developers who need identity resolution as a feature, not a project.
vs. dedupe
| | c1v-id | dedupe |
|---|---|---|
| Maintenance | Active | Stale (2+ years) |
| Dependencies | 3 (pandas, rapidfuzz, pyyaml) | 10+ |
| Learning Curve | Minimal | Requires training data |
| API Style | resolve(records) | Iterative labeling |
dedupe requires interactive labeling to train a model. c1v-id works out of the box with sensible defaults.
vs. Enterprise CDPs (Segment, mParticle)
| | c1v-id | Enterprise CDP |
|---|---|---|
| Cost | Free | $100K+/year |
| Data Location | Your infrastructure | Their cloud |
| Customization | Full control | Limited |
| Integration | Any Python app | Vendor lock-in |
Enterprise CDPs solve identity as part of a larger platform. c1v-id gives you just the identity resolution piece to embed anywhere.
Core Concepts
| Concept | What It Does | Why It Matters |
|---|---|---|
| Normalization | Cleans emails, phones, names | John.Doe+tag@Gmail.com → johndoe@gmail.com |
| Blocking | Groups likely matches | Reduces O(n²) to ~O(n) |
| Scoring | Calculates similarity | Weighted fuzzy matching across fields |
| Clustering | Groups transitive matches | If A≈B and B≈C, then A, B, and C land in one cluster |
| Golden Records | Merges duplicates | Best value wins per survivorship rules |
Low-Level API
For custom pipelines, use the building blocks directly:
Normalization
from c1v_id import norm_email, norm_phone, norm_name
norm_email("John.Doe+tag@Gmail.com") # 'johndoe@gmail.com'
norm_phone("(555) 123-4567") # '5551234567'
norm_name(" JOHN DOE ") # 'john doe'
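For readers curious what such normalizers might do internally, the following standard-library sketch reproduces the three outputs shown above. It is an illustrative re-implementation, not the library's code: the `sketch_*` names are hypothetical, and stripping dots only for Gmail domains is an assumption about how the dot removal is scoped.

```python
import re

# Assumption: dots in the local part are only insignificant for Gmail addresses.
GMAIL_DOMAINS = {"gmail.com", "googlemail.com"}


def sketch_norm_email(raw: str) -> str:
    """Lowercase, drop '+tag' sub-addressing, and remove Gmail dots."""
    local, _, domain = raw.strip().lower().partition("@")
    local = local.split("+", 1)[0]
    if domain in GMAIL_DOMAINS:
        local = local.replace(".", "")
    return f"{local}@{domain}"


def sketch_norm_phone(raw: str) -> str:
    """Keep digits only."""
    return re.sub(r"\D", "", raw)


def sketch_norm_name(raw: str) -> str:
    """Lowercase and collapse runs of whitespace."""
    return " ".join(raw.lower().split())


print(sketch_norm_email("John.Doe+tag@Gmail.com"))  # johndoe@gmail.com
print(sketch_norm_phone("(555) 123-4567"))          # 5551234567
print(sketch_norm_name("  JOHN   DOE  "))           # john doe
```

The sketch assumes well-formed input; a production normalizer would also need to handle missing `@` signs, country codes, and similar edge cases.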
Blocking
from c1v_id import email_domain_last4, phone_last7, make_blocks
email_domain_last4("john@gmail.com") # 'gmail.com|john'
phone_last7("555-123-4567") # '1234567'
blocks = make_blocks(df, ["email_domain_last4", "phone_last7"])
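The idea behind blocking is that only records sharing a cheap key are ever compared. The sketch below illustrates this with plain dictionaries rather than a DataFrame. The key functions mimic the outputs shown above, but all names here (`sketch_email_key`, `sketch_phone_key`, `candidate_pairs`) are hypothetical and the logic is an assumption, not c1v-id's `make_blocks`.

```python
from collections import defaultdict
from itertools import combinations


def sketch_email_key(email: str) -> str:
    """Domain plus the last four characters of the local part."""
    local, _, domain = email.lower().partition("@")
    return f"{domain}|{local[-4:]}"


def sketch_phone_key(phone: str) -> str:
    """Last seven digits of the phone number."""
    digits = "".join(ch for ch in phone if ch.isdigit())
    return digits[-7:]


def candidate_pairs(records):
    """Return index pairs (i, j), i < j, of records that share at least one block."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        if rec.get("email"):
            blocks["email:" + sketch_email_key(rec["email"])].append(idx)
        if rec.get("phone"):
            blocks["phone:" + sketch_phone_key(rec["phone"])].append(idx)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(members, 2))
    return pairs


records = [
    {"email": "john@gmail.com", "name": "John Doe", "phone": "555-1234"},
    {"email": "john@gmail.com", "name": "J. Doe", "phone": "555-1234"},
    {"email": "jane@gmail.com", "name": "Jane Smith"},
]
print(candidate_pairs(records))  # {(0, 1)}
```

Here the naive approach would score all three possible pairs, while blocking leaves a single candidate pair. The savings grow with dataset size as long as individual blocks stay small.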
Clustering
from c1v_id import UnionFind
uf = UnionFind([1, 2, 3, 4, 5])
uf.union(1, 2)
uf.union(2, 3)
uf.find(1) == uf.find(3) # True (transitive)
uf.get_clusters() # {1: [1, 2, 3], 4: [4], 5: [5]}
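Transitive clustering of this kind is typically built on a disjoint-set (union-find) structure. The following is a minimal pure-Python sketch of that technique with path compression. `SketchUnionFind` is a hypothetical name, and the real `UnionFind` class may differ in details such as union by rank or how cluster keys are chosen.

```python
class SketchUnionFind:
    """Minimal disjoint-set with path compression (no union by rank)."""

    def __init__(self, items):
        # Every item starts as the root of its own single-member cluster.
        self.parent = {item: item for item in items}

    def find(self, item):
        # Walk up to the root.
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[item] != root:
            nxt = self.parent[item]
            self.parent[item] = root
            item = nxt
        return root

    def union(self, a, b):
        # Attach b's root under a's root so a's root stays the cluster key.
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def get_clusters(self):
        clusters = {}
        for item in self.parent:
            clusters.setdefault(self.find(item), []).append(item)
        return clusters


uf = SketchUnionFind([1, 2, 3, 4, 5])
uf.union(1, 2)
uf.union(2, 3)
print(uf.find(1) == uf.find(3))  # True
print(uf.get_clusters())         # {1: [1, 2, 3], 4: [4], 5: [5]}
```

Each pair that scores above the merge threshold becomes one `union` call, so matches found in different blocks still end up in the same cluster.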
Golden Records
from c1v_id import build_golden_records, SurvivorshipRule
rules = {
    "email": SurvivorshipRule.MOST_RECENT,
    "address": SurvivorshipRule.LONGEST,
    "first": SurvivorshipRule.FIRST_NON_NULL,
}
golden = build_golden_records(df, clusters, rules, source_priority=["crm", "web"])
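To show what per-field survivorship rules could mean in practice, here is a small sketch that merges one cluster of plain dictionaries. It is an interpretation of the rule names above, not the library's logic: the `updated_at` field, the helper functions, and `merge_cluster` are all hypothetical, and `source_priority` is ignored for brevity.

```python
from datetime import date


def most_recent(rows, field):
    """Value from the most recently updated row that has the field."""
    candidates = [r for r in rows if r.get(field)]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["updated_at"])[field]


def longest(rows, field):
    """Longest non-empty value, on the theory that it carries the most detail."""
    values = [r[field] for r in rows if r.get(field)]
    return max(values, key=len) if values else None


def first_non_null(rows, field):
    """First value that is not None, in row order."""
    return next((r[field] for r in rows if r.get(field) is not None), None)


# Hypothetical mapping that mirrors the rules dictionary above.
RULES = {"email": most_recent, "address": longest, "first": first_non_null}


def merge_cluster(rows):
    """Build one golden record by applying each field's rule to the cluster."""
    return {field: rule(rows, field) for field, rule in RULES.items()}


cluster = [
    {"email": "jdoe@old.example", "address": "12 Main St", "first": None,
     "updated_at": date(2023, 1, 5)},
    {"email": "john@gmail.com", "address": "12 Main Street, Apt 4", "first": "John",
     "updated_at": date(2024, 6, 1)},
]
print(merge_cluster(cluster))
# {'email': 'john@gmail.com', 'address': '12 Main Street, Apt 4', 'first': 'John'}
```

Choosing a rule per field lets the golden record combine the freshest email with the most complete address, even when they come from different source rows.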
Use Cases
- AI Agents: Check if a customer exists before creating a new record
- CRM Deduplication: Merge duplicate contacts from multiple sources
- Lead Routing: Match incoming leads to existing opportunities
- Customer Support: Find customer context across fragmented records
- Data Migration: Deduplicate when merging systems
License
MIT