Interactive disambiguation of rows in a dataset using value-of-information policies
Project description
rowvoi: Minimal keys and row-wise value-of-information for disambiguating tabular records
RowVoi is a Python library for interactive row disambiguation in tabular data. It helps you find the minimal set of columns needed to distinguish rows, and suggests which column to query next to reduce ambiguity most efficiently.
🎯 Key Features
- Deterministic Key Finding: Use set cover algorithms to find minimal distinguishing column sets
- Interactive Disambiguation: Step-by-step column selection using value-of-information
- Multiple Policies: Greedy coverage, mutual information, model-based, and random selection strategies
- Cost-Aware Selection: Support for column acquisition costs and budget constraints
- Comprehensive Evaluation: Tools for benchmarking and comparing different strategies
📚 Use Cases
Entity Resolution & Deduplication
- Customer Matching: Which fields (email, phone, address) uniquely identify customers?
- Product Catalogs: What attributes distinguish similar items across suppliers?
Interactive Data Collection
- Survey Optimization: Which demographic questions resolve identity ambiguity?
- Medical Diagnosis: What tests provide maximum diagnostic information?
Active Learning & Feature Selection
- Costly Features: When API calls or lab tests are expensive, which ones matter most?
- Human-in-the-Loop: Guide annotators to the most informative questions
🚀 Quick Start
Installation
pip install rowvoi
Basic Example
import pandas as pd
from rowvoi import find_key, CandidateState, GreedyCoveragePolicy, DisambiguationSession
# Your data
df = pd.DataFrame({
'name': ['Alice', 'Alice', 'Bob', 'Bob'],
'age': [25, 25, 30, 30],
'city': ['NYC', 'LA', 'NYC', 'SF'],
'email': ['a1@x.com', 'a2@x.com', 'b1@x.com', 'b2@x.com']
})
# Find minimal distinguishing columns
rows = [0, 1, 2, 3]
key = find_key(df, rows) # -> ['email']
# Interactive disambiguation
state = CandidateState.uniform([0, 1]) # Alice records
policy = GreedyCoveragePolicy()
session = DisambiguationSession(df, [0, 1], policy=policy)
# Get next question
suggestion = session.next_question()
print(f"Ask about: {suggestion.col}") # -> 'city' or 'email'
# Observe an answer and update
step = session.observe('city', 'NYC')
print(f"Remaining candidates: {session.state.candidate_rows}") # -> [0]
📖 API Overview
Core Types
from rowvoi import CandidateState, FeatureSuggestion
# Track disambiguation state
state = CandidateState(
candidate_rows=[0, 1, 2], # Possible rows
posterior=np.array([0.5, 0.3, 0.2]), # Probabilities
observed_cols={'name'}, # Asked columns
observed_values={'name': 'Alice'} # Observed values
)
# Column recommendation
suggestion = FeatureSuggestion(
col='age', # Recommended column
score=1.2, # Selection score
expected_voi=0.8, # Expected information gain
marginal_cost=2.0 # Query cost
)
Deterministic Key Finding
from rowvoi import KeyProblem, find_key, plan_key_path
# Find minimal key
key = find_key(df, rows=[0,1,2], strategy="greedy")
# Different algorithms
key_exact = find_key(df, rows, strategy="exact") # Optimal (slow)
key_sa = find_key(df, rows, strategy="sa") # Simulated annealing
key_ga = find_key(df, rows, strategy="ga") # Genetic algorithm
# Plan acquisition sequence
path = plan_key_path(df, rows, costs={'name': 1, 'email': 5})
print(path.columns()) # -> ['name', 'age', ...]
Interactive Policies
from rowvoi import GreedyCoveragePolicy, CandidateMIPolicy, MIPolicy, RandomPolicy
# Greedy pairwise coverage
policy = GreedyCoveragePolicy(
costs={'email': 5.0, 'name': 1.0},
objective='entropy' # or 'pairs'
)
# Mutual information on candidates only
policy = CandidateMIPolicy(normalize=True)
# Model-based mutual information
from rowvoi import RowVoiModel
model = RowVoiModel(noise=0.1).fit(df)
policy = MIPolicy(model=model, objective='mi_over_cost')
# Random baseline
policy = RandomPolicy(seed=42)
Interactive Sessions
from rowvoi import DisambiguationSession, StopRules
# Create session
session = DisambiguationSession(
df, [0,1,2,3],
policy=GreedyCoveragePolicy()
)
# Manual interaction
suggestion = session.next_question()
step = session.observe('age', 25)
# Automated session
stop = StopRules(max_steps=5, cost_budget=10.0, target_unique=True)
steps = session.run(stop, true_row=1) # Simulate answering
print(f"Resolved in {len(steps)} steps")
Evaluation & Benchmarking
from rowvoi import sample_candidate_sets, evaluate_policies, evaluate_keys
# Sample test cases
candidate_sets = sample_candidate_sets(df, subset_size=4, n_samples=20)
# Compare key-finding methods
methods = {
'greedy': lambda df, rows: find_key(df, rows, strategy='greedy'),
'exact': lambda df, rows: find_key(df, rows, strategy='exact')
}
key_results = evaluate_keys(df, candidate_sets, methods)
# Compare interactive policies
policies = {
'greedy': GreedyCoveragePolicy(),
'mi': CandidateMIPolicy(),
'random': RandomPolicy(seed=42)
}
policy_stats = evaluate_policies(df, candidate_sets, policies)
for stat in policy_stats:
print(f"{stat.name}: {stat.mean_steps:.1f} steps, {stat.success_rate:.1%} success")
🔬 Advanced Features
Cost-Aware Selection
# Define column costs
costs = {
'name': 1.0, # Cheap: already have
'age': 2.0, # Moderate: need to ask
'email': 10.0, # Expensive: need verification
'ssn': 50.0 # Very expensive: sensitive
}
policy = GreedyCoveragePolicy(costs=costs)
session = DisambiguationSession(df, rows, policy=policy, feature_costs=costs)
# Budget-constrained planning
path = plan_key_path(df, rows, costs=costs)
affordable_cols = path.prefix_for_budget(budget=5.0)
Model-Based Selection
# Train on historical data
model = RowVoiModel(
noise=0.05, # Account for measurement noise
normalize_cols=True # Normalize feature distributions
).fit(df)
# Use for adaptive selection
policy = MIPolicy(model=model, feature_costs=costs)
suggestion = policy.suggest(df, state)
print(f"Expected VoI: {suggestion.expected_voi:.3f} bits")
Probabilistic Methods
from rowvoi import find_key_probabilistic, plan_key_path_probabilistic
# Account for noise/uncertainty
key = find_key_probabilistic(df, rows, noise_rate=0.1)
path = plan_key_path_probabilistic(df, rows, noise_rate=0.1, costs=costs)
📊 Complete Examples
Example 1: Customer Deduplication
import pandas as pd
from rowvoi import find_key, GreedyCoveragePolicy, DisambiguationSession
# Customer database with potential duplicates
customers = pd.DataFrame({
'first_name': ['John', 'John', 'Jane', 'Jane'],
'last_name': ['Smith', 'Smith', 'Doe', 'Smith'],
'email': ['j1@ex.com', 'j2@ex.com', 'jane@ex.com', 'j3@ex.com'],
'phone': ['555-0101', '555-0102', '555-0201', '555-0301'],
'zip_code': ['10001', '10002', '10001', '10001']
})
# Find minimal fields for disambiguation
duplicates = [0, 1] # Two "John Smith" records
key = find_key(customers, duplicates)
print(f"Minimal distinguishing fields: {key}")
# Interactive disambiguation with costs
costs = {'email': 1, 'phone': 2, 'zip_code': 1, 'first_name': 0, 'last_name': 0}
policy = GreedyCoveragePolicy(costs=costs, objective='entropy')
session = DisambiguationSession(customers, duplicates, policy=policy, feature_costs=costs)
# Simulate resolving the duplicate
suggestion = session.next_question()
print(f"First question: {suggestion.col}")
step = session.observe(suggestion.col, customers.iloc[0][suggestion.col])
print(f"Resolved: {session.state.is_unique}")
Example 2: Survey Optimization
from rowvoi import CandidateMIPolicy, StopRules, evaluate_policies
# Survey response data
survey = pd.DataFrame({
'age_group': ['18-25', '26-35', '18-25', '36-45', '26-35'],
'income': ['<50k', '50-100k', '<50k', '>100k', '50-100k'],
'education': ['HS', 'College', 'HS', 'Graduate', 'College'],
'location': ['Urban', 'Suburban', 'Rural', 'Urban', 'Suburban']
})
# Compare question-asking strategies
policies = {
'coverage': GreedyCoveragePolicy(objective='entropy'),
'mutual_info': CandidateMIPolicy(normalize=True),
'random': RandomPolicy(seed=42)
}
# Test on random respondent groups
candidate_sets = sample_candidate_sets(survey, subset_size=3, n_samples=50)
stop_rules = StopRules(max_steps=3, target_unique=True)
stats = evaluate_policies(survey, candidate_sets, policies, stop=stop_rules)
for stat in stats:
print(f"{stat.name}: {stat.mean_steps:.1f} questions, "
f"{stat.success_rate:.0%} identification rate")
🧪 Algorithm Details
Set Cover for Keys
Finding minimal distinguishing columns is NP-hard set cover:
- Universe: All pairs of rows that need distinguishing
- Sets: Each column covers pairs it separates
- Goal: Minimum cost column set covering all pairs
RowVoi implements:
- Greedy: Fast O(nm log m) approximation with ln(m) ratio guarantee
- Exact: Branch-and-bound for optimal solutions (small problems)
- Metaheuristics: Simulated annealing and genetic algorithms for large problems
Value of Information
For interactive selection, RowVoi uses mutual information:
I(RowID; Column | Observed) = H(RowID | Observed) - E[H(RowID | Observed, Column)]
Where:
H(RowID | Observed): Current uncertainty (entropy) over which row is correctE[H(RowID | Observed, Column)]: Expected uncertainty after observing the column- Higher mutual information = more disambiguation value
Policy Strategies
- GreedyCoveragePolicy: Maximize newly distinguished pairs per cost
- CandidateMIPolicy: Maximize mutual information on current candidates
- MIPolicy: Use fitted model for robust MI estimation with noise handling
- RandomPolicy: Random selection baseline for comparison
📝 Development
Running Tests
uv run pytest tests/ -v
Code Quality
uv run ruff check .
uv run mypy .
Building Documentation
cd docs && make html
🤝 Contributing
Contributions welcome! Please see CONTRIBUTING.md for guidelines.
📄 License
MIT License - see LICENSE for details.
📚 Citation
@software{sood2025rowvoi,
author = {Sood, Gaurav},
title = {RowVoi: Interactive Row Disambiguation with Value-of-Information},
year = {2025},
publisher = {GitHub},
url = {https://github.com/gojiplus/rowvoi},
version = {0.2.0}
}
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file rowvoi-0.2.0.tar.gz.
File metadata
- Download URL: rowvoi-0.2.0.tar.gz
- Upload date:
- Size: 29.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
465cf55a792f19077412facc94f4aa2ae070406740c7733d9ebad2bc24f37e58
|
|
| MD5 |
0ef3ae0bdbfb75693dbd90d9e281aa3f
|
|
| BLAKE2b-256 |
38f85ea49f2e4e5b71bef1719bf92d3f800442c1ab139f90749923e5a6dc718d
|
Provenance
The following attestation bundles were made for rowvoi-0.2.0.tar.gz:
Publisher:
python-publish.yml on gojiplus/rowvoi
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
rowvoi-0.2.0.tar.gz -
Subject digest:
465cf55a792f19077412facc94f4aa2ae070406740c7733d9ebad2bc24f37e58 - Sigstore transparency entry: 725901932
- Sigstore integration time:
-
Permalink:
gojiplus/rowvoi@54e8b90065431d0bbc8e86df6ab08e4771e8b1ab -
Branch / Tag:
refs/heads/main - Owner: https://github.com/gojiplus
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
python-publish.yml@54e8b90065431d0bbc8e86df6ab08e4771e8b1ab -
Trigger Event:
workflow_dispatch
-
Statement type:
File details
Details for the file rowvoi-0.2.0-py3-none-any.whl.
File metadata
- Download URL: rowvoi-0.2.0-py3-none-any.whl
- Upload date:
- Size: 34.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c1f930224fef0ad8b80faef2ab751e6d330ac8b18022a8bf69f2dd544b6ddee5
|
|
| MD5 |
3945f03f088bec5e16db4eee53d65294
|
|
| BLAKE2b-256 |
a293332ec37b7124501081da7123cb3f6427f894bdb8bcf106911d13b1696d67
|
Provenance
The following attestation bundles were made for rowvoi-0.2.0-py3-none-any.whl:
Publisher:
python-publish.yml on gojiplus/rowvoi
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
rowvoi-0.2.0-py3-none-any.whl -
Subject digest:
c1f930224fef0ad8b80faef2ab751e6d330ac8b18022a8bf69f2dd544b6ddee5 - Sigstore transparency entry: 725901956
- Sigstore integration time:
-
Permalink:
gojiplus/rowvoi@54e8b90065431d0bbc8e86df6ab08e4771e8b1ab -
Branch / Tag:
refs/heads/main - Owner: https://github.com/gojiplus
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
python-publish.yml@54e8b90065431d0bbc8e86df6ab08e4771e8b1ab -
Trigger Event:
workflow_dispatch
-
Statement type: