Interactive disambiguation of rows in a dataset using value-of-information policies

rowvoi: Minimal keys and row-wise value-of-information for disambiguating tabular records

RowVoi is a Python library for interactive row disambiguation in tabular data. It helps you find the minimal set of columns needed to distinguish rows, and suggests which column to query next to reduce ambiguity most efficiently.

🎯 Key Features

Deterministic Key Finding: Use set cover algorithms to find minimal distinguishing column sets
Interactive Disambiguation: Step-by-step column selection using value-of-information
Multiple Policies: Greedy coverage, mutual information, model-based, and random selection strategies
Cost-Aware Selection: Support for column acquisition costs and budget constraints
Comprehensive Evaluation: Tools for benchmarking and comparing different strategies

📚 Use Cases

Entity Resolution & Deduplication

Customer Matching: Which fields (email, phone, address) uniquely identify customers?
Product Catalogs: What attributes distinguish similar items across suppliers?

Interactive Data Collection

Survey Optimization: Which demographic questions resolve identity ambiguity?
Medical Diagnosis: What tests provide maximum diagnostic information?

Active Learning & Feature Selection

Costly Features: When API calls or lab tests are expensive, which ones matter most?
Human-in-the-Loop: Guide annotators to the most informative questions

🚀 Quick Start

Installation

pip install rowvoi

Basic Example

import pandas as pd
from rowvoi import find_key, CandidateState, GreedyCoveragePolicy, DisambiguationSession

# Your data
df = pd.DataFrame({
    'name': ['Alice', 'Alice', 'Bob', 'Bob'],
    'age': [25, 25, 30, 30], 
    'city': ['NYC', 'LA', 'NYC', 'SF'],
    'email': ['a1@x.com', 'a2@x.com', 'b1@x.com', 'b2@x.com']
})

# Find minimal distinguishing columns
rows = [0, 1, 2, 3]
key = find_key(df, rows)  # -> ['email']

# Interactive disambiguation
state = CandidateState.uniform([0, 1])  # Alice records
policy = GreedyCoveragePolicy()
session = DisambiguationSession(df, [0, 1], policy=policy)

# Get next question
suggestion = session.next_question()
print(f"Ask about: {suggestion.col}")  # -> 'city' or 'email'

# Observe an answer and update
step = session.observe('city', 'NYC')  
print(f"Remaining candidates: {session.state.candidate_rows}")  # -> [0]
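Conceptually, observing an answer keeps only the candidate rows whose value in that column matches the answer. The sketch below mirrors that step in plain Python without calling rowvoi; `narrow_candidates` is a hypothetical helper for illustration and is not claimed to be how `DisambiguationSession.observe` is implemented.

```python
# Illustrative only: a plain-Python stand-in for the filtering step,
# not rowvoi's own code.
table = {
    "name": ["Alice", "Alice", "Bob", "Bob"],
    "city": ["NYC", "LA", "NYC", "SF"],
}

def narrow_candidates(table, candidates, col, value):
    """Keep the candidate row indices whose value in `col` equals `value`."""
    return [r for r in candidates if table[col][r] == value]

remaining = narrow_candidates(table, [0, 1], "city", "NYC")
print(remaining)  # [0]
```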

📖 API Overview

Core Types

import numpy as np
from rowvoi import CandidateState, FeatureSuggestion

# Track disambiguation state
state = CandidateState(
    candidate_rows=[0, 1, 2],           # Possible rows
    posterior=np.array([0.5, 0.3, 0.2]), # Probabilities  
    observed_cols={'name'},              # Asked columns
    observed_values={'name': 'Alice'}    # Observed values
)

# Column recommendation
suggestion = FeatureSuggestion(
    col='age',                    # Recommended column
    score=1.2,                   # Selection score
    expected_voi=0.8,           # Expected information gain
    marginal_cost=2.0           # Query cost
)

Deterministic Key Finding

from rowvoi import KeyProblem, find_key, plan_key_path

# Find minimal key
rows = [0, 1, 2]
key = find_key(df, rows=rows, strategy="greedy")

# Different algorithms  
key_exact = find_key(df, rows, strategy="exact")       # Optimal (slow)
key_sa = find_key(df, rows, strategy="sa")             # Simulated annealing
key_ga = find_key(df, rows, strategy="ga")             # Genetic algorithm

# Plan acquisition sequence
path = plan_key_path(df, rows, costs={'name': 1, 'email': 5})
print(path.columns())  # -> ['name', 'age', ...]

Interactive Policies

from rowvoi import GreedyCoveragePolicy, CandidateMIPolicy, MIPolicy, RandomPolicy

# Greedy pairwise coverage
policy = GreedyCoveragePolicy(
    costs={'email': 5.0, 'name': 1.0},
    objective='entropy'  # or 'pairs'
)

# Mutual information on candidates only
policy = CandidateMIPolicy(normalize=True)

# Model-based mutual information  
from rowvoi import RowVoiModel
model = RowVoiModel(noise=0.1).fit(df)
policy = MIPolicy(model=model, objective='mi_over_cost')

# Random baseline
policy = RandomPolicy(seed=42)

Interactive Sessions

from rowvoi import DisambiguationSession, StopRules

# Create session
session = DisambiguationSession(
    df, [0,1,2,3], 
    policy=GreedyCoveragePolicy()
)

# Manual interaction
suggestion = session.next_question()
step = session.observe('age', 25)

# Automated session
stop = StopRules(max_steps=5, cost_budget=10.0, target_unique=True)
steps = session.run(stop, true_row=1)  # Simulate answering
print(f"Resolved in {len(steps)} steps")

Evaluation & Benchmarking

from rowvoi import (
    find_key, GreedyCoveragePolicy, CandidateMIPolicy, RandomPolicy,
    sample_candidate_sets, evaluate_policies, evaluate_keys,
)

# Sample test cases
candidate_sets = sample_candidate_sets(df, subset_size=4, n_samples=20)

# Compare key-finding methods
methods = {
    'greedy': lambda df, rows: find_key(df, rows, strategy='greedy'),
    'exact': lambda df, rows: find_key(df, rows, strategy='exact')
}
key_results = evaluate_keys(df, candidate_sets, methods)

# Compare interactive policies  
policies = {
    'greedy': GreedyCoveragePolicy(),
    'mi': CandidateMIPolicy(),
    'random': RandomPolicy(seed=42)
}
policy_stats = evaluate_policies(df, candidate_sets, policies)
for stat in policy_stats:
    print(f"{stat.name}: {stat.mean_steps:.1f} steps, {stat.success_rate:.1%} success")

🔬 Advanced Features

Cost-Aware Selection

# Define column costs
costs = {
    'name': 1.0,      # Cheap: already have
    'age': 2.0,       # Moderate: need to ask  
    'email': 10.0,    # Expensive: need verification
    'ssn': 50.0       # Very expensive: sensitive
}

policy = GreedyCoveragePolicy(costs=costs)
session = DisambiguationSession(df, rows, policy=policy, feature_costs=costs)

# Budget-constrained planning
path = plan_key_path(df, rows, costs=costs)
affordable_cols = path.prefix_for_budget(budget=5.0)
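One plausible reading of a budget-constrained prefix is "take columns in planned order until the next one would exceed the budget". The sketch below implements that reading in plain Python; `prefix_within_budget` is a hypothetical helper, and both the stopping rule and the default cost of 1.0 for unlisted columns are assumptions rather than documented rowvoi behaviour.

```python
def prefix_within_budget(columns, costs, budget):
    """Return the longest prefix of `columns` whose total cost fits `budget`."""
    chosen, spent = [], 0.0
    for col in columns:
        cost = costs.get(col, 1.0)  # assumption: unlisted columns cost 1.0
        if spent + cost > budget:
            break  # stop at the first column that would overshoot
        chosen.append(col)
        spent += cost
    return chosen

costs = {"name": 1.0, "age": 2.0, "email": 10.0, "ssn": 50.0}
print(prefix_within_budget(["name", "age", "email"], costs, budget=5.0))
# ['name', 'age']
```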

Model-Based Selection

# Train on historical data
model = RowVoiModel(
    noise=0.05,           # Account for measurement noise
    normalize_cols=True   # Normalize feature distributions
).fit(df)

# Use for adaptive selection
policy = MIPolicy(model=model, feature_costs=costs)
suggestion = policy.suggest(df, state)
print(f"Expected VoI: {suggestion.expected_voi:.3f} bits")

Probabilistic Methods

from rowvoi import find_key_probabilistic, plan_key_path_probabilistic

# Account for noise/uncertainty
key = find_key_probabilistic(df, rows, noise_rate=0.1)
path = plan_key_path_probabilistic(df, rows, noise_rate=0.1, costs=costs)
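The page does not say how `noise_rate` enters the computation. One common model, shown here purely as an assumption, treats an observed answer as agreeing with a row's stored value with probability `1 - noise_rate` and disagreeing with probability `noise_rate`, then applies Bayes' rule. `noisy_posterior_update` is a hypothetical helper, not a rowvoi function.

```python
def noisy_posterior_update(prior, row_values, observed, noise_rate):
    """Bayes update over candidate rows under a simple symmetric-noise model."""
    likelihood = [
        (1.0 - noise_rate) if value == observed else noise_rate
        for value in row_values
    ]
    unnormalised = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalised)
    if total == 0:
        raise ValueError("observation is impossible under this model")
    return [u / total for u in unnormalised]

posterior = noisy_posterior_update([0.5, 0.5], ["NYC", "LA"], "NYC", noise_rate=0.1)
print([round(p, 3) for p in posterior])  # [0.9, 0.1]
```

With `noise_rate=0` this reduces to hard filtering: rows that disagree with the answer get zero probability.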

📊 Complete Examples

Example 1: Customer Deduplication

import pandas as pd
from rowvoi import find_key, GreedyCoveragePolicy, DisambiguationSession

# Customer database with potential duplicates
customers = pd.DataFrame({
    'first_name': ['John', 'John', 'Jane', 'Jane'],
    'last_name': ['Smith', 'Smith', 'Doe', 'Smith'], 
    'email': ['j1@ex.com', 'j2@ex.com', 'jane@ex.com', 'j3@ex.com'],
    'phone': ['555-0101', '555-0102', '555-0201', '555-0301'],
    'zip_code': ['10001', '10002', '10001', '10001']
})

# Find minimal fields for disambiguation
duplicates = [0, 1]  # Two "John Smith" records
key = find_key(customers, duplicates)
print(f"Minimal distinguishing fields: {key}")

# Interactive disambiguation with costs
costs = {'email': 1, 'phone': 2, 'zip_code': 1, 'first_name': 0, 'last_name': 0}
policy = GreedyCoveragePolicy(costs=costs, objective='entropy')
session = DisambiguationSession(customers, duplicates, policy=policy, feature_costs=costs)

# Simulate resolving the duplicate
suggestion = session.next_question()
print(f"First question: {suggestion.col}")
step = session.observe(suggestion.col, customers.iloc[0][suggestion.col])
print(f"Resolved: {session.state.is_unique}")

Example 2: Survey Optimization

import pandas as pd
from rowvoi import (
    CandidateMIPolicy, GreedyCoveragePolicy, RandomPolicy, StopRules,
    evaluate_policies, sample_candidate_sets,
)

# Survey response data
survey = pd.DataFrame({
    'age_group': ['18-25', '26-35', '18-25', '36-45', '26-35'],
    'income': ['<50k', '50-100k', '<50k', '>100k', '50-100k'],
    'education': ['HS', 'College', 'HS', 'Graduate', 'College'],
    'location': ['Urban', 'Suburban', 'Rural', 'Urban', 'Suburban']
})

# Compare question-asking strategies
policies = {
    'coverage': GreedyCoveragePolicy(objective='entropy'),
    'mutual_info': CandidateMIPolicy(normalize=True),
    'random': RandomPolicy(seed=42)
}

# Test on random respondent groups
candidate_sets = sample_candidate_sets(survey, subset_size=3, n_samples=50)
stop_rules = StopRules(max_steps=3, target_unique=True)

stats = evaluate_policies(survey, candidate_sets, policies, stop=stop_rules)
for stat in stats:
    print(f"{stat.name}: {stat.mean_steps:.1f} questions, "
          f"{stat.success_rate:.0%} identification rate")

🧪 Algorithm Details

Set Cover for Keys

Finding a minimal set of distinguishing columns is an instance of set cover, which is NP-hard:

Universe: All pairs of rows that need distinguishing
Sets: Each column covers pairs it separates
Goal: Minimum cost column set covering all pairs
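The formulation above can be made concrete on the Quick Start table. This is a minimal sketch in plain Python (a dict of lists stands in for the DataFrame); the variable names `universe` and `covers` are illustrative and do not come from rowvoi.

```python
from itertools import combinations

table = {
    "name": ["Alice", "Alice", "Bob", "Bob"],
    "age": [25, 25, 30, 30],
    "city": ["NYC", "LA", "NYC", "SF"],
}
rows = [0, 1, 2, 3]

# Universe: every unordered pair of rows that must be told apart.
universe = set(combinations(rows, 2))

# Sets: for each column, the pairs whose values differ in that column.
covers = {
    col: {(i, j) for i, j in universe if values[i] != values[j]}
    for col, values in table.items()
}

print(len(universe))                      # 6
print(sorted(covers["name"]))             # [(0, 2), (0, 3), (1, 2), (1, 3)]
print(sorted(universe - covers["city"]))  # [(0, 2)]
```

Here `city` alone leaves rows 0 and 2 indistinguishable, and `name` separates exactly that pair, so `{name, city}` covers the whole universe.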

RowVoi implements:

Greedy: Fast O(nm log m) approximation with ln(m) ratio guarantee
Exact: Branch-and-bound for optimal solutions (small problems)
Metaheuristics: Simulated annealing and genetic algorithms for large problems
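For reference, the textbook greedy heuristic for unweighted set cover picks, at each step, the set that covers the most still-uncovered elements. The sketch below is that generic algorithm, not rowvoi's implementation; `greedy_cover` and the toy input are hypothetical.

```python
def greedy_cover(universe, covers):
    """Repeatedly pick the set that covers the most still-uncovered elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Ties go to the first name in dict order.
        best = max(covers, key=lambda name: len(covers[name] & uncovered))
        gained = covers[best] & uncovered
        if not gained:
            raise ValueError("remaining elements cannot be covered")
        chosen.append(best)
        uncovered -= gained
    return chosen

universe = {(0, 1), (0, 2), (1, 2)}
covers = {
    "name": {(0, 2), (1, 2)},
    "city": {(0, 1), (1, 2)},
    "age": {(0, 2), (1, 2)},
}
print(greedy_cover(universe, covers))  # ['name', 'city']
```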

Value of Information

For interactive selection, RowVoi uses mutual information:

I(RowID; Column | Observed) = H(RowID | Observed) - E[H(RowID | Observed, Column)]

Where:

H(RowID | Observed): Current uncertainty (entropy) over which row is correct
E[H(RowID | Observed, Column)]: Expected uncertainty after observing the column
Higher mutual information = more disambiguation value
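When each row's value in a column is known exactly (no noise), the expectation in the formula is a sum over the distinct values of that column, weighted by the posterior mass of the rows sharing each value. The sketch below computes the quantity in bits under that assumption; `entropy` and `column_information` are hypothetical helpers, not rowvoi functions.

```python
from collections import defaultdict
from math import log2

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability entries."""
    return -sum(p * log2(p) for p in probs if p > 0)

def column_information(posterior, values):
    """I(RowID; Column) when each candidate row's column value is known exactly."""
    mass_by_value = defaultdict(list)
    for p, v in zip(posterior, values):
        mass_by_value[v].append(p)
    expected_after = 0.0
    for masses in mass_by_value.values():
        p_value = sum(masses)  # probability of observing this value
        if p_value > 0:
            expected_after += p_value * entropy([m / p_value for m in masses])
    return entropy(posterior) - expected_after

print(column_information([0.5, 0.5], ["NYC", "LA"]))       # 1.0
print(column_information([0.5, 0.5], ["Alice", "Alice"]))  # 0.0
```

For the two Alice records, `city` is worth one full bit because it always resolves the pair, while `name` is worth nothing because both rows share it.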

Policy Strategies

GreedyCoveragePolicy: Maximize newly distinguished pairs per cost
CandidateMIPolicy: Maximize mutual information on current candidates
MIPolicy: Use fitted model for robust MI estimation with noise handling
RandomPolicy: Random selection baseline for comparison
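To illustrate what "newly distinguished pairs per cost" can mean for a single step, the sketch below scores each column by the number of candidate pairs it separates divided by its cost. It is an assumption-laden illustration, not the scoring rule of `GreedyCoveragePolicy`; `pairs_split_per_cost` is hypothetical and assumes strictly positive costs.

```python
from itertools import combinations

def pairs_split_per_cost(table, candidates, col, cost):
    """Number of candidate pairs that `col` separates, divided by its cost."""
    values = table[col]
    split = sum(1 for i, j in combinations(candidates, 2) if values[i] != values[j])
    return split / cost  # assumes cost > 0

table = {
    "city": ["NYC", "LA", "NYC", "SF"],
    "email": ["a1@x.com", "a2@x.com", "b1@x.com", "b2@x.com"],
}
costs = {"city": 1.0, "email": 5.0}
candidates = [0, 1, 2, 3]

scores = {c: pairs_split_per_cost(table, candidates, c, costs[c]) for c in table}
print(scores)                       # {'city': 5.0, 'email': 1.2}
print(max(scores, key=scores.get))  # city
```

`email` separates all six pairs but costs five times as much, so the cheaper `city` wins the first step under this score.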

📝 Development

Running Tests

uv run pytest tests/ -v

Code Quality

uv run ruff check .
uv run mypy .

Building Documentation

cd docs && make html

🤝 Contributing

Contributions welcome! Please see CONTRIBUTING.md for guidelines.

📄 License

MIT License - see LICENSE for details.

📚 Citation

@software{sood2025rowvoi,
  author       = {Sood, Gaurav},
  title        = {RowVoi: Interactive Row Disambiguation with Value-of-Information},
  year         = {2025},
  publisher    = {GitHub},
  url          = {https://github.com/gojiplus/rowvoi},
  version      = {0.2.0}
}

Project details

Maintainers

soodoku

Release history

0.2.0 (this version): Nov 26, 2025
0.1.0: Nov 23, 2025

Download files

Source Distribution

rowvoi-0.2.0.tar.gz (29.7 kB)

Uploaded Nov 26, 2025 Source

Built Distribution

rowvoi-0.2.0-py3-none-any.whl (34.8 kB)

Uploaded Nov 26, 2025 Python 3

File details

Details for the file rowvoi-0.2.0.tar.gz.

File metadata

Download URL: rowvoi-0.2.0.tar.gz
Upload date: Nov 26, 2025
Size: 29.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for rowvoi-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`465cf55a792f19077412facc94f4aa2ae070406740c7733d9ebad2bc24f37e58`
MD5	`0ef3ae0bdbfb75693dbd90d9e281aa3f`
BLAKE2b-256	`38f85ea49f2e4e5b71bef1719bf92d3f800442c1ab139f90749923e5a6dc718d`
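A downloaded file can be checked against the SHA256 digest listed above with Python's standard `hashlib`. This is a minimal sketch; `sha256_of_file` is a hypothetical helper, and the local file path in the commented-out line is an assumption about where the sdist was saved.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

expected = "465cf55a792f19077412facc94f4aa2ae070406740c7733d9ebad2bc24f37e58"
# Assumed path: adjust to wherever the file was downloaded.
# assert sha256_of_file("rowvoi-0.2.0.tar.gz") == expected
```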

Provenance

The following attestation bundles were made for rowvoi-0.2.0.tar.gz:

Publisher: python-publish.yml on gojiplus/rowvoi

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: rowvoi-0.2.0.tar.gz
- Subject digest: 465cf55a792f19077412facc94f4aa2ae070406740c7733d9ebad2bc24f37e58
- Sigstore transparency entry: 725901932
- Sigstore integration time: Nov 26, 2025
Source repository:
- Permalink: gojiplus/rowvoi@54e8b90065431d0bbc8e86df6ab08e4771e8b1ab
- Branch / Tag: refs/heads/main
- Owner: https://github.com/gojiplus
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@54e8b90065431d0bbc8e86df6ab08e4771e8b1ab
- Trigger Event: workflow_dispatch

File details

Details for the file rowvoi-0.2.0-py3-none-any.whl.

File metadata

Download URL: rowvoi-0.2.0-py3-none-any.whl
Upload date: Nov 26, 2025
Size: 34.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for rowvoi-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c1f930224fef0ad8b80faef2ab751e6d330ac8b18022a8bf69f2dd544b6ddee5`
MD5	`3945f03f088bec5e16db4eee53d65294`
BLAKE2b-256	`a293332ec37b7124501081da7123cb3f6427f894bdb8bcf106911d13b1696d67`

Provenance

The following attestation bundles were made for rowvoi-0.2.0-py3-none-any.whl:

Publisher: python-publish.yml on gojiplus/rowvoi

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: rowvoi-0.2.0-py3-none-any.whl
- Subject digest: c1f930224fef0ad8b80faef2ab751e6d330ac8b18022a8bf69f2dd544b6ddee5
- Sigstore transparency entry: 725901956
- Sigstore integration time: Nov 26, 2025
Source repository:
- Permalink: gojiplus/rowvoi@54e8b90065431d0bbc8e86df6ab08e4771e8b1ab
- Branch / Tag: refs/heads/main
- Owner: https://github.com/gojiplus
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@54e8b90065431d0bbc8e86df6ab08e4771e8b1ab
- Trigger Event: workflow_dispatch
