Interactive disambiguation of rows in a dataset using value-of-information policies
Project description
rowvoi: finding minimal distinguishing columns and the next best feature to query
rowvoi 🚀
rowvoi is a small library for row disambiguation in tabular data:
Given a set of candidate rows and a set of columns,
- which minimal set of columns distinguishes those rows?
- if some columns are not yet observed, which column should you query next to reduce ambiguity as quickly as possible?
It combines:
- classical functional dependencies / minimal keys (set cover on row pairs),
- information-theoretic value-of-information (mutual information between row identity and a column),
- and a light model-based layer for instance-specific, adaptive feature acquisition.
Table of Contents
- Where this fits in the literature
- Business Problems & Use Cases
- Installation
- Core Concepts
- API Overview
- Quick Start
- Documentation & Development
1. Where this fits in the literature
1.1 Deterministic keys and set cover
Suppose you have a subset of rows $R$ from a DataFrame and want the smallest set of columns $C$ such that every pair of rows in $R$ differs on at least one column in $C$. In database terms, $C$ is a key for $R$; in combinatorics, this is a set cover problem:
- Universe $U$: all unordered pairs of rows $\{i,j\}$ in $R$.
- Each column $c$ covers the pairs it distinguishes: $$S_c = \{\{i,j\} \in U : x_{ic} \neq x_{jc}\}$$
- Goal: find a minimum-cost $C$ such that $$\bigcup_{c\in C} S_c = U$$
Set cover is NP-hard; greedy set cover achieves a logarithmic approximation factor and is optimal up to constants under standard complexity assumptions. This is exactly what minimal_key_greedy implements for rows and columns.
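As an illustration of the greedy covering loop (a standalone sketch of the idea, independent of rowvoi's actual implementation), where rows are plain dicts rather than a DataFrame:

```python
from itertools import combinations

def greedy_key(rows, cols):
    """Greedy set cover sketch: rows maps row id -> {column: value}."""
    ids = list(rows)
    # Universe U: unordered pairs of rows not yet separated.
    unresolved = {frozenset(p) for p in combinations(ids, 2)}
    chosen = []
    while unresolved:
        # S_c restricted to the unresolved pairs, for a candidate column c.
        def separated(c):
            return {p for p in unresolved
                    if len({rows[i][c] for i in p}) > 1}
        best = max(cols, key=lambda c: len(separated(c)))
        gain = separated(best)
        if not gain:  # remaining pairs agree on every column
            break
        chosen.append(best)
        unresolved -= gain
    return chosen
```

On the toy table used later in the Quick Start (columns A, B, C over three rows), the loop picks the single column that separates all pairs when one exists, and otherwise accumulates columns until every pair is covered.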
1.2 Mutual information and value-of-information
Now suppose some columns are missing (because you haven't asked the user, run the test, or fetched the field yet). You have a candidate set of rows $R$ and a belief over which row is "the truth":
- random variable $\mathbf{R}$ = row identity,
- current evidence $\mathbf{E}$ (already observed columns and their values),
- candidate column $\mathbf{X}_j$ you might query next.
The value of querying column $j$ for disambiguating $\mathbf{R}$ is the conditional mutual information: $$I(\mathbf{R}; \mathbf{X}_j \mid \mathbf{E}) = H(\mathbf{R} \mid \mathbf{E}) - \mathbb{E}_{\mathbf{X}_j \mid \mathbf{E}}\left[H(\mathbf{R} \mid \mathbf{E}, \mathbf{X}_j)\right]$$ where $H(\cdot)$ is Shannon entropy. Intuitively:
- $H(\mathbf{R} \mid \mathbf{E})$: how confused you are now about which row is correct;
- $H(\mathbf{R} \mid \mathbf{E}, \mathbf{X}_j)$: how confused you'll be after seeing $\mathbf{X}_j$, averaged over the possible values it might take.
Greedy "ask the column with largest $I(\mathbf{R}; \mathbf{X}_j \mid \mathbf{E})$" is the local, data-table version of:
- information gain in decision trees,
- active feature acquisition and adaptive submodular optimization,
- and Bayesian experimental design / adaptive testing.
candidate_mi and RowVoiModel.suggest_next_feature implement this style of policy.
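When the candidate column is fully observed on every candidate row, the expectation above reduces to grouping rows by the value the column would reveal. A minimal standalone sketch (not rowvoi's internals):

```python
import math
from collections import defaultdict

def candidate_mi_sketch(col_values, posterior):
    """I(R; X_col | E) for a fully observed column on the candidate rows.

    col_values: row id -> value the column takes on that row
    posterior:  row id -> current probability that the row is the truth
    """
    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    h_now = entropy(posterior.values())  # H(R | E)
    # Group candidate rows by the value the column would reveal.
    groups = defaultdict(dict)
    for r, v in col_values.items():
        groups[v][r] = posterior[r]
    # Expected posterior entropy after observing the column.
    h_after = 0.0
    for group in groups.values():
        mass = sum(group.values())
        if mass > 0:
            h_after += mass * entropy(p / mass for p in group.values())
    return h_now - h_after
```

For two equally likely rows, a column that takes different values on them yields exactly one bit; a column on which they agree yields zero.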
1.3 Probabilistic & $\varepsilon$-relaxed disambiguation (conceptual)
In practice:
- columns might be noisy or locally constant,
- you might be happy to stop with a small residual ambiguity.
Two complementary notions of "$\varepsilon$-good enough" show up naturally:
- Posterior-based: stop when the posterior probability of the most probable row exceeds $1-\varepsilon$: $$1 - \max_r p(r \mid E) \le \varepsilon$$
- Pairwise coverage-based: stop when the fraction of unresolved row pairs is $\le \varepsilon$.
The underlying math connects to:
- probabilistic set cover / chance-constrained covering, where coverage events are random;
- stochastic submodular cover / adaptive submodular optimization, where you choose columns sequentially and observe their realizations.
rowvoi's deterministic minimal keys and MI-based policies sit on top of this theory; the package is designed so that $\varepsilon$-style stopping rules and probabilistic coverage models can be added without changing the core API.
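Both stopping rules are essentially one-liners. A sketch of what such helpers might look like (hypothetical names, not part of the current API):

```python
def posterior_stop(posterior, eps):
    # Posterior-based rule: 1 - max_r p(r | E) <= eps.
    return 1.0 - max(posterior.values()) <= eps

def pairwise_stop(unresolved_pairs, total_pairs, eps):
    # Coverage-based rule: fraction of unresolved row pairs <= eps.
    return unresolved_pairs / total_pairs <= eps
```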
2. Business Problems & Use Cases
Some concrete places where "which columns distinguish these rows?" and "what should I ask next?" show up:
Entity Resolution & Data Deduplication
- Customer Matching: Merge customer databases by asking/planning which fields (email, phone, address, DOB) to verify so that a small set of checks uniquely identifies the right record.
- Product Catalog Matching: When multiple supplier SKUs look similar, decide which attributes (brand, size, color, GTIN) to compare to separate otherwise ambiguous items.
Interactive Data Cleaning
- Record Validation: Data quality issues may leave multiple plausible matches for a record. rowvoi can suggest which field to check next (e.g. postal code vs last name) to resolve the ambiguity.
- Survey Data Linkage: When linking survey responses to a master frame, choose which demographic questions to ask (or re-ask) to uniquely identify the record.
Active Learning & Human-in-the-Loop ML
- Annotation Prioritization: Treat each candidate match or cluster of rows as a decision problem and ask humans to label the fields that most reduce confusion.
- Costly Feature Selection: For models that can request expensive features (lab tests, manual review, external API calls), use value-of-information to decide which features are worth acquiring.
Fraud Detection & Investigation
- Transaction Investigation: When several transactions or identities look suspiciously similar, choose which account details or metadata fields to inspect first.
- Identity Verification: Build a short, adaptive script of verification questions that usually identifies a user with a handful of high-information questions.
3. Installation
```shell
pip install rowvoi
```
For development:
```shell
uv pip install -e ".[dev,docs]"
```
4. Core Concepts
rowvoi is designed around a few simple ideas:
- Rows as candidates: A small set of row indices in a DataFrame df (e.g. [12, 45, 101]) that are plausible matches, models, or strategies.
- Columns as questions: DataFrame columns are candidate features you can use to distinguish those rows. Some may already be observed; others may be missing or costly.
- Keys / minimal distinguishing sets: A set of columns is a key for the candidate set if no two rows agree on all those columns. Minimal keys are the set-cover side of the problem.
- CandidateState: A lightweight object that tracks:
  - which rows are still in play,
  - your current posterior over them,
  - which columns you've already queried and what values you saw.
- Policies over columns: Given a CandidateState, a policy (like RowVoiModel) suggests the next best column to query using mutual information / expected entropy reduction.
5. API Overview
5.1 Types (rowvoi.types)
- RowIndex: Type alias for row indices (usually int).
- ColName: Type alias for column labels (any hashable).
- CandidateState: constructed as
```python
CandidateState(
    candidate_rows: list[RowIndex],
    posterior: dict[RowIndex, float],
    observed_cols: list[ColName],
    observed_values: dict[ColName, object],
)
```
Represents the state of an interactive disambiguation session.
5.2 Deterministic / Logical Methods (rowvoi.logical)
These functions assume all relevant column values are known in df and solve the minimal key / functional dependency problem.
- is_key(df, rows, cols) -> bool: Check whether the columns in cols uniquely distinguish the rows in rows (no two rows have identical values on all cols).
- minimal_key_exact(df, rows, columns=None, max_search_cols=20) -> list[ColName]: Use exhaustive search / branch-and-bound over columns to find a provably minimal key for the specified rows. Exponential in the number of columns considered; max_search_cols bounds the search.
- minimal_key_greedy(df, rows, columns=None, costs=None) -> list[ColName]: Greedy set cover on row pairs:
  - Universe = all unordered pairs of rows in rows,
  - At each step, pick the column that distinguishes the most unresolved pairs (optionally normalized by cost),
  - Stop when all pairs are separated.
  This is fast and comes with the standard logarithmic approximation guarantee for set cover.
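To make the key check concrete, here is the logic behind an is_key-style test, sketched without pandas (the function name and signature here are illustrative, not rowvoi's):

```python
def is_key_sketch(rows, cols):
    """True iff no two rows share values on all of cols.

    rows: list of dicts mapping column -> value.
    """
    signatures = set()
    for row in rows:
        sig = tuple(row[c] for c in cols)
        if sig in signatures:
            return False  # two rows collide on every chosen column
        signatures.add(sig)
    return True
```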
5.3 Mutual Information (rowvoi.mi)
Local, instance-specific mutual information between row identity and a column.
- candidate_mi(df, state: CandidateState, col: ColName) -> float: Compute $I(\mathbf{R}; \mathbf{X}_{\text{col}} \mid \mathbf{E})$ where:
  - $\mathbf{R}$ is "which row from state.candidate_rows is true?" with prior/posterior state.posterior,
  - $\mathbf{E}$ is the current evidence encoded in state.observed_cols and state.observed_values.
- best_feature_by_candidate_mi(df, state: CandidateState, candidate_cols=None) -> ColName: Among the unobserved columns (default: all DataFrame columns not in state.observed_cols), return the one with the highest mutual information with row identity.
5.4 Model-Based Value of Information (rowvoi.ml)
A small model class for reusing global information (e.g. value distributions, noise models) to make feature selection more robust.
- RowVoiModel(smoothing=1e-6, noise=0.0, normalize_cols=True): A model-based policy for next-feature selection.
- fit(df, discrete_cols=None, bins=3) -> RowVoiModel: Optionally discretizes numeric columns and precomputes the marginal distributions / metadata needed for MI approximation and noise handling.
- suggest_next_feature(df, state: CandidateState, candidate_cols=None, objective="mi", feature_costs=None) -> FeatureSuggestion: Suggest the next column to query for this state.
  - objective="mi": maximize expected mutual information $I(\mathbf{R}; \mathbf{X}_j \mid \mathbf{E})$.
  - objective="mi_over_cost": maximize MI per unit feature cost.
  - Returns a FeatureSuggestion object:
```python
FeatureSuggestion(
    col: ColName,
    voi: float,             # estimated information gain
    normalized_voi: float,  # optionally MI / H(X_j)
    details: dict[str, float],
)
```
(run_until_epsilon can be built on top to run a full session until posterior uncertainty is below a chosen $\varepsilon$.)
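Such a run_until_epsilon loop could look roughly like this; the suggest, observe, and update callables are placeholders for your own glue code (e.g. wrapping model.suggest_next_feature and a posterior update), not part of the shipped API:

```python
def run_until_epsilon(state, suggest, observe, update, eps=0.05, max_steps=10):
    """Query features until posterior uncertainty drops below eps.

    state carries a posterior dict under state["posterior"];
    suggest/observe/update are caller-supplied callables.
    """
    for _ in range(max_steps):
        # Posterior-based stopping rule: 1 - max_r p(r | E) <= eps.
        if 1.0 - max(state["posterior"].values()) <= eps:
            break
        col = suggest(state)    # e.g. model.suggest_next_feature(df, state).col
        value = observe(col)    # ask the user / run the test / fetch the field
        state = update(state, col, value)  # condition the posterior on the answer
    return state
```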
5.5 Simulation & Benchmarking (rowvoi.simulate)
Tools for testing and benchmarking policies on synthetic or real datasets.
- sample_candidate_sets(df, k: int, n_samples: int, rng=None) -> list[list[RowIndex]]: Randomly sample n_samples subsets of k rows from df.
- AcquisitionResult: A small record type storing results for one disambiguation episode (e.g., number of questions asked, whether the correct row was identified, which columns were used).
- benchmark_policy(df, candidate_sets, policy, max_steps=10) -> BenchmarkResult: Run a policy against a list of candidate sets and summarize its behavior. The policy is a callable like
```python
def policy(df, state: CandidateState) -> ColName: ...
```
or, for simpler cases, a function mapping (df, rows) to a column name.
6. Quick Start
6.1 Deterministic keys
```python
import pandas as pd
from rowvoi import minimal_key_greedy, minimal_key_exact

df = pd.DataFrame({
    "A": [1, 1, 2],
    "B": [3, 4, 3],
    "C": [5, 6, 7],
})

# Find minimal distinguishing columns for rows 0 and 1
rows = [0, 1]
print(minimal_key_greedy(df, rows))  # e.g. ['B']
print(minimal_key_exact(df, rows))   # ['B']
```
6.2 Model-based next-feature selection
```python
from rowvoi import RowVoiModel, CandidateState

# Fit model on historical data
model = RowVoiModel().fit(df)

# Two candidate rows, uniform prior, nothing observed yet
state = CandidateState(
    candidate_rows=[0, 2],
    posterior={0: 0.5, 2: 0.5},
    observed_cols=[],
    observed_values={},
)

suggestion = model.suggest_next_feature(df, state)
print(f"Next feature: {suggestion.col}")
print(f"Estimated VoI: {suggestion.voi:.3f}")
```
6.3 Mutual information directly
```python
from rowvoi import candidate_mi, best_feature_by_candidate_mi

mi_A = candidate_mi(df, state, "A")
print(f"MI between row ID and A: {mi_A:.3f}")

best_col = best_feature_by_candidate_mi(df, state)
print(f"Best column by MI: {best_col}")
```
6.4 Benchmarking a policy
```python
from rowvoi import sample_candidate_sets, benchmark_policy

# Sample 10 random pairs of rows
candidate_sets = sample_candidate_sets(df, k=2, n_samples=10)

def simple_policy(df, state):
    # use the model to suggest the next feature
    return model.suggest_next_feature(df, state).col

results = benchmark_policy(df, candidate_sets, policy=simple_policy)
print(f"Average queries needed: {results.mean_queries:.2f}")
```
7. Documentation & Development
Full documentation and examples:
👉 https://gojiplus.github.io/rowvoi/
Tests
make test
Linting & Formatting
make lint
make format
Build Docs
make docs
Local CI
make ci-docker
Citation
If you use rowvoi in your research, please cite it using:
@software{sood2025rowvoi,
author = {Sood, Gaurav},
title = {RowVoi: Row-wise Value of Information for Data Collection},
year = {2025},
publisher = {GitHub},
url = {https://github.com/gojiplus/rowvoi},
version = {0.1.0}
}
Or use the CITATION.cff file for automatic citation generation in GitHub.