A package for modeling and forecasting the effectiveness of identification techniques at scale
dataless
A Python package for modeling and forecasting the effectiveness of identification techniques at scale. It provides tools to predict how the accuracy of identification methods changes as the population size increases. The research behind this package is detailed in the paper: A scaling law to model the effectiveness of identification techniques, published in 2025 in Nature Communications.
Overview
This package helps analyze three types of identification methods:
- Exact matching: Identifying individuals using exact matches of attributes (e.g., demographics)
- Sparse matching: Identification using sparse data points (e.g., location history)
- Robust matching: Machine learning-based identification handling noisy or approximate data
Key terminology:
- κ (kappa): The fraction of people accurately identified in a population
- Gallery size: The number of individuals against which identification is attempted
- k-anonymity: A privacy property under which each combination of attributes is shared by at least k records
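The terms above can be made concrete with a small standard-library sketch. It does not use the package's API: the helper names `uniqueness` and `k_anonymity_violations`, the toy records, and the choice to count violations as the fraction of records in groups smaller than k are all assumptions made for illustration.

```python
from collections import Counter


def uniqueness(records):
    """Fraction of records whose attribute combination appears exactly once."""
    counts = Counter(records)
    return sum(1 for r in records if counts[r] == 1) / len(records)


def k_anonymity_violations(records, k):
    """Fraction of records whose attribute combination appears fewer than k times."""
    counts = Counter(records)
    return sum(1 for r in records if counts[r] < k) / len(records)


# Hypothetical toy records: (birth year, ZIP code, gender)
records = [
    (1985, "02139", "F"),
    (1985, "02139", "F"),
    (1990, "02139", "M"),
    (1990, "10001", "M"),
    (1972, "10001", "F"),
    (1972, "10001", "F"),
    (1972, "10001", "F"),
    (1964, "94110", "M"),
]

u = uniqueness(records)                    # 3 of 8 records are unique
v = k_anonymity_violations(records, k=3)   # 5 of 8 records sit in groups smaller than 3
print(u, v)
```

Only the three identical 1972 records satisfy 3-anonymity here, so the remaining five count as violations under this definition.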
Features
- Empirical Analysis: Fast numpy code to analyze identification accuracy across different gallery sizes
- Scaling Prediction: Two-parameter Bayesian model to forecast identification correctness (κ), uniqueness, and the percentage of k-anonymity violations at larger scales
- Extrapolation: Methods to extrapolate small-scale experimental results to real-world scenarios
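To illustrate what a small-scale empirical measurement might look like, the sketch below estimates uniqueness at several gallery sizes by repeated random subsampling. It is not the package's numpy implementation: the function `empirical_uniqueness`, the synthetic population, and the trial count are hypothetical, and only the standard library is used.

```python
import random
from collections import Counter


def empirical_uniqueness(records, gallery_size, trials=200, seed=0):
    """Average fraction of unique records over random galleries of a given size."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Draw a gallery without replacement from the full population
        gallery = rng.sample(records, gallery_size)
        counts = Counter(gallery)
        total += sum(1 for r in gallery if counts[r] == 1) / gallery_size
    return total / trials


# Hypothetical synthetic population: 5,000 people described by (age, region code)
rng = random.Random(42)
population = [(rng.randint(18, 90), rng.randint(1, 50)) for _ in range(5_000)]

sizes = [10, 50, 100, 500]
curve = [empirical_uniqueness(population, n) for n in sizes]
for n, u in zip(sizes, curve):
    print(n, round(u, 3))
```

Pairs of gallery sizes and measured uniqueness like these have the same shape as the `n` and `uniqueness` lists used in the Quick Example.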
Installation
From PyPI
pip install pydataless
From source (development)
This project uses uv for package management:
git clone https://github.com/synthetic-society/dataless.git
cd dataless
uv sync
Requirements
- Python ≥ 3.10
- numpy ≥ 2.0.0
- pandas ≥ 2.2.2
- scipy ≥ 1.14.0
- matplotlib ≥ 3.9.1
Getting Started
Quick Example
The main use case is predicting how identification accuracy degrades as population size increases:
from dataless import PYPExtrapolation
import numpy as np
# Step 1: Create training data with observed accuracy at small scales
# n = population size, correctness = identification accuracy (fraction correctly identified)
n = [10, 50, 100, 500]
correctness = [0.99, 0.97, 0.95, 0.90]
# Step 2: Fit the model
model = PYPExtrapolation(n, correctness=correctness)
# Step 3: Predict accuracy at larger scales
large_populations = np.array([1_000, 10_000, 100_000, 1_000_000])
predictions = model.predict(large_populations)
print(predictions)
# Example output: [0.87505023 0.79165748 0.71371539 0.64308667]
# Step 4: Get a summary of the fitted model
print(model.summary())
You can also train from uniqueness scores:
n = [10, 50, 100, 500]
uniqueness = [0.95, 0.90, 0.85, 0.80]
model = PYPExtrapolation(n, uniqueness=uniqueness)
large_populations = np.array([1_000, 10_000, 100_000, 1_000_000])
predictions = model.predict(large_populations)
print(predictions)
# Example output: [0.77117386 0.68920739 0.61585755 0.55030553]
Understanding the Output
- correctness values range from 0 to 1, where 1 means perfect identification
- The model predicts how the correctness decreases as population size grows
- This helps assess whether an identification method will remain effective at scale
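One way to read such predictions is to ask at which population size the predicted correctness drops below an acceptable level. The sketch below reuses the numbers from the Quick Example output; the helper `first_size_below` and the 0.75 threshold are hypothetical choices for illustration, not part of the package.

```python
# Populations and predicted correctness copied from the Quick Example output
populations = [1_000, 10_000, 100_000, 1_000_000]
predicted = [0.87505023, 0.79165748, 0.71371539, 0.64308667]


def first_size_below(populations, predicted, threshold):
    """Return the smallest listed population whose prediction is below threshold, or None."""
    for n, kappa in zip(populations, predicted):
        if kappa < threshold:
            return n
    return None


limit = first_size_below(populations, predicted, threshold=0.75)
print(limit)  # 100000
```

The answer is only as fine-grained as the list of population sizes passed to `predict`, so a denser grid gives a tighter estimate.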
Available Models
| Model | Description | Best for |
|---|---|---|
| `PYPExtrapolation` | Pitman-Yor Process | Most scenarios |
| `FLExtrapolation` | Entropy-based baseline | Baseline |
| `ExpDecayExtrapolation` | Exponential decay | Baseline |
| `PolynomialExtrapolation` | Polynomial fit | Baseline |
Development
Running Tests
uv sync --extra test
uv run pytest tests/ --cov=dataless --cov-report=xml --cov-report=term
Contributing
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes
- Run tests to ensure they pass
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Reporting Issues
Please report bugs and request features using the issue tracker. When reporting bugs:
- Describe what you expected to happen
- Describe what actually happened
- Include code samples and error messages if relevant
- Include version information (Python, dataless, key dependencies)
License
This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.
Acknowledgements
If you use this package in your research, please cite:
@article{rocher2025scaling,
title={A scaling law to model the effectiveness of identification techniques},
author={Rocher, Luc and Hendrickx, Julien M and Montjoye, Yves-Alexandre de},
journal={Nature Communications},
volume={16},
number={1},
pages={347},
year={2025},
publisher={Nature Publishing Group UK London}
}