A package for modeling and forecasting the effectiveness of identification techniques at scale

dataless

A Python package for modeling and forecasting the effectiveness of identification techniques at scale. It provides tools to predict how the accuracy of identification methods changes as the population size increases. The research behind this package is detailed in the paper: A scaling law to model the effectiveness of identification techniques, published in 2025 in Nature Communications.

Overview

This package helps analyze three types of identification methods:

Exact matching: Identifying individuals using exact matches of attributes (e.g., demographics)
Sparse matching: Identification using sparse data points (e.g., location history)
Robust matching: Machine learning-based identification handling noisy or approximate data
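To make the robust-matching setting concrete, the following is a minimal simulation sketch and is not code from the package. It assumes each individual is represented by a random feature vector, that a noisy copy of that vector is observed later, and that identification picks the nearest gallery entry. The function name `simulate_kappa` and all parameters are hypothetical.

```python
import random

def simulate_kappa(gallery_size, dim=8, noise=0.5, seed=0):
    """Fraction of noisy probes whose nearest gallery entry is the true individual."""
    rng = random.Random(seed)
    # One reference feature vector per individual in the gallery (synthetic data).
    gallery = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(gallery_size)]
    correct = 0
    for true_idx, reference in enumerate(gallery):
        # The probe is a noisy re-observation of the same individual.
        probe = [x + rng.gauss(0.0, noise) for x in reference]
        # Nearest-neighbour matching by squared Euclidean distance.
        best_idx = min(
            range(gallery_size),
            key=lambda j: sum((p - g) ** 2 for p, g in zip(probe, gallery[j])),
        )
        correct += best_idx == true_idx
    return correct / gallery_size

for n in (10, 50, 200):
    print(n, simulate_kappa(n))
```

Running the loop over several gallery sizes yields (n, κ) pairs of the kind used as training data in the Quick Example.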

Key terminology:

κ (kappa): The fraction of people accurately identified in a population
Gallery size: The number of individuals against which identification is attempted
k-anonymity: A privacy property requiring that each combination of attributes be shared by at least k individuals
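The terms above can be illustrated on a toy table. This is a sketch in plain Python rather than the package's NumPy implementation. The records are made up, and it assumes a record counts as a k-anonymity violation when fewer than k records share its attribute combination.

```python
from collections import Counter

# Made-up records: (birth year, sex, ZIP code).
records = [
    ("1990", "F", "02139"),
    ("1990", "F", "02139"),
    ("1985", "M", "02139"),
    ("1985", "M", "10001"),
    ("1972", "F", "10001"),
    ("1972", "F", "10001"),
    ("1972", "F", "10001"),
    ("1964", "M", "94110"),
]

def uniqueness(rows):
    """Fraction of records whose attribute combination appears exactly once."""
    counts = Counter(rows)
    return sum(1 for r in rows if counts[r] == 1) / len(rows)

def k_anonymity_violations(rows, k):
    """Fraction of records whose combination is shared by fewer than k records."""
    counts = Counter(rows)
    return sum(1 for r in rows if counts[r] < k) / len(rows)

print(uniqueness(records))                    # 0.375
print(k_anonymity_violations(records, k=3))   # 0.625
```

With k = 2 the violation rate equals the uniqueness, since only singleton combinations fall below the threshold.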

Features

Empirical Analysis: Fast NumPy code to measure identification accuracy across different gallery sizes
Scaling Prediction: Two-parameter Bayesian model to forecast identification correctness (κ), uniqueness, and the percentage of k-anonymity violations at larger scales
Extrapolation: Methods to extrapolate small-scale experimental results to real-world scenarios
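As an illustration of the empirical step, the sketch below measures uniqueness at several gallery sizes by repeated subsampling. It is plain Python and does not use the package's own analysis functions. The synthetic population, the attribute cardinalities, and the helper names are all assumptions made for the example.

```python
import random
from collections import Counter

def make_population(size, rng):
    # Hypothetical attributes: birth year (60 values), sex (2), region (50).
    return [(rng.randrange(60), rng.randrange(2), rng.randrange(50)) for _ in range(size)]

def empirical_uniqueness(population, gallery_sizes, trials=20, seed=0):
    """Average uniqueness over random galleries of each requested size."""
    rng = random.Random(seed)
    results = []
    for n in gallery_sizes:
        total = 0.0
        for _ in range(trials):
            sample = rng.sample(population, n)  # gallery drawn without replacement
            counts = Counter(sample)
            total += sum(1 for r in sample if counts[r] == 1) / n
        results.append(total / trials)
    return results

population = make_population(5_000, random.Random(42))
sizes = [10, 50, 100, 500]
curve = empirical_uniqueness(population, sizes)
for n, u in zip(sizes, curve):
    print(f"n={n:>4}  uniqueness={u:.3f}")
```

The resulting `sizes` and `curve` lists have the same shape as the `n` and `uniqueness` inputs shown in Getting Started.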

Installation

From PyPI

pip install pydataless

From source (development)

This project uses uv for package management:

git clone https://github.com/synthetic-society/dataless.git
cd dataless
uv sync

Requirements

Python ≥ 3.10
numpy ≥ 2.0.0
pandas ≥ 2.2.2
scipy ≥ 1.14.0
matplotlib ≥ 3.9.1

Getting Started

Quick Example

The main use case is predicting how identification accuracy degrades as population size increases:

from dataless import PYPExtrapolation
import numpy as np

# Step 1: Create training data with observed accuracy at small scales
# n = population size, correctness = identification accuracy (fraction correctly identified)
n = [10, 50, 100, 500]
correctness = [0.99, 0.97, 0.95, 0.90]

# Step 2: Fit the model
model = PYPExtrapolation(n, correctness=correctness)

# Step 3: Predict accuracy at larger scales
large_populations = np.array([1_000, 10_000, 100_000, 1_000_000])
predictions = model.predict(large_populations)
print(predictions)
# Example output: [0.87505023 0.79165748 0.71371539 0.64308667]

# Step 4: Get a summary of the fitted model
print(model.summary())

You can also train from uniqueness scores:

n = [10, 50, 100, 500]
uniqueness = [0.95, 0.90, 0.85, 0.80]

model = PYPExtrapolation(n, uniqueness=uniqueness)

large_populations = np.array([1_000, 10_000, 100_000, 1_000_000])
predictions = model.predict(large_populations)

print(predictions)
# Example output: [0.77117386 0.68920739 0.61585755 0.55030553]

Understanding the Output

correctness values range from 0 to 1, where 1 means perfect identification
The model predicts how the correctness decreases as population size grows
This helps assess whether an identification method will remain effective at scale
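One way to act on the predictions is to look for the population size at which predicted correctness drops below an acceptable level. The sketch below reuses the example output quoted in the Quick Example. The helper `first_size_below` and the 0.75 threshold are hypothetical and not part of the package.

```python
# Population sizes and the example predictions quoted in the Quick Example.
populations = [1_000, 10_000, 100_000, 1_000_000]
predicted = [0.87505023, 0.79165748, 0.71371539, 0.64308667]

def first_size_below(populations, predicted, threshold):
    """Return the first population size whose predicted correctness is below threshold."""
    for n, kappa in zip(populations, predicted):
        if kappa < threshold:
            return n
    return None  # the method stays above the threshold at every evaluated size

print(first_size_below(populations, predicted, 0.75))  # 100000
```

A return value of `None` means none of the evaluated sizes falls below the threshold, not that the method is effective at every scale.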

Available Models

Model	Description	Best for
`PYPExtrapolation`	Pitman-Yor Process	Most scenarios
`FLExtrapolation`	Entropy-based baseline	Baseline
`ExpDecayExtrapolation`	Exponential decay	Baseline
`PolynomialExtrapolation`	Polynomial fit	Baseline
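To show why the choice of functional form matters when extrapolating, here is a sketch of a generic exponential-decay baseline fitted by least squares on log-correctness. It is not the package's `ExpDecayExtrapolation`. The form κ(n) = a·exp(−b·n) and the fitting procedure are assumptions chosen for illustration.

```python
import math

# Training data from the Quick Example.
n = [10, 50, 100, 500]
correctness = [0.99, 0.97, 0.95, 0.90]

def fit_exp_decay(sizes, values):
    """Fit kappa(n) = a * exp(-b * n) by linear regression of log(kappa) on n."""
    ys = [math.log(v) for v in values]
    mean_x = sum(sizes) / len(sizes)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)

a, b = fit_exp_decay(n, correctness)

def predict(size):
    return a * math.exp(-b * size)

for size in (1_000, 10_000, 100_000):
    print(size, predict(size))
```

On this data the assumed exponential form falls toward zero much faster than the `PYPExtrapolation` example output quoted in the Quick Example, so the two forms give very different answers at large scale.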

Development

Running Tests

uv sync --extra test
uv run pytest tests/ --cov=dataless --cov-report=xml --cov-report=term

Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Make your changes
Run tests to ensure they pass
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Reporting Issues

Please report bugs and request features using the issue tracker. When reporting bugs:

Describe what you expected to happen
Describe what actually happened
Include code samples and error messages if relevant
Include version information (Python, dataless, key dependencies)
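The version information for a bug report can be gathered with a short standard-library script such as the sketch below. The helper name `collect_versions` is hypothetical. It assumes the package is installed under the distribution name `pydataless` used in the PyPI install command, and it takes the dependency names from the Requirements list.

```python
import platform
from importlib import metadata

def collect_versions(packages=("pydataless", "numpy", "pandas", "scipy", "matplotlib")):
    """Return the Python version and the installed version of each named distribution."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions

for name, version in collect_versions().items():
    print(f"{name}: {version}")
```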

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.

Acknowledgements

If you use this package in your research, please cite:

@article{rocher2025scaling,
  title={A scaling law to model the effectiveness of identification techniques},
  author={Rocher, Luc and Hendrickx, Julien M and Montjoye, Yves-Alexandre de},
  journal={Nature Communications},
  volume={16},
  number={1},
  pages={347},
  year={2025},
  publisher={Nature Publishing Group UK London}
}

Project details

Version 1.1.0, released Feb 9, 2026