A package for modeling and forecasting the effectiveness of identification techniques at scale

dataless

A Python package for modeling and forecasting the effectiveness of identification techniques at scale. It provides tools to predict how the accuracy of identification methods changes as the population size increases. The research behind this package is detailed in the paper: A scaling law to model the effectiveness of identification techniques, published in 2025 in Nature Communications.

Overview

This package helps analyze three types of identification methods:

  • Exact matching: Identifying individuals using exact matches of attributes (e.g., demographics)
  • Sparse matching: Identification using sparse data points (e.g., location history)
  • Robust matching: Machine learning-based identification handling noisy or approximate data

Key terminology:

  • κ (kappa): The fraction of people accurately identified in a population
  • Gallery size: The number of individuals against which identification is attempted
  • k-anonymity: A privacy property requiring that every combination of attributes in a dataset be shared by at least k individuals
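
To make the last two terms concrete, here is a minimal sketch that computes uniqueness and the fraction of k-anonymity violations on a toy dataset. It uses only the standard library and is not part of the dataless API; the function names and the records are hypothetical, chosen for illustration.

```python
from collections import Counter


def uniqueness(records):
    """Fraction of records whose attribute combination appears exactly once."""
    counts = Counter(records)
    return sum(1 for r in records if counts[r] == 1) / len(records)


def k_anonymity_violations(records, k):
    """Fraction of records whose attribute combination appears fewer than k times."""
    counts = Counter(records)
    return sum(1 for r in records if counts[r] < k) / len(records)


# Hypothetical toy dataset: (birth year, postcode prefix, gender)
records = [
    (1980, "SW7", "F"),
    (1980, "SW7", "F"),
    (1980, "SW7", "M"),
    (1975, "N1", "M"),
    (1975, "N1", "M"),
    (1975, "N1", "M"),
    (1990, "E2", "F"),
    (1962, "W1", "M"),
]

print(uniqueness(records))                 # 0.375 (3 of 8 records are unique)
print(k_anonymity_violations(records, 3))  # 0.625 (5 of 8 records are in groups smaller than 3)
```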

Features

  • Empirical Analysis: Fast numpy code to analyze identification accuracy across different gallery sizes
  • Scaling Prediction: Two-parameter Bayesian model to forecast identification correctness (κ), uniqueness, and the percentage of k-anonymity violations at larger scales
  • Extrapolation: Methods to extrapolate small-scale experimental results to real-world scenarios
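
As an illustration of what an empirical analysis across gallery sizes involves, the sketch below measures average uniqueness on random galleries of increasing size. It is a standard-library stand-in, not the package's numpy implementation; the helper names and the synthetic population are assumptions made for this example.

```python
import random
from collections import Counter


def uniqueness(records):
    """Fraction of records whose value appears exactly once in the gallery."""
    counts = Counter(records)
    return sum(1 for r in records if counts[r] == 1) / len(records)


def uniqueness_by_gallery_size(population, sizes, trials=200, seed=0):
    """Average uniqueness over `trials` random galleries of each size."""
    rng = random.Random(seed)
    results = {}
    for n in sizes:
        total = 0.0
        for _ in range(trials):
            gallery = rng.sample(population, n)  # draw n people without replacement
            total += uniqueness(gallery)
        results[n] = total / trials
    return results


# Hypothetical synthetic population: each person carries one of
# 50 equally likely attribute combinations, encoded as an integer.
rng = random.Random(42)
population = [rng.randrange(50) for _ in range(2_000)]

sizes = [1, 10, 50, 100, 500]
curve = uniqueness_by_gallery_size(population, sizes)
for n in sizes:
    print(n, round(curve[n], 3))
```

The resulting (gallery size, uniqueness) pairs have the same shape as the n and uniqueness inputs passed to PYPExtrapolation in the Quick Example.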

Installation

From PyPI

pip install pydataless

From source (development)

This project uses uv for package management:

git clone https://github.com/synthetic-society/dataless.git
cd dataless
uv sync

Requirements

  • Python ≥ 3.10
  • numpy ≥ 2.0.0
  • pandas ≥ 2.2.2
  • scipy ≥ 1.14.0
  • matplotlib ≥ 3.9.1

Getting Started

Quick Example

The main use case is predicting how identification accuracy degrades as population size increases:

from dataless import PYPExtrapolation
import numpy as np

# Step 1: Create training data with observed accuracy at small scales
# n = population size, correctness = identification accuracy (fraction correctly identified)
n = [10, 50, 100, 500]
correctness = [0.99, 0.97, 0.95, 0.90]

# Step 2: Fit the model
model = PYPExtrapolation(n, correctness=correctness)

# Step 3: Predict accuracy at larger scales
large_populations = np.array([1_000, 10_000, 100_000, 1_000_000])
predictions = model.predict(large_populations)
print(predictions)
# Example output: [0.87505023 0.79165748 0.71371539 0.64308667]

# Step 4: Get a summary of the fitted model
print(model.summary())

You can also train from uniqueness scores:

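from dataless import PYPExtrapolation
import numpy as np

# Observed uniqueness (fraction of unique individuals) at small scales
n = [10, 50, 100, 500]
uniqueness = [0.95, 0.90, 0.85, 0.80]

# Fit the model on uniqueness instead of correctness
model = PYPExtrapolation(n, uniqueness=uniqueness)

# Predict at larger scales
large_populations = np.array([1_000, 10_000, 100_000, 1_000_000])
predictions = model.predict(large_populations)

print(predictions)
# Example output: [0.77117386 0.68920739 0.61585755 0.55030553]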

Understanding the Output

  • correctness values range from 0 to 1, where 1 means perfect identification
  • The model predicts how the correctness decreases as population size grows
  • This helps assess whether an identification method will remain effective at scale
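
One way to act on these predictions is to check them against a minimum acceptable correctness. The sketch below reuses the example output printed in the Quick Example; the helper function and the thresholds are hypothetical and not part of the package.

```python
# Population sizes and predicted correctness copied from the Quick Example output
populations = [1_000, 10_000, 100_000, 1_000_000]
predicted = [0.87505023, 0.79165748, 0.71371539, 0.64308667]


def largest_scale_meeting(populations, predicted, threshold):
    """Return the largest population size whose predicted correctness
    is at least `threshold`, or None if no size qualifies."""
    meeting = [n for n, kappa in zip(populations, predicted) if kappa >= threshold]
    return max(meeting) if meeting else None


print(largest_scale_meeting(populations, predicted, 0.75))  # 10000
print(largest_scale_meeting(populations, predicted, 0.90))  # None
```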

Available Models

Model                    Description             Best for
PYPExtrapolation         Pitman-Yor Process      Most scenarios
FLExtrapolation          Entropy-based baseline  Baseline
ExpDecayExtrapolation    Exponential decay       Baseline
PolynomialExtrapolation  Polynomial fit          Baseline

Development

Running Tests

uv sync --extra test
uv run pytest tests/ --cov=dataless --cov-report=xml --cov-report=term

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Run tests to ensure they pass
  5. Commit your changes (git commit -m 'Add amazing feature')
  6. Push to the branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

Reporting Issues

Please report bugs and request features using the issue tracker. When reporting bugs:

  • Describe what you expected to happen
  • Describe what actually happened
  • Include code samples and error messages if relevant
  • Include version information (Python, dataless, key dependencies)
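
The version information can be gathered with a short standard-library script such as the sketch below. The distribution names are taken from the Installation and Requirements sections; the helper function itself is hypothetical.

```python
import platform
from importlib.metadata import PackageNotFoundError, version


def collect_versions(packages=("pydataless", "numpy", "pandas", "scipy", "matplotlib")):
    """Return a dict mapping 'python' and each package name to its installed version."""
    info = {"python": platform.python_version()}
    for name in packages:
        try:
            info[name] = version(name)
        except PackageNotFoundError:
            # Report missing packages instead of failing
            info[name] = "not installed"
    return info


for name, ver in collect_versions().items():
    print(f"{name}: {ver}")
```

Pasting the printed lines into the bug report covers the Python, dataless, and key dependency versions in one step.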

License

This project is licensed under the GNU General Public License v3.0. See the LICENSE file for details.

Acknowledgements

If you use this package in your research, please cite:

@article{rocher2025scaling,
  title={A scaling law to model the effectiveness of identification techniques},
  author={Rocher, Luc and Hendrickx, Julien M and de Montjoye, Yves-Alexandre},
  journal={Nature Communications},
  volume={16},
  number={1},
  pages={347},
  year={2025},
  publisher={Nature Publishing Group UK London}
}
