AnoMed Challenge

A library aiding to create challenge web servers for the AnoMed competition platform.

Preliminaries

The AnoMed platform is essentially a network of web servers that use web APIs to exchange data and provide functionality to each other. Challenge web servers provide training and evaluation data, which may be requested via HTTP. They also offer means to evaluate the utility of anonymizers (privacy preserving machine learning models) via HTTP, as well as means to estimate the privacy of anonymizers via attacks on them (which we refer to as "deanonymizers" below). Anonymizer web servers offer input/output access, so that they may be attacked by deanonymizers. For more details about anonymizers or deanonymizers, see their corresponding repositories.

In general, you are free to create your own kind of challenge web server, as long as it offers some well-described APIs and follows the general principles we describe below. You do not need to use this library to submit challenges. However, if you would like to focus on defining the challenge itself, without worrying about web server related questions, use this library to generate web servers "for free" that integrate well with the AnoMed platform.

How to Create Challenge Web Servers (for selected use cases)

If your goal is to create a challenge that fits one of the following selected cases, you may use this library's templates to create a challenge web server with minimal effort.

Supervised Learning Challenge with Membership Inference Attack Threat Model

This scenario assumes that solutions to your challenge (i.e. anonymizers) may be trained using only a single NumPy feature array X (no multiple input arrays) and a NumPy array of target values y. The data is split into the traditional three parts: training data (for adjusting the weights), tuning data (for adjusting the hyperparameters) and validation data (for the final evaluation).

The threat model states that membership inference attacks (MIAs) are of interest and may be used to practically estimate the privacy properties of the anonymizers, which claim to be privacy preserving machine learning models. Briefly, an MIA is given a data sample and its goal is to estimate (better than random guessing) whether that sample was part of the attacked model's training data. The MIA's true positive rate at a low false positive rate threshold serves as an indicator of how well anonymizers preserve the confidentiality of their training data.
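To make the "true positive rate at a low false positive rate" metric concrete, here is a minimal NumPy sketch. The function name, signature and the 10% FPR bound are illustrative assumptions, not the library's actual evaluate_MIA:

```python
import numpy as np


def tpr_at_low_fpr(membership_scores, true_membership, max_fpr=0.1):
    """Return the best true positive rate an attack achieves at any
    decision threshold whose false positive rate stays at or below
    max_fpr. Higher values mean the attack (and thus the privacy
    leak) is stronger."""
    scores = np.asarray(membership_scores, dtype=float)
    truth = np.asarray(true_membership, dtype=bool)
    best_tpr = 0.0
    for threshold in np.unique(scores):
        predicted_member = scores >= threshold
        fp = np.sum(predicted_member & ~truth)
        tp = np.sum(predicted_member & truth)
        fpr = fp / max(np.sum(~truth), 1)
        tpr = tp / max(np.sum(truth), 1)
        if fpr <= max_fpr:
            best_tpr = max(best_tpr, float(tpr))
    return best_tpr
```

An attack that separates members from non-members perfectly scores 1.0; one that assigns every sample the same score cannot stay below the FPR bound and scores 0.0.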

MIAs are given a subset of the training data samples (members) and a subset of the validation data samples (non-members) to train on, before they are evaluated.

In the following example, we create a challenge web server (based on the Falcon web framework) that serves the famous iris dataset and uses plain binary accuracy as an anonymizer utility evaluation metric:

import anomed_challenge as anochal
from sklearn import datasets, model_selection

iris = datasets.load_iris()

X = iris.data  # type: ignore
y = iris.target  # type: ignore

X_train, X_other, y_train, y_other = model_selection.train_test_split(
    X, y, test_size=0.3, random_state=42
)
X_tune, X_val, y_tune, y_val = model_selection.train_test_split(
    X_other, y_other, test_size=0.5, random_state=21
)

example_challenge = anochal.SupervisedLearningMIAChallenge(
    training_data=anochal.InMemoryNumpyArrays(X=X_train, y=y_train),
    tuning_data=anochal.InMemoryNumpyArrays(X=X_tune, y=y_tune),
    validation_data=anochal.InMemoryNumpyArrays(X=X_val, y=y_val),
    anonymizer_evaluator=anochal.strict_binary_accuracy,
    MIA_evaluator=anochal.evaluate_MIA,
    MIA_evaluation_dataset_length=5,
)

# This is what Gunicorn expects
application = anochal.supervised_learning_MIA_challenge_server_factory(
    example_challenge
)

The variables *_train, *_tune and *_val contain the challenge data, split as described above. The custom datatype InMemoryNumpyArrays merely bundles features and targets into one object.
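Conceptually, such a bundle is little more than a pair of arrays with a consistency check. The following dataclass is a hypothetical stand-in, not the library's actual definition of InMemoryNumpyArrays:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class FeatureTargetBundle:
    """Hypothetical sketch of a features-plus-targets container."""

    X: np.ndarray
    y: np.ndarray

    def __post_init__(self):
        # Both arrays must describe the same number of samples.
        if len(self.X) != len(self.y):
            raise ValueError("X and y must contain the same number of samples")
```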

The class SupervisedLearningMIAChallenge is the core of this example: it bundles the challenge-specific parameters. Passing it to supervised_learning_MIA_challenge_server_factory returns a WSGI compatible Falcon web app, which may be utilized by Gunicorn + nginx to create a full-grown web server. The arguments training_data, tuning_data and validation_data are self-explanatory. anonymizer_evaluator is a function which compares the validation data target values (ground truth) with an anonymizer's prediction and returns float valued statistics describing the anonymizer's performance. MIA_evaluator is a function which compares the estimated memberships with the ground truth memberships and returns float valued statistics describing the MIA's performance. MIA_evaluation_dataset_length determines the number of members and also the number of non-members to use for MIA success evaluation (so the total number of samples is twice this value). If possible, set this value to at least 100.
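To illustrate the evaluator contract, here is a sketch of what a metric like strict_binary_accuracy could compute: ground truth in, prediction in, plain float out. This is a hypothetical re-implementation for illustration, not the library's code:

```python
import numpy as np


def binary_accuracy_sketch(y_true, y_pred):
    """Fraction of exactly matching predictions, as a plain float.
    Refuses shape mismatches instead of silently broadcasting."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError("shape mismatch between ground truth and prediction")
    return float(np.mean(y_true == y_pred))
```

Returning a scalar float (rather than an array) matters, because the platform ranks and plots submissions by these values.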

The web app application serves these routes:

  • [GET] / (this displays an "alive message")
  • [GET] /data/anonymizer/training (this will serve X_train and y_train)
  • [GET] /data/anonymizer/tuning (this will serve X_tune)
  • [GET] /data/anonymizer/validation (this will serve X_val)
  • [GET] /data/deanonymizer/members (this will serve a subset of X_train and y_train)
  • [GET] /data/deanonymizer/non-members (this will serve a subset of X_val and y_val)
  • [GET] /data/attack-success-evaluation (this will serve data from the complement of members and also from the complement of non-members).
  • [POST] /utility/anonymizer (this will evaluate the quality of an anonymizer's prediction compared to y_tune or y_val.)
  • [POST] /utility/deanonymizer (this will evaluate the quality of a deanonymizer's prediction compared to the attack-success-evaluation data)
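A submission container would consume these routes over HTTP. The sketch below shows one plausible client, assuming the training route streams a compressed NumPy archive (.npz) with entries "X" and "y"; the base URL, entry names and wire format are assumptions, so check the challenge description for the actual format:

```python
import io
import urllib.request

import numpy as np

CHALLENGE_URL = "http://localhost:8000"  # assumed deployment address


def decode_numpy_payload(payload: bytes):
    """Decode a compressed NumPy archive into (X, y) arrays."""
    arrays = np.load(io.BytesIO(payload))
    return arrays["X"], arrays["y"]


def fetch_training_data(base_url: str = CHALLENGE_URL):
    """Request the training split from the challenge server and
    decode it. Assumes the /data/anonymizer/training route from the
    list above."""
    with urllib.request.urlopen(f"{base_url}/data/anonymizer/training", timeout=30) as response:
        return decode_numpy_payload(response.read())
```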

Dataset Anonymization Challenge with ??? Threat Model

TODO

Dataset Synthesis Challenge with ??? Threat Model

TODO

How To Create Challenge Web Servers Without Template

In case your challenge is not covered by one of the available templates, we suggest that you also use the Falcon web framework and make use of at least some of the available resource building blocks. Besides that, you should pay attention to the following principles when implementing your challenge:

  • Challenges and submissions will not get any internet access when running on the AnoMed platform. Make your challenge self-contained.
  • Platform users will not get access to challenge data; only submission containers (but not their creators) may access it. That means submission contributors have to create their model blueprints »blindly« and cannot inspect the data while tuning hyperparameters. To make life a little easier, we suggest providing dummy data outside of the platform, of the same type and shape as the challenge data but with innocuous content. You may link to it from within your challenge description.
  • Explain your API well in the challenge description, so that custom submissions can easily conform to it. Template anonymizers and template deanonymizers are likely incompatible with your custom challenge.
  • Provide a default route / which, upon a GET request, returns a JSON encoded message like "Challenge server is alive!" for diagnosis.
  • Challenge data used to fit and evaluate anonymizers or deanonymizers should be the same for each submission, to allow for a fair comparison.
  • Evaluation data should be disjoint from training data.
  • Utility and privacy metrics should be floating point scalars, to allow for plotting and ranking; vectors or even more complex statistics are not suitable for that. They should also be clearly defined and fixed before the first submission comes in, and not changed retroactively.
  • There should be a way to obtain intermediate evaluation results for hyperparameter tuning (e.g. with respect to tuning data, if there is any). The final evaluation, however, should be accessible only once per submission, to limit validation data leakage; further requests should be rejected.
  • Try to find a good compromise between required network capacity and ease of use when sending data to submissions over the web. For example, sending raw NumPy arrays, or even plain JSON, uses no compression and usually requires large network capacity; compressed files, on the other hand, might require further processing in downstream tasks. In the supervised learning scenario above, we used compressed streams of NumPy arrays plus utility functions to make working with them comfortable.
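Two of these principles (the alive route and the one-shot final evaluation) can be sketched framework-agnostically as a plain WSGI app. The route name /utility/final, the X-Submission-Id header and the placeholder metric are illustrative assumptions; a real challenge would more likely use Falcon resources as suggested above:

```python
import json

# Submission ids that already received their final evaluation.
_final_evaluations_done = set()


def challenge_app(environ, start_response):
    """Minimal WSGI sketch: an alive route at "/" and a hypothetical
    final-evaluation route that each submission may hit only once."""
    path = environ.get("PATH_INFO", "/")
    if path == "/":
        start_response("200 OK", [("Content-Type", "application/json")])
        return [json.dumps({"message": "Challenge server is alive!"}).encode()]
    if path == "/utility/final":  # hypothetical route name
        submission = environ.get("HTTP_X_SUBMISSION_ID", "unknown")
        if submission in _final_evaluations_done:
            # Reject repeated requests to limit validation data leakage.
            start_response("403 Forbidden", [("Content-Type", "application/json")])
            return [json.dumps({"error": "final evaluation already requested"}).encode()]
        _final_evaluations_done.add(submission)
        start_response("200 OK", [("Content-Type", "application/json")])
        return [json.dumps({"utility": 0.0}).encode()]  # placeholder metric
    start_response("404 Not Found", [("Content-Type", "application/json")])
    return [json.dumps({"error": "no such route"}).encode()]
```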

