AnoMed Challenge
A library for creating challenge web servers for the AnoMed competition platform.
Preliminaries
The AnoMed platform is essentially a network of web servers that exchange data and provide functionality to each other via web APIs. Challenge web servers provide training and evaluation data, which may be requested via HTTP. They also offer HTTP endpoints to evaluate the utility of anonymizers (privacy preserving machine learning models) and to estimate the privacy of anonymizers via attacks on them (the attackers are referred to as "deanonymizers" below). Anonymizer web servers offer input/output access, such that they may be attacked by deanonymizers. For more details about anonymizers or deanonymizers, see their corresponding repositories.
In general, you are free to create your own kind of challenge web server, as long as it offers well-described APIs and follows the general principles described below. You do not need to use this library to submit challenges. However, if you would like to focus on defining the challenge itself, without dealing with web server details, use this library to generate web servers "for free" that integrate well with the AnoMed platform.
How to Create Challenge Web Servers (for selected use cases)
If your goal is to create a challenge that fits one of the following selected cases, you may use this library's templates to create a challenge web server with minimal effort.
Supervised Learning Challenge with Membership Inference Attack Threat Model
This scenario assumes that solutions to your challenge (i.e. anonymizers) may be trained using only a single NumPy feature array X (no multiple input arrays) and a NumPy array of target values y. The data is split traditionally into three parts: training data (for adjusting model weights), tuning data (for adjusting hyperparameters) and validation data (for the final evaluation).
The threat model states that membership inference attacks (MIAs) are of interest and may be used to practically estimate the privacy properties of the anonymizers, which claim to be privacy preserving machine learning models. Briefly, an MIA is given a data sample, and its goal is to estimate, better than random guessing, whether that sample was part of its target's training data. The MIA's true positive rate at a low false positive rate threshold serves as an indicator of how well anonymizers preserve the confidentiality of their training data.
MIAs are given a subset of the training data samples (members) and a subset of the validation data samples (non-members) to train on before they are evaluated.
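The evaluation criterion above (true positive rate at a low false positive rate) can be sketched in plain Python. Note that this function is an illustrative stand-in, not the library's actual `evaluate_MIA` implementation:

```python
def tpr_at_fpr(scores, is_member, max_fpr=0.1):
    """True positive rate of the best decision threshold whose false
    positive rate stays at or below `max_fpr`. `scores` are the MIA's
    membership confidence scores, `is_member` the ground truth labels."""
    positives = sum(is_member)
    negatives = len(is_member) - positives
    best_tpr = 0.0
    # Try every observed score as a threshold ("member" if score >= t).
    for t in sorted(set(scores)):
        tp = sum(1 for s, m in zip(scores, is_member) if s >= t and m)
        fp = sum(1 for s, m in zip(scores, is_member) if s >= t and not m)
        if fp / negatives <= max_fpr:
            best_tpr = max(best_tpr, tp / positives)
    return best_tpr

# A perfect attack assigns members higher scores than non-members.
print(tpr_at_fpr([0.9, 0.8, 0.2, 0.1], [True, True, False, False]))  # 1.0
```

An attack that performs no better than chance would score close to `max_fpr` on large samples, which is why this metric rewards attacks that are confident at strict false positive budgets.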
In the following example, we create a challenge web server (based on the Falcon web framework) that serves the famous iris dataset and uses plain binary accuracy as an anonymizer utility evaluation metric:
```python
import anomed_challenge as anochal
from sklearn import datasets, model_selection

iris = datasets.load_iris()
X = iris.data  # type: ignore
y = iris.target  # type: ignore

X_train, X_other, y_train, y_other = model_selection.train_test_split(
    X, y, test_size=0.3, random_state=42
)
X_tune, X_val, y_tune, y_val = model_selection.train_test_split(
    X_other, y_other, test_size=0.5, random_state=21
)

example_challenge = anochal.SupervisedLearningMIAChallenge(
    training_data=anochal.InMemoryNumpyArrays(X=X_train, y=y_train),
    tuning_data=anochal.InMemoryNumpyArrays(X=X_tune, y=y_tune),
    validation_data=anochal.InMemoryNumpyArrays(X=X_val, y=y_val),
    anonymizer_evaluator=anochal.strict_binary_accuracy,
    MIA_evaluator=anochal.evaluate_MIA,
    MIA_evaluation_dataset_length=5,
)

# This is what Gunicorn expects
application = anochal.supervised_learning_MIA_challenge_server_factory(
    example_challenge
)
```
The variables *_train, *_tune and *_val contain the challenge data, split
as mentioned above. The custom datatype InMemoryNumpyArrays merely bundles
features and targets into one object.
The class SupervisedLearningMIAChallenge is the core of this example. It bundles the challenge-specific parameters into a challenge object, which supervised_learning_MIA_challenge_server_factory then turns into a WSGI-compatible Falcon web app that may be served by Gunicorn + nginx to form a full-grown web server. The arguments training_data, tuning_data and validation_data are self-explanatory. anonymizer_evaluator is a function which compares the validation data target values (ground truth) with an anonymizer's prediction and returns float-valued statistics describing the anonymizer's performance. MIA_evaluator is a function which compares the estimated memberships with the ground truth memberships and returns float-valued statistics describing the MIA's performance. MIA_evaluation_dataset_length determines the number of members and also the number of non-members to use for MIA success evaluation (so the total number of samples is twice this value). If possible, set this value to at least 100.
The web app application serves these routes:

- [GET] / (displays an "alive message")
- [GET] /data/anonymizer/training (serves X_train and y_train)
- [GET] /data/anonymizer/tuning (serves X_tune)
- [GET] /data/anonymizer/validation (serves X_val)
- [GET] /data/deanonymizer/members (serves a subset of X_train and y_train)
- [GET] /data/deanonymizer/non-members (serves a subset of X_val and y_val)
- [GET] /data/attack-success-evaluation (serves data from the complement of members and also from the complement of non-members)
- [POST] /utility/anonymizer (evaluates the quality of an anonymizer's prediction compared to y_tune or y_val)
- [POST] /utility/deanonymizer (evaluates the quality of a deanonymizer's prediction compared to the attack-success-evaluation data)
Dataset Anonymization Challenge with ??? Threat Model
TODO
Dataset Synthesis Challenge with ??? Threat Model
TODO
How To Create Challenge Web Servers Without Template
In case your challenge is not covered by one of the available templates, we suggest that you also use the Falcon web framework and make use of at least some of the available resource building blocks. Besides that, you should pay attention to the following principles when implementing your challenge:
- Challenges and submissions will not get any internet access when running on the AnoMed platform. Make your challenge self-contained.
- Platform users will not get access to challenge data – only submission containers (but not their creators) may access it. That means submission contributors have to create their model blueprints »blindly« and cannot inspect the data while tuning hyperparameters. To make life a little easier, we suggest providing dummy data of the same type and shape as the challenge data – but with innocuous content – outside of the platform. You may post a hyperlink to it in your challenge description.
- Explain your API well in the challenge description, so that custom submissions can easily conform to it. Template anonymizers and template deanonymizers are likely incompatible with your custom challenge.
- Provide a default route / which, upon a GET request, returns a JSON-encoded message like "Challenge server is alive!" for diagnostic purposes.
- Challenge data used to fit and evaluate anonymizers or deanonymizers should be the same for each submission, to allow for a fair comparison.
- Evaluation data should be disjoint from training data.
- Utility and privacy metrics should be floating point scalars, to allow for plotting and ranking. Vectors or even more complex statistics are not suitable for that. Also, metrics should be clearly defined and fixed before the first submission comes in, and not be changed retroactively.
- There should be a way to obtain intermediate evaluation results for hyperparameter tuning (e.g. with respect to tuning data, if there is any). The final evaluation, however, should be accessed only once by each submission, to limit validation data leakage; further requests should be rejected.
- Try to find a good compromise between required network capacity and ease of use when sending data to submissions over the web. For example, sending raw NumPy arrays, or even plain JSON, over the wire uses no compression and usually requires a lot of network capacity. Compressed files, on the other hand, might require further processing in downstream tasks. In the supervised learning scenario above, for example, we used compressed streams of NumPy arrays plus utility functions to make working with them comfortable.
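The compressed-stream approach mentioned in the last point can be sketched with plain NumPy. The helper names below are hypothetical; the library's actual serialization utilities may differ:

```python
import io

import numpy as np


def arrays_to_bytes(**arrays: np.ndarray) -> bytes:
    """Serialize named NumPy arrays into a compressed in-memory stream,
    suitable as an HTTP response body."""
    buffer = io.BytesIO()
    np.savez_compressed(buffer, **arrays)
    return buffer.getvalue()


def bytes_to_arrays(payload: bytes) -> dict[str, np.ndarray]:
    """Deserialize a compressed stream back into named NumPy arrays."""
    with np.load(io.BytesIO(payload)) as npz:
        return {name: npz[name] for name in npz.files}


# Round trip: what a challenge server might send and a submission decode.
payload = arrays_to_bytes(X=np.arange(6).reshape(2, 3), y=np.array([0, 1]))
restored = bytes_to_arrays(payload)
```

Compression keeps the payload small for large arrays, while the two helpers hide the encoding details from downstream code.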