
Assessing training data privacy for molecular property prediction.

Project description

Training data privacy assessment for molecular property prediction

Python package to assess how much information can be deduced about (confidential) training data from a neural network trained on it for molecular property prediction.

Getting Started

These instructions will help you install the Python package and conduct a privacy assessment on data for molecular property prediction.

Prerequisites

You need an environment with Python 3.12. This can be created, for example, using conda.

$ conda create -n privacy_env python=3.12 
$ conda activate privacy_env 

Installation

The package can be installed from PyPI with pip:

$ pip install molprivacy

Usage

The package can be run via the command line or by importing the privacy_test function into your script.

Command line interface

You can run the privacy test directly from the command line using the privacytest command after installing the package.

$ privacytest --representation REPRESENTATION --result_folder RESULT_FOLDER [options]

Required Arguments

--representation: Specifies the molecular representation to use. Choices are:

  • ECFP4
  • ECFP6
  • MACCS
  • graph
  • rdkit
  • transformer_vector
  • transformer_matrix

--result_folder: Path to the folder where the results will be stored.

Optional Arguments

--dataset: Specifies the dataset to use. Choices are:

  • ames (default)
  • herg
  • del
  • bbb
  • file (use your own dataset; requires --dataset_path)

--dataset_path: Path(s) to your custom dataset file(s). Required if --dataset file is selected. The dataset must have a 'smiles' column and a binary 'label' column.

--split: Split ratios for the training, validation, and testing datasets. Provide three float values that sum to 1.0. The testing dataset is used for the privacy assessment. Default: 0.45 0.1 0.45

--hyperparameter_optimization_time: Time in seconds allocated for hyperparameter optimization during model training. Default: 600

--attack_data_fraction: Fraction of data to use for the RMIA membership inference attack. Reduce this value if RMIA runs out of memory. Default: 1.0
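When --dataset file is selected, the file(s) passed via --dataset_path must contain a 'smiles' column and a binary 'label' column. A minimal valid file can be created like this (the molecules and labels are illustrative placeholders):

```python
import pandas as pd

# Minimal example dataset: a 'smiles' column and a binary 'label' column.
# The molecules and labels below are placeholders, not real assay data.
df = pd.DataFrame(
    {
        "smiles": ["CCO", "c1ccccc1", "CC(=O)O", "CCN"],
        "label": [0, 1, 0, 1],
    }
)
df.to_csv("my_dataset.csv", index=False)
```

The resulting file can then be passed with --dataset file --dataset_path my_dataset.csv.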

Examples

Run with default dataset and parameters:

$ privacytest --representation ECFP4 --result_folder "home/results"

Use a custom dataset:

$ privacytest --representation MACCS --result_folder "home/results" --dataset file --dataset_path "home/data/my_dataset.csv"

Help

For a full list of available options and detailed descriptions, run:

$ privacytest --help

Import into a script

In addition to running the privacy test from the command line, you can use the privacy_test function directly in your Python scripts. This allows for seamless integration into your workflows and lets you customize parameters programmatically.

Import the function

from privacytest.__main__ import privacy_test

Example usage

from privacytest.__main__ import privacy_test
# Import your custom representation function (it must take a SMILES string as
# input and return a tuple of the encoding vector and its dimension)
from mycode import my_custom_representation_function

# Define privacy test parameters
representation = 'custom'
result_folder = 'home/results'
dataset = 'ames'
split = [0.45, 0.1, 0.45]
hyperparameter_optimization_time = 600
attack_data_fraction = 1.0
custom_representation_function = my_custom_representation_function

# Run the privacy test with custom representation
privacy_test(
    representation=representation,
    result_folder=result_folder,
    dataset=dataset,
    split=split,
    hyperparameter_optimization_time=hyperparameter_optimization_time,
    attack_data_fraction=attack_data_fraction,
    custom_representation_function=custom_representation_function
)
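As a sketch of the expected interface, a custom representation function takes a SMILES string and returns a tuple of (encoding vector, vector dimension). The character-count encoding below is a deliberately simple placeholder to illustrate the signature, not a recommended molecular representation:

```python
import numpy as np

def my_custom_representation_function(smiles: str):
    """Toy representation: counts of a fixed set of SMILES characters.

    Takes a SMILES string and returns a tuple of
    (encoding vector, dimension of that vector).
    """
    alphabet = "CNOSPFIBrcl()=#123456789"
    vector = np.array([smiles.count(ch) for ch in alphabet], dtype=np.float32)
    return vector, vector.shape[0]
```

Any function with this input/output contract can be passed as custom_representation_function together with representation='custom'.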

Privacy Test Output

When you run the privacy test, the results are saved in the result_folder you specified. The folder structure contains the following files and directories:

privacy/

  • results/: This folder contains the outputs related to privacy performance for both the LiRA and RMIA privacy attacks.
    • lira/:

      • privacy_performance_overview.txt: A text file summarizing the results of the LiRA privacy attack.
      • privacy_performance.pdf: ROC curve for the LiRA attack (positives are training data samples).
      • ROC.csv: CSV file containing the ROC curve data.
    • rmia/:

      • privacy_performance_overview.txt: A text file summarizing the results of the RMIA privacy attack.
      • privacy_performance.pdf: ROC curve for the RMIA attack.
      • ROC.csv: CSV file containing the ROC curve data.
    • true_positives_at_FPR0/:

      • lira.csv: CSV file that contains all the chemical structures that LiRA could identify at a False Positive Rate of 0.
      • rmia.csv: CSV file that contains all the chemical structures that RMIA could identify at a False Positive Rate of 0.

model/

  • model_performance.pdf: ROC curve of the model performance in the binary classification task.
  • model.ckpt: The checkpoint file for the final model that was trained with the optimized hyperparameters.
  • optimization.db: Database containing the optimization history during hyperparameter tuning.
  • optimized_hyperparameters.yaml: A YAML file that contains the optimized hyperparameters after the search process.

data_dir/

  • train.csv, validation.csv, test.csv: These files contain the datasets used for training, validation, and testing. They will vary between runs of the package, since the split is random.

model_config.yaml

  • A configuration file that summarizes all model-related configurations.

privacy_config.yaml

  • A configuration file that summarizes all configurations related to the membership inference attacks.
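The ROC.csv files in the results/ folders can be post-processed with a few lines of Python, for example to read off the true positive rate at a very low false positive rate. The snippet below writes a small synthetic ROC.csv so it runs standalone; replace the path with your actual output, and note that the 'fpr'/'tpr' column names are an assumption here, so check the header of your real files:

```python
import numpy as np
import pandas as pd

# Stand-in for privacy/results/lira/ROC.csv; replace with your output path.
# Column names 'fpr'/'tpr' are an assumption -- check the real file's header.
pd.DataFrame({"fpr": [0.0, 0.001, 0.01, 0.1, 1.0],
              "tpr": [0.002, 0.01, 0.05, 0.3, 1.0]}).to_csv("ROC.csv", index=False)

roc = pd.read_csv("ROC.csv")
fpr = roc["fpr"].to_numpy()
tpr = roc["tpr"].to_numpy()

# Highest TPR achieved at or below a 0.1% false positive rate.
tpr_at_low_fpr = roc.loc[roc["fpr"] <= 0.001, "tpr"].max()
# Trapezoidal-rule AUC over the full curve.
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
print(f"TPR@FPR<=0.001: {tpr_at_low_fpr:.3f}, AUC: {auc:.3f}")
```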

Citation

This repository is part of the paper "Publishing neural networks in drug discovery compromises training data privacy".
Pre-print: TODO
BibTeX: TODO

Download files

Download the file for your platform.

Source Distribution

molprivacy-0.1.0.tar.gz (22.4 MB)

Uploaded Source

Built Distribution


molprivacy-0.1.0-py3-none-any.whl (22.9 MB)

Uploaded Python 3

File details

Details for the file molprivacy-0.1.0.tar.gz.

File metadata

  • Download URL: molprivacy-0.1.0.tar.gz
  • Upload date:
  • Size: 22.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for molprivacy-0.1.0.tar.gz

  • SHA256: 1b8cdc29b6e81e2309daaf8f6dc9b11f96a7ecfdc5a58a83830e0b2bab109429
  • MD5: 8d33aa9f34b760409c43abcaedc2a155
  • BLAKE2b-256: de1a444d3c9279dcb6fae5b0915360a9d965d1d6315e77aa6980a10555bc9f98


File details

Details for the file molprivacy-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: molprivacy-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 22.9 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for molprivacy-0.1.0-py3-none-any.whl

  • SHA256: 7aa4f4d80f41a66a2b5619afc5cbf87ed38a3e248d3be4e08509e90e96867b04
  • MD5: 32a0aeb51291fd1b93d14a084a5b03da
  • BLAKE2b-256: 192b8c340effd3822730bb11e6419b590af3e6d2a47f7433fdf170975d8adcb4

