Papyrus-scripts
Collection of scripts to interact with the Papyrus bioactivity dataset.
Associated Article: 10.1186/s13321-022-00672-x
Béquignon OJM, Bongers BJ, Jespers W, IJzerman AP, van de Water B, van Westen GJP.
Papyrus - A large scale curated dataset aimed at bioactivity predictions.
J Cheminform 15, 3 (2023). https://doi.org/10.1186/s13321-022-00672-x
Associated Preprint: 10.33774/chemrxiv-2021-1rxhk
Béquignon OJM, Bongers BJ, Jespers W, IJzerman AP, van de Water B, van Westen GJP.
Papyrus - A large scale curated dataset aimed at bioactivity predictions.
ChemRxiv. Cambridge: Cambridge Open Engage; 2021.
This content is a preprint and has not been peer-reviewed.
Installation
pip install papyrus-scripts
:warning: If pip gives the following error, which subsequently results in import errors:
Defaulting to user installation because normal site-packages is not writeable
Then uninstall and reinstall the library with the following commands:
pip uninstall -y papyrus-scripts
python -m pip install papyrus-scripts
Additional dependencies can be installed to allow:
- similarity and substructure searches:
conda install FPSim2 openbabel h5py cupy -c conda-forge
- training DNN models:
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
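To confirm that these optional dependencies can actually be imported, a minimal check along the following lines can be used (the import names are assumptions derived from the conda package names above; adjust them to your environment):
import importlib.util

# Import names assumed from the conda packages listed above.
optional_modules = ('FPSim2', 'openbabel', 'h5py', 'cupy', 'torch')
for module in optional_modules:
    status = 'found' if importlib.util.find_spec(module) is not None else 'missing'
    print(f'{module}: {status}')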
Getting started
The new application programming interface (API)
This new object-oriented API is available since version 2.0.0.
It allows for easier filtering of the Papyrus data and ensures that any data being queried is downloaded.
from papyrus_scripts import PapyrusDataset
data = (PapyrusDataset(version='05.7', plusplus=True)  # Downloads the data if needed
        .keep_source(['chembl', 'sharma'])             # Keep specific sources
        .keep_quality('high')
        .proteins()                                    # Get the corresponding protein targets
        )
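If the protein targets are not needed, the final .proteins() call can simply be omitted; the chained object then presumably keeps the filtered bioactivity records themselves (this variant only reuses the calls shown above):
from papyrus_scripts import PapyrusDataset

# Same filters as above, but without .proteins(): the chained object keeps
# the filtered bioactivity records instead of the matching protein targets.
bioactivities = (PapyrusDataset(version='05.7', plusplus=True)
                 .keep_source(['chembl', 'sharma'])
                 .keep_quality('high')
                 )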
Functional API (legacy)
The functional API requires the data to be downloaded beforehand.
One can download the dataset either with the functional API itself or with the command line interface (CLI).
Downloading with the command line interface (CLI)
The following command will download the Papyrus++ bioactivities and protein targets (high-quality Ki and KD data as well as IC50 and EC50 of reproducible assays) for the latest version.
papyrus download -V latest
The following command will download the entire set of high-, medium-, and low-quality bioactivities and protein targets, along with all precomputed molecular and protein descriptors, for version 05.5.
papyrus download -V 05.5 --more -d all
The following command will download Papyrus++ bioactivities, protein targets and compound structures for both version 05.4 and 05.5.
papyrus download -V 05.5 -V 05.4 -S
More options can be found using
papyrus download --help
By default, the data is downloaded to pystow's default directory.
One can override the folder path by specifying the -o switch in the above commands.
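For instance, to download the latest Papyrus++ data into a custom folder (the path below is purely illustrative):
papyrus download -V latest -o /path/to/papyrus_data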
Downloading with the functional API
from papyrus_scripts import download_papyrus
# Download the latest version of the entire dataset with all precomputed descriptors
download_papyrus(version='latest', only_pp=False, structures=True, descriptors='all')
Querying with the functional API
The query detailed above using the object-oriented API is reproduced below using the functional API.
from papyrus_scripts import (read_papyrus, read_protein_set,
keep_quality, keep_source, keep_type,
keep_organism, keep_accession, keep_protein_class,
keep_match, keep_contains,
consume_chunks)
chunk_reader = read_papyrus(version='05.7', plusplus=True, is3d=False, chunksize=1_000_000)
protein_data = read_protein_set(version='05.7')
filter1 = keep_source(data=chunk_reader, source=['chembl', 'sharma'])
filter2 = keep_quality(data=filter1, min_quality='high')
data = consume_chunks(filter2, progress=False)
protein_data = protein_data.set_index('target_id').loc[data.target_id.unique()].reset_index()
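At this point data and protein_data are handled as pandas DataFrames (as the indexing above already relies on), so they can be inspected or exported with standard pandas calls; the file names below are only illustrative:
data.to_csv('papyrus_high_quality_bioactivities.tsv', sep='\t', index=False)
protein_data.to_csv('papyrus_matched_protein_targets.tsv', sep='\t', index=False)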
Versions of the Papyrus dataset
Different online servers host the Papyrus data based on release and ChEMBL version (table below).
Papyrus version | ChEMBL version | Zenodo | 4TU |
---|---|---|---|
05.4 | 29 | :heavy_check_mark: | :heavy_check_mark: |
05.5 | 30 | :heavy_check_mark: | :x: |
05.6 | 31 | :heavy_check_mark: | :x: |
05.7 | 34 | :heavy_check_mark: | :x: |
Precomputed molecular and protein descriptors, along with molecular structures (2D for the default set and 3D for the low-quality set with stereochemistry), are not available for version 05.4 from 4TU but are available from Google Drive.
As stated in the preprint, we strongly encourage the use of the dataset in which stereochemistry was not considered. This corresponds to the files containing the mention "2D" and/or "without_stereochemistry".
Interconversion of the compressed files
The available LZMA-compressed files (.xz) may not be supported by some software (e.g. Pipeline Pilot).
Decompressing the data is strongly discouraged!
Though Gzip files were made available at 4TU for version 05.4, we now provide a CLI option to locally interconvert from LZMA to Gzip and vice-versa.
To convert from LZMA to Gzip (or vice-versa) use the following command:
papyrus convert -v latest
Removal of the data
One can remove the Papyrus data using either the CLI or the API.
The following excerpts exemplify the removal of all Papyrus data files, including the utility files of all versions.
papyrus clean --remove_root
from papyrus_scripts import remove_papyrus
remove_papyrus(papyrus_root=True)
Easy handling of the dataset
Once installed, the Papyrus-scripts allow for easy filtering of the data.
- Simple examples can be found in the simple_examples.ipynb notebook.
- An example on matching data with the Protein Data Bank can be found in the simple_examples.ipynb notebook.
- More advanced examples will be added to the advanced_querying.ipynb notebook.
Reproducing results of the pre-print
The scripts used to extract subsets, generate models and obtain visualizations can be found here.
Features to come
- Substructure and similarity molecular searches
- Ability to use DNN models
- Ability to repeat model training over multiple seeds
- y-scrambling
- Adaptation of models to QSPRpred
Examples to come
- Use of custom grouping schemes for training/test set splitting and cross-validation
- Use custom molecular and protein descriptors (either Python function or file on disk)
Logos
Logos can be found under figures/logo. Two versions exist depending on the background used (white or colored).
:warning: GitHub does not render the white logo properly, but this should not deter you from using it!