A collection of scripts to handle the Papyrus bioactivity dataset

Papyrus-scripts

Collection of scripts to interact with the Papyrus bioactivity dataset.


Associated Article: 10.1186/s13321-022-00672-x

Béquignon OJM, Bongers BJ, Jespers W, IJzerman AP, van de Water B, van Westen GJP.
Papyrus - A large scale curated dataset aimed at bioactivity predictions.
J Cheminform 15, 3 (2023). https://doi.org/10.1186/s13321-022-00672-x

Associated Preprint: 10.33774/chemrxiv-2021-1rxhk

Béquignon OJM, Bongers BJ, Jespers W, IJzerman AP, van de Water B, van Westen GJP.
Papyrus - A large scale curated dataset aimed at bioactivity predictions.
ChemRxiv. Cambridge: Cambridge Open Engage; 2021;
This content is a preprint and has not been peer-reviewed.

Installation

pip install papyrus-scripts

:warning: If pip prints the following message and importing the library subsequently fails

Defaulting to user installation because normal site-packages is not writeable

then uninstall and reinstall the library with the following commands:

pip uninstall -y papyrus-scripts
python -m pip install papyrus-scripts
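
Import errors after such a message may indicate that the package was installed for a different interpreter than the one being run. As a quick check, the sketch below uses only the standard library to report the active interpreter and whether the package can be located from it. The helper name `is_importable` is hypothetical and not part of Papyrus-scripts.

```python
import importlib.util
import sys


def is_importable(module_name: str) -> bool:
    """Return True if the running interpreter can locate a top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # The interpreter printed here is the one `python -m pip` installs into.
    print("Interpreter:", sys.executable)
    print("papyrus_scripts importable:", is_importable("papyrus_scripts"))
```

If the second line prints `False`, rerunning the reinstall commands above with the same `python` executable is a reasonable next step.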

Additional dependencies can be installed to allow:

  • similarity and substructure searches

    conda install FPSim2 openbabel h5py cupy -c conda-forge
    
  • training DNN models:

    conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
    

Download the dataset

The Papyrus data can be downloaded in three different ways.
The use of the command line interface is strongly recommended to download the data.

- Using the command line interface (CLI)

Once the library is installed (see Installation), one can easily download the data.

  • The following command will download the Papyrus++ bioactivities and protein targets (high-quality Ki and KD data as well as IC50 and EC50 of reproducible assays) for the latest version.
papyrus download -V latest
  • The following command will download the entire set of high-, medium-, and low-quality bioactivities and protein targets along with all precomputed molecular and protein descriptors for version 05.5.
papyrus download -V 05.5 --more -d all
  • The following command will download Papyrus++ bioactivities, protein targets and compound structures for both version 05.4 and 05.5.
papyrus download -V 05.5 -V 05.4 -S 

More options can be found using

papyrus download --help 

By default, the data is downloaded to pystow's default directory.
One can override the folder path by specifying the -o switch in the above commands.
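
To see where the data is likely to land when -o is not given, the sketch below approximates pystow's base directory without importing pystow. It rests on assumptions: that pystow uses the `PYSTOW_HOME` environment variable when set, and otherwise a folder in the home directory named by `PYSTOW_NAME` (default `.data`). It ignores pystow's other settings, and the `papyrus` sub-folder name is a guess for illustration only.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def default_pystow_home(env: Optional[Mapping[str, str]] = None) -> Path:
    """Approximate pystow's base directory (assumed rules, see lead-in)."""
    env = os.environ if env is None else env
    home = env.get("PYSTOW_HOME")
    if home:
        # Assumption: an explicit PYSTOW_HOME takes precedence.
        return Path(home).expanduser()
    # Assumption: otherwise a folder in the user's home directory is used.
    return Path.home() / env.get("PYSTOW_NAME", ".data")


if __name__ == "__main__":
    # "papyrus" as the sub-folder name is hypothetical.
    print(default_pystow_home() / "papyrus")
```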

- Using the application programming interface (API)

from papyrus_scripts import download_papyrus

# Download the latest version of the entire dataset with all precomputed descriptors
download_papyrus(version='latest', only_pp=False, structures=True, descriptors='all')

- Directly from online archives

Different online servers host the Papyrus data based on release and ChEMBL version (table below).

| Papyrus version | ChEMBL version | Zenodo | 4TU | Google Drive |
|---|---|---|---|---|
| 05.4 | 29 | :x: | :heavy_check_mark: | :heavy_check_mark: |
| 05.5 | 30 | :heavy_check_mark: | :x: | :heavy_check_mark: |
| 05.6 | 31 | :heavy_check_mark: | :x: | :x: |

Precomputed molecular and protein descriptors along with molecular structures (2D for default set and 3D for low quality set with stereochemistry) are not available for version 05.4 from 4TU but are from Google Drive.

As stated in the pre-print, we strongly encourage the use of the dataset in which stereochemistry was not considered. This corresponds to files whose names contain "2D" and/or "without_stereochemistry".
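
When files are fetched by hand from an archive, the naming rule above can be applied programmatically. The following is a minimal sketch; the function name `without_stereo` and the file names in the usage example are hypothetical and do not reproduce the actual archive contents.

```python
from typing import Iterable, List

# Markers named in the text for files in which stereochemistry is not considered.
NO_STEREO_MARKERS = ("2D", "without_stereochemistry")


def without_stereo(filenames: Iterable[str]) -> List[str]:
    """Keep only the file names that contain one of the markers."""
    return [
        name
        for name in filenames
        if any(marker in name for marker in NO_STEREO_MARKERS)
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    candidates = [
        "example_2D_set.tsv.xz",
        "example_3D_set_with_stereochemistry.tsv.xz",
        "example_set_without_stereochemistry.tsv.xz",
    ]
    print(without_stereo(candidates))
```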

Interconversion of the compressed files

The available LZMA-compressed files (.xz) may not be supported by some software (e.g. Pipeline Pilot).
Decompressing the data is strongly discouraged!
Though Gzip files were made available at 4TU for version 05.4, we now provide a CLI option to locally interconvert from LZMA to Gzip and vice-versa.

To convert from LZMA to Gzip (or vice-versa) use the following command:

papyrus convert -v latest 
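
For readers who want to see what such an interconversion involves, here is a minimal standard-library sketch that re-compresses an LZMA file as Gzip by streaming, so no uncompressed copy is written to disk. It is not how `papyrus convert` is implemented; the function name `xz_to_gzip` and the paths in the usage example are hypothetical.

```python
import gzip
import lzma
import shutil
from pathlib import Path


def xz_to_gzip(src: Path, dst: Path, chunk_size: int = 1024 * 1024) -> None:
    """Stream-decompress `src` (.xz) and re-compress it into `dst` (.gz)."""
    with lzma.open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        # Copy in fixed-size chunks to keep memory use bounded.
        shutil.copyfileobj(fin, fout, chunk_size)


if __name__ == "__main__":
    # Hypothetical file name; replace with a real downloaded file.
    source = Path("example.tsv.xz")
    if source.exists():
        xz_to_gzip(source, source.with_suffix(".gz"))
```

Swapping the two `open` calls gives the reverse direction, from Gzip to LZMA.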

Removal of the data

One can remove the Papyrus data using either the CLI or the API.

The following excerpts exemplify the removal of all Papyrus data files, including the utility files of all versions.

papyrus clean --remove_root

from papyrus_scripts import remove_papyrus

remove_papyrus(papyrus_root=True)

Easy handling of the dataset

Once installed, Papyrus-scripts allows easy filtering of the data.
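
To give a sense of what filtering involves, the sketch below streams a tab-separated LZMA file with the standard library and yields only the rows matching a value in one column, without decompressing the file to disk. This is not the library's own filtering interface. The function name `iter_rows` is hypothetical, and the column names and values in the usage example (`Quality`, `High`) are assumptions about the data layout.

```python
import csv
import lzma
from typing import Dict, Iterator


def iter_rows(path: str, column: str, value: str) -> Iterator[Dict[str, str]]:
    """Yield rows of a tab-separated .xz file whose `column` equals `value`."""
    with lzma.open(path, "rt", encoding="utf-8", newline="") as handle:
        # Rows are read lazily, so the whole file is never held in memory.
        for row in csv.DictReader(handle, delimiter="\t"):
            if row.get(column) == value:
                yield row


if __name__ == "__main__":
    import os

    # Hypothetical file name, column name, and value, for illustration only.
    example = "example.tsv.xz"
    if os.path.exists(example):
        for record in iter_rows(example, "Quality", "High"):
            print(record)
            break
```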

Reproducing results of the pre-print

The scripts used to extract subsets, generate models and obtain visualizations can be found here.

Features to come

  • Substructure and similarity molecular searches
  • ability to use DNN models
  • ability to repeat model training over multiple seeds
  • y-scrambling
  • adapt models to QSPRpred

Examples to come

  • Use of custom grouping schemes for training/test set splitting and cross-validation
  • Use custom molecular and protein descriptors (either Python function or file on disk)

Logos

Logos can be found under figures/logo. Two versions exist, depending on the background used.

:warning: GitHub does not render the white logo properly in the table below, but this should not deter you from using it!

On white background On colored background

