A rich molecule dataset for Blood-Brain Barrier (BBB) permeability.
About B3DB
In this repo, we present a large benchmark dataset, Blood-Brain Barrier Database (B3DB), compiled from 50 published resources (as summarized in raw_data/raw_data_summary.tsv) and categorized based on the consistency between different experimental references/measurements. This dataset was published in Scientific Data, and this repository is occasionally updated with new experimental data. Scientists who would like to contribute data should contact the database's maintainers (e.g., by creating a new Issue in this repository).
A subset of the molecules in B3DB has numerical logBB values (1058 compounds), while the whole dataset has categorical (BBB+ or BBB-) BBB permeability labels (7807 compounds). Some physicochemical properties of the molecules are also provided.
Citation
Please use the following citation in any publication using our B3DB dataset:
@article{Meng_A_curated_diverse_2021,
author = {Meng, Fanwang and Xi, Yang and Huang, Jinfeng and Ayers, Paul W.},
doi = {10.1038/s41597-021-01069-5},
journal = {Scientific Data},
number = {289},
title = {A curated diverse molecular database of blood-brain barrier permeability with chemical descriptors},
volume = {8},
year = {2021},
url = {https://www.nature.com/articles/s41597-021-01069-5},
publisher = {Springer Nature}
}
Features of B3DB
- The largest dataset with numerical and categorical values for Blood-Brain Barrier small molecules (to the best of our knowledge, as of February 25, 2021).
- Inclusion of stereochemistry information with isomeric SMILES with chiral specifications if available. Otherwise, canonical SMILES are used.
- Characterization of uncertainty of experimental measurements by grouping the collected molecular data records.
- Extended datasets for numerical and categorical data with precomputed physicochemical properties using mordred.
Usage
B3DB contains two types of datasets, regression data and classification data, and both can be loaded with pandas. For example:
import pandas as pd
# load regression dataset
regression_data = pd.read_csv("B3DB/B3DB_regression.tsv",
sep="\t")
# load classification dataset
classification_data = pd.read_csv("B3DB/B3DB_classification.tsv",
sep="\t")
# load extended regression dataset
regression_data_extended = pd.read_csv("B3DB/B3DB_regression_extended.tsv.gz",
sep="\t", compression="gzip")
# load extended classification dataset
classification_data_extended = pd.read_csv("B3DB/B3DB_classification_extended.tsv.gz",
sep="\t", compression="gzip")
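Once loaded, the tables are ordinary pandas DataFrames. The sketch below shows one possible way to count the categorical labels and summarize the logBB values. The column names `BBB+/BBB-` and `logBB` are assumptions about the TSV headers and should be checked against `classification_data.columns`. The helper names are hypothetical, and the small frame is made-up illustrative data, not B3DB records.

```python
import pandas as pd


def label_counts(df, label_col="BBB+/BBB-"):
    """Return the number of rows per permeability label as a plain dict."""
    # label_col is an assumed header name; adjust it to match the file.
    return df[label_col].value_counts().to_dict()


def logbb_summary(df, value_col="logBB"):
    """Return min, max and mean of the numerical logBB column."""
    # Coerce to numbers and drop anything that cannot be parsed.
    values = pd.to_numeric(df[value_col], errors="coerce").dropna()
    return {"min": values.min(), "max": values.max(), "mean": values.mean()}


# Made-up rows for illustration only (not real B3DB records).
toy = pd.DataFrame(
    {
        "SMILES": ["CCO", "CCN", "CCC", "CCCl"],
        "logBB": [0.5, -1.5, 0.0, 1.0],
        "BBB+/BBB-": ["BBB+", "BBB-", "BBB+", "BBB+"],
    }
)

print(label_counts(toy))
print(logbb_summary(toy))
```

The same two helpers could be applied to `classification_data` and `regression_data` after confirming the header names.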
We also provide three example notebooks that show how to use the dataset: numerical_data_analysis.ipynb, PCA_projection_fingerprint.ipynb and PCA_projection_descriptors.ipynb. PCA_projection_descriptors.ipynb uses precomputed chemical descriptors to visualize the chemical space of B3DB and can be run directly in MyBinder. Because RDKit is difficult to install in MyBinder, only PCA_projection_descriptors.ipynb is set up there.
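For readers who want a feel for what a PCA projection of descriptors involves, here is a minimal sketch using only numpy. It is not the code from the notebooks: the function name `pca_project` is hypothetical, the input is a random stand-in matrix rather than B3DB descriptors, and columns are only centered, not scaled.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project rows of X onto the first principal components.

    X is an (n_samples, n_features) array of numeric descriptor values.
    """
    X = np.asarray(X, dtype=float)
    # Center each descriptor column; scaling is left out for brevity.
    Xc = X - X.mean(axis=0)
    # SVD of the centered matrix: rows of Vt are the principal axes.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T
    # Fraction of total variance captured by each component.
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components]


# Random stand-in for a descriptor matrix (not B3DB data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5))
scores, ratio = pca_project(X_demo)
print(scores.shape, ratio)
```

With the extended datasets, the descriptor columns would first need to be selected and cleaned (numeric only, no missing values) before being passed to a function like this.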
Setting up the working environment
All the calculations were performed in a Python 3.7.9 virtual environment created with Conda on CentOS Linux release 7.9.2009. The Conda environment includes the following Python packages:
- ChEMBL_Structure_Pipeline==1.0.0, https://github.com/chembl/ChEMBL_Structure_Pipeline/
- RDKit==2020.09.1, https://www.rdkit.org/
- openeye-toolkits==2020.2.0, https://docs.eyesopen.com/toolkits/python/index.html/
- mordred==1.1.2, https://github.com/mordred-descriptor/mordred/ (requires networkx==2.3.0)
- numpy==1.19.2, https://numpy.org/
- pandas==1.2.1, https://pandas.pydata.org/
- pubchempy==1.0.4, https://github.com/mcs07/PubChemPy/
- PyTDC==0.1.5, https://github.com/mims-harvard/TDC/
- SciPy==1.10.0, https://www.scipy.org/
- tabula-py==2.2.0, https://pypi.org/project/tabula-py/
To create a virtual environment named bbb_py37 with Python 3.7.9 to this specification, first run
conda create -n bbb_py37 python=3.7.9
Because RDKit and ChEMBL_Structure_Pipeline are not available in PyPI, we install them with conda:
# activate a virtual environment
conda activate bbb_py37
conda install -c rdkit rdkit=2020.09.1.0
conda install -c conda-forge chembl_structure_pipeline=1.0.0
# https://docs.eyesopen.com/toolkits/python/quickstart-python/linuxosx.html
conda install -c openeye openeye-toolkits=2020.2.0
Then we can install the requirements in requirements.txt with
pip install -r requirements.txt
An easier way is to run the following script with bash:
#!/bin/bash
# create virtual environment
conda create -n bbb_py37 python=3.7.9
# activate virtual environment
conda activate bbb_py37
# install required packages
conda install -c rdkit rdkit=2020.09.1.0
conda install -c conda-forge chembl_structure_pipeline=1.0.0
# https://docs.eyesopen.com/toolkits/python/quickstart-python/linuxosx.html
conda install -c openeye openeye-toolkits=2020.2.0
pip install -r requirements.txt
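After installation, it can be useful to confirm that the environment matches the pinned versions. The following is a sketch under stated assumptions, not part of the B3DB repository: the helper names are hypothetical, and the lookup relies on `importlib.metadata`, which is in the standard library from Python 3.8 (on Python 3.7 the `importlib_metadata` backport would be needed). Packages installed through conda may register under a different distribution name, so a reported mismatch should be checked by hand.

```python
try:
    from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
except ImportError:  # Python 3.7: needs the importlib_metadata backport
    from importlib_metadata import PackageNotFoundError, version


def parse_pins(lines):
    """Turn 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" in line:
            name, pinned = line.split("==", 1)
            pins[name.strip()] = pinned.strip()
    return pins


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {name: (pinned, found)} for packages that differ from the pin."""
    mismatches = {}
    for name, pinned in pins.items():
        found = lookup(name)
        if found != pinned:
            mismatches[name] = (pinned, found)
    return mismatches


# Two pins copied from the package list above, used as sample input.
pins = parse_pins(["numpy==1.19.2", "pandas==1.2.1  # data loading"])
print(find_mismatches(pins))
```

An empty dict means every pinned package was found at the expected version. The `lookup` argument is injectable so the comparison logic can be exercised without touching the real environment.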
ALOGPS version 2.1 can be accessed at http://www.vcclab.org/lab/alogps/.
The materials and data under this repo are distributed under the CC0 Licence.