A common data access layer for AI-driven drug discovery.

📊 AionData

AionData is a common data access layer designed for AI-driven drug discovery software. It provides a unified interface to access diverse biochemical databases.
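As a rough illustration of what a "unified interface" can mean, the sketch below defines one small protocol that every dataset wrapper satisfies, so downstream code does not care which database the records came from. The names `Dataset`, `InMemoryDataset`, and `count_records` are hypothetical and exist only for this example; they are not AionData's actual API.

```python
from typing import Dict, Iterator, List, Protocol


class Dataset(Protocol):
    """Hypothetical common interface (an assumption, not AionData's API)."""

    name: str

    def records(self) -> Iterator[Dict[str, object]]:
        """Yield one dictionary per record."""
        ...


class InMemoryDataset:
    """Toy dataset backed by a list, standing in for a real database wrapper."""

    def __init__(self, name: str, rows: List[Dict[str, object]]) -> None:
        self.name = name
        self._rows = rows

    def records(self) -> Iterator[Dict[str, object]]:
        yield from self._rows


def count_records(dataset: Dataset) -> int:
    """Works with any object that satisfies the Dataset protocol."""
    return sum(1 for _ in dataset.records())


if __name__ == "__main__":
    toy = InMemoryDataset("toy", [{"smiles": "CCO"}, {"smiles": "c1ccccc1"}])
    print(toy.name, count_records(toy))
```

Because `count_records` depends only on the protocol, swapping in a different backing source would not require changing the calling code.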

Installation

AionData requires Python 3.10 or newer. Install it with pip:

pip install aiondata
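To confirm the installation, one option is to check the interpreter version and look up the installed distribution with the standard library. This is a minimal sketch; the helper names `python_is_supported` and `installed_version` are made up for this example, and only `importlib.metadata` from the standard library is used.

```python
import sys
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Sequence


def python_is_supported(version_info: Sequence[int], minimum=(3, 10)) -> bool:
    """Return True if the interpreter meets the stated minimum (Python 3.10)."""
    return tuple(version_info[:2]) >= tuple(minimum)


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python OK:", python_is_supported(sys.version_info))
    found = installed_version("aiondata")
    print(f"aiondata {found}" if found else "aiondata is not installed")
```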

Datasets

AionData provides access to the following datasets:

  • BindingDB: A public, web-accessible database of measured binding affinities, focusing chiefly on interactions between proteins considered to be drug targets and small, drug-like molecules.

  • UniProt (Universal Protein Resource): A comprehensive, high-quality, and freely accessible resource of protein sequence and functional information, including the manually annotated and reviewed UniProtKB/Swiss-Prot dataset.

  • ZINC: A free database of commercially available compounds for virtual screening.

  • MoleculeNet: An extensive collection of datasets curated to support and benchmark machine learning models for drug discovery and cheminformatics. It covers a broad spectrum of molecular data, including quantum mechanical properties, physical chemistry, biophysics, and physiological effects.

    • Tox21: Features qualitative toxicity measurements for 12,000 compounds across 12 targets, used for toxicity prediction.
    • ToxCast: A large-scale dataset for toxicity prediction, which includes over 600 experiments across 185 assays.
    • ESOL: Contains water solubility data for 1,128 compounds, aiding in solubility prediction models.
    • FreeSolv: Provides experimental and calculated hydration free energy for small molecules, crucial for understanding solvation.
    • Lipophilicity: Includes experimental measurements of octanol/water distribution coefficients (logD) for 4,200 compounds.
    • QM7: A dataset of 7,165 molecules with quantum mechanical properties computed using density functional theory (DFT).
    • QM8: Features electronic spectra and excited state energies of over 20,000 small molecules computed with TD-DFT.
    • QM9: Offers geometric, energetic, electronic, and thermodynamic properties of ~134k molecules computed with DFT.
    • MUV: Datasets designed for the validation of virtual screening techniques, with about 93,000 compounds.
    • HIV: Contains data on the ability of compounds to inhibit HIV replication, for binary classification tasks.
    • BACE: Includes quantitative binding results for inhibitors of human beta-secretase 1, with both classification and regression tasks.
    • BBBP: Features compounds with information on permeability properties across the Blood-Brain Barrier.
    • SIDER: Contains information on marketed medicines and their recorded adverse drug reactions, for side effects prediction.
    • ClinTox: Compares drugs approved by the FDA and those that failed clinical trials for toxicity reasons, for binary classification and toxicity prediction.
  • PDB (Protein Data Bank): A comprehensive, publicly available repository of 3D structural data for biological molecules. It includes atomic coordinates for biological macromolecules and complex assemblies, which are essential for understanding molecular function and designing pharmaceuticals.

  • Foldswitch Proteins: Datasets from the paper "AlphaFold2 fails to predict protein fold switching", describing fold-switching proteins. These datasets provide insights into the structural dynamics and functional versatility of proteins, highlighting cases where AlphaFold2's predictive capabilities are challenged.

    • Table S1A: Lists pairs of proteins (PDBIDs), their lengths, and the sequence of the fold-switching region. For some pairs, only the first fold's PDBID is available if the second fold has not been solved.
    • Table S1B: Offers RMSD and TM-scores for the whole protein and the fold-switching fragment specifically, along with sequence identities between the fold-switching pairs.
    • Table S1C: Provides a list of fold-switching protein pairs (PDBID and chain) used for analysis, including TM-scores of the predictions.
  • CodNas91: A dataset curated from the paper "Impact of protein conformational diversity on AlphaFold predictions", featuring 91 proteins with varying degrees of conformational diversity. This dataset focuses on apo–holo pairs selected for their significant structural changes associated with biological processes.

  • Weizmann 3CA: The Curated Cancer Cell Atlas from the Weizmann Institute of Science, a collection of annotated and analyzed cancer scRNA-seq datasets.
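Table S1B of the Foldswitch datasets reports RMSD values between fold-switching pairs. For readers unfamiliar with the metric, here is a minimal sketch of the coordinate RMSD between two equal-length sets of 3D points. It assumes the two structures are already superposed and matched atom by atom; it does not perform optimal superposition and does not compute TM-scores. The function name `rmsd` is chosen for this example and is not part of AionData.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(a: Sequence[Coord], b: Sequence[Coord]) -> float:
    """Root-mean-square deviation between two matched coordinate lists.

    Assumes the structures are already aligned; no superposition is done.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
        for (x1, y1, z1), (x2, y2, z2) in zip(a, b)
    )
    return math.sqrt(squared / len(a))


if __name__ == "__main__":
    fold_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
    fold_b = [(0.0, 0.0, 0.0), (1.5, 2.0, 0.0)]
    print(f"RMSD: {rmsd(fold_a, fold_b):.3f}")
```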

License

AionData is licensed under the Apache License. See the LICENSE file for more details.

Download files

Source Distribution

aiondata-0.7.1.tar.gz (22.4 kB)

Built Distribution

aiondata-0.7.1-py3-none-any.whl (23.8 kB)

File details

Details for the file aiondata-0.7.1.tar.gz.

File metadata

  • Download URL: aiondata-0.7.1.tar.gz
  • Size: 22.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.4 CPython/3.12.7 Linux/6.5.0-1025-azure

File hashes

Hashes for aiondata-0.7.1.tar.gz
Algorithm Hash digest
SHA256 1b52816d5092967320f520964917a8fea56897f95e9f4904406f81f0cbc47aef
MD5 893e7379474fc2b89e9ddbcd6927b8e7
BLAKE2b-256 ef537267371d73f5ec8bcd5dfd3062598cd3776b7b8826cb81b4b6e0b77c01ca

File details

Details for the file aiondata-0.7.1-py3-none-any.whl.

File metadata

  • Download URL: aiondata-0.7.1-py3-none-any.whl
  • Size: 23.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.4 CPython/3.12.7 Linux/6.5.0-1025-azure

File hashes

Hashes for aiondata-0.7.1-py3-none-any.whl
Algorithm Hash digest
SHA256 1f38b9545d6450fc6b06f236e3004aaa94ff5da99b573a4a532c746b0d9b7781
MD5 a8a5fa91c55e73b0c816a4dc7a4f51aa
BLAKE2b-256 22f42bc061a1343d5404fbf4e09bcc43728c0559e8af6ac2d3c4bd587b6f47d9
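The SHA256 digests listed above can be used to check a downloaded file before installing it. The following is a minimal sketch using only the standard library; the helper names `sha256_of` and `verify` are chosen for this example, and the file is assumed to have been downloaded to a local path under its original file name.

```python
import hashlib
from pathlib import Path

# SHA256 digests copied from the file details above.
EXPECTED_SHA256 = {
    "aiondata-0.7.1.tar.gz": "1b52816d5092967320f520964917a8fea56897f95e9f4904406f81f0cbc47aef",
    "aiondata-0.7.1-py3-none-any.whl": "1f38b9545d6450fc6b06f236e3004aaa94ff5da99b573a4a532c746b0d9b7781",
}


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path) -> bool:
    """Return True only if the file name is known and its digest matches."""
    expected = EXPECTED_SHA256.get(path.name)
    return expected is not None and sha256_of(path) == expected


if __name__ == "__main__":
    target = Path("aiondata-0.7.1.tar.gz")  # assumed local download location
    if target.exists():
        print("digest matches" if verify(target) else "digest MISMATCH")
    else:
        print(f"{target} not found")
```

pip can also enforce digests itself through its hash-checking mode (`--require-hashes` with a requirements file), which may be preferable in automated setups.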
