Command Line Interface for projects

⚗️ ChemMatData

The chem_mat_data package provides easy access to a large range of property prediction datasets from chemistry and materials science. The aim of this package is to provide the datasets in a unified format suitable for machine learning applications, in particular for training graph neural networks (GNNs).

To that end, chem_mat_data provides simple, single-line command line (CLI) and programming (API) interfaces to download datasets in either raw or processed (graph) format.

Features:

Getting ready to train a PyTorch Geometric model can be as easy as this:

from chem_mat_data import load_graph_dataset, pyg_data_list_from_graphs
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader

# Load the dataset of graphs
graphs: list[dict] = load_graph_dataset('clintox')

# Convert the graph dicts into PyG Data objects
data_list: list[Data] = pyg_data_list_from_graphs(graphs)
data_loader: DataLoader = DataLoader(data_list, batch_size=32, shuffle=True)

# Network training...

📦 Pip Installation

Install the latest stable release using pip from the Python Package Index (PyPI):

pip install chem_mat_database

Or install the latest development version directly from the GitHub repository:

pip install git+https://github.com/the16thpythonist/chem_mat_data.git

⌨️ Command Line Interface (CLI)

The package provides the cmdata command line interface (CLI) to interact with the remote database.

To see the list of all available commands, simply use the --help flag:

cmdata --help

Listing Available Datasets

To see which datasets are available for download from the remote file share server, use the list command:

cmdata list

This will print a table containing all the datasets which are currently available to download from the database. Each row of the table represents one dataset and contains the name of the dataset, the number of molecules in the dataset and the number of target properties as additional columns.

Downloading Datasets

To download one of the listed datasets, pass its name to the download command:

cmdata download "clintox"

This will download the clintox.csv dataset file to your current working directory.

You can also specify the path to which the dataset should be downloaded as follows:

cmdata download --path="/tmp" "clintox"
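
If several datasets need to be fetched from a script, the download command can also be driven from Python. The following is only a minimal sketch: the helper names build_download_command and download_datasets are hypothetical and not part of chem_mat_data, and the only CLI syntax assumed is the download command and the --path option shown above.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Optional


def build_download_command(name: str, path: Optional[str] = None) -> list[str]:
    """Build the argument list for `cmdata download` (hypothetical helper)."""
    command = ["cmdata", "download"]
    if path is not None:
        # Same form as on the command line: --path="/tmp"
        command.append(f"--path={path}")
    command.append(name)
    return command


def download_datasets(names: list[str], path: str) -> None:
    """Download each named dataset into `path` by calling the cmdata CLI."""
    if shutil.which("cmdata") is None:
        raise RuntimeError("cmdata not found on PATH; install the package first")
    Path(path).mkdir(parents=True, exist_ok=True)
    for name in names:
        # check=True raises CalledProcessError if the CLI reports a failure
        subprocess.run(build_download_command(name, path), check=True)


# Only builds and prints the command; nothing is downloaded here
command = build_download_command("clintox", path="/tmp")
print(command)
```

Calling download_datasets(["clintox"], "/tmp") would then run the same command as the shell example above, provided cmdata is installed and the remote server is reachable.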

🚀 Quickstart

Alternatively, the chem_mat_data functionality can be used programmatically as part of Python code. The package provides each dataset either in raw or processed/graph format (for further information on the distinction, visit the [Documentation](https://the16thpythonist.github.io/chem_mat_data/api_datasets/)).

Raw Datasets

You can use the load_smiles_dataset function to download the raw dataset format. This function will return the dataset as a pandas.DataFrame object which contains a “smiles” column along with the specific target value annotations as separate data frame columns.

import pandas as pd
from chem_mat_data import load_smiles_dataset

df: pd.DataFrame = load_smiles_dataset('clintox')
print(df.head())
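
Once the data frame is loaded, a common next step is to separate the SMILES strings from the target columns and set aside a test split. The sketch below uses a small hand-made stand-in frame so that it runs without a download; the column names target_a and target_b and all values are illustrative, and it assumes that every column other than "smiles" is a target annotation, which may not hold for every dataset.

```python
import pandas as pd

# Hand-made stand-in for the frame returned by load_smiles_dataset();
# the target column names and values are illustrative only.
df = pd.DataFrame({
    "smiles": ["CCO", "c1ccccc1", "CC(=O)O", "CCN"],
    "target_a": [0, 1, 0, 1],
    "target_b": [1, 0, 0, 1],
})

# Assumption: every column other than "smiles" holds a target annotation
target_columns = [col for col in df.columns if col != "smiles"]

smiles: list[str] = df["smiles"].tolist()
targets = df[target_columns].to_numpy()

# Reproducible random 75/25 split over the rows
train_df = df.sample(frac=0.75, random_state=0)
test_df = df.drop(train_df.index)

print(target_columns)
print(targets.shape)
print(len(train_df), len(test_df))
```

With a real dataset, the stand-in frame would simply be replaced by the result of load_smiles_dataset('clintox').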

Graph Datasets

You can also use the load_graph_dataset function to download the same dataset in the pre-processed graph representation. This function will return a list of dict objects which contain the full graph representation of the corresponding molecules.

from rich.pretty import pprint
from chem_mat_data import load_graph_dataset

graphs: list[dict] = load_graph_dataset('clintox')
example_graph = graphs[0]
pprint(example_graph)

For further information on the graph representation, visit the [Documentation](https://the16thpythonist.github.io/chem_mat_data/graph_representation/).
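
The training example later in this document reads the model dimensions from two entries of a graph dict: 'node_attributes' (one row of features per node) and 'graph_labels' (one entry per target). As a rough sketch of that lookup, the hypothetical helper below, which is not part of chem_mat_data, extracts both sizes from a hand-made stand-in dict. The real graph dicts contain further entries that are described in the documentation and are not reproduced here.

```python
import numpy as np


def infer_model_dims(graph: dict) -> tuple[int, int]:
    """Return (number of node features, number of targets) for one graph dict."""
    num_node_features = int(graph["node_attributes"].shape[1])
    num_targets = int(graph["graph_labels"].shape[0])
    return num_node_features, num_targets


# Hand-made stand-in containing only the two keys used in this document
graph = {
    "node_attributes": np.zeros((5, 3), dtype=float),  # 5 nodes, 3 features each
    "graph_labels": np.array([0.0, 1.0]),              # 2 target values
}

in_channels, out_channels = infer_model_dims(graph)
print(in_channels, out_channels)
```

The two returned values correspond to the in_channels and out_channels arguments passed to the model in the training example.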

Training Graph Neural Networks

Finally, the following code snippet demonstrates how to set up a graph neural network (GNN) model with the PyTorch Geometric library and run a forward pass on a batch of graphs loaded from the chem_mat_data package. The training loop itself (loss, optimizer, epochs) is not shown.

from torch import Tensor
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader
from torch_geometric.nn.models import GIN
from rich.pretty import pprint

from chem_mat_data import load_graph_dataset, pyg_data_list_from_graphs

# Load the dataset of graphs
graphs: list[dict] = load_graph_dataset('clintox')
example_graph = graphs[0]
pprint(example_graph)

# Convert the graph dicts into PyG Data objects
data_list = pyg_data_list_from_graphs(graphs)
data_loader = DataLoader(data_list, batch_size=32, shuffle=True)

# Construct a GNN model
model = GIN(
    in_channels=example_graph['node_attributes'].shape[1],
    out_channels=example_graph['graph_labels'].shape[0],
    hidden_channels=32,
    num_layers=3,
)

# Perform model forward pass with a batch of graphs
data: Data = next(iter(data_loader))
out_pred: Tensor = model(
    x=data.x,
    edge_index=data.edge_index,
    batch=data.batch
)
pprint(out_pred)

🤝 Credits

We thank the following packages, institutions and individuals for their significant impact on this package.

  • PyComex is a micro framework which simplifies the setup, processing and management of computational experiments. It is also used to auto-generate the command line interface that can be used to interact with these experiments.
