Command Line Interface for projects

⚗️ ChemMatData

The chem_mat_data package provides easy access to a wide range of property prediction datasets from Chemistry and Material Science. The aim of this package is to provide the datasets in a unified format suitable for machine learning applications, specifically for training graph neural networks (GNNs).

Specifically, chem_mat_data addresses these aims by providing simple, single-line command line (CLI) and programming (API) interfaces to download datasets either in raw or in processed (graph) format.

Features:

Getting ready to train a PyTorch Geometric model can be as easy as this:

from chem_mat_data import load_graph_dataset, pyg_data_list_from_graphs
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader

# Load the dataset of graphs
graphs: list[dict] = load_graph_dataset('clintox')

# Convert the graph dicts into PyG Data objects
data_list: list[Data] = pyg_data_list_from_graphs(graphs)
data_loader: DataLoader = DataLoader(data_list, batch_size=32, shuffle=True)

# Network training...

📦 Pip Installation

Install the latest stable release using pip from the Python Package Index (PyPI):

pip install chem_mat_database

Or install the latest development version directly from the GitHub repository:

pip install git+https://github.com/the16thpythonist/chem_mat_data.git

⌨️ Command Line Interface (CLI)

The package provides the cmdata command line interface (CLI) to interact with the remote database.

To see the list of all available commands, simply use the --help flag:

cmdata --help

Listing Available Datasets

To see which datasets are available for download from the remote file share server, use the list command:

cmdata list

This will print a table containing all the datasets which are currently available to download from the database. Each row of the table represents one dataset and contains the name of the dataset, the number of molecules in the dataset, and the number of target properties.

Downloading Datasets

To download one of these datasets, pass its name to the download command:

cmdata download "clintox"

This will download the clintox.csv dataset file to your current working directory.

You can also specify the path to which the dataset should be downloaded as follows:

cmdata download --path="/tmp" "clintox"
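
For scripted workflows, the same download can be triggered from Python by calling the CLI as a subprocess. The following is a minimal sketch and not part of the package: the helper names `build_download_command` and `download_dataset` are hypothetical, and it assumes the `cmdata` executable is available on the PATH. It only uses the `download` command and the `--path` option shown above.

```python
import shlex
import subprocess
from typing import Optional


def build_download_command(name: str, path: Optional[str] = None) -> list[str]:
    """Assemble the argument list for `cmdata download` (hypothetical helper)."""
    command = ["cmdata", "download"]
    if path is not None:
        # Same form as the `--path="/tmp"` option shown above
        command.append(f"--path={path}")
    command.append(name)
    return command


def download_dataset(name: str, path: Optional[str] = None) -> None:
    """Run the CLI; raises CalledProcessError if cmdata exits with an error."""
    subprocess.run(build_download_command(name, path), check=True)


# Only build and print the command here; call download_dataset(...) to run it
command = build_download_command("clintox", path="/tmp")
print(shlex.join(command))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when the target path contains spaces.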

🚀 Quickstart

Alternatively, the chem_mat_data functionality can be used programmatically as part of Python code. The package provides each dataset either in raw or processed/graph format (for further information on the distinction, visit the [Documentation](https://the16thpythonist.github.io/chem_mat_data/api_datasets/)).

Raw Datasets

You can use the load_smiles_dataset function to download the raw dataset format. This function will return the dataset as a pandas.DataFrame object which contains a “smiles” column along with the specific target value annotations as separate data frame columns.

import pandas as pd
from chem_mat_data import load_smiles_dataset

df: pd.DataFrame = load_smiles_dataset('clintox')
print(df.head())
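
Since the raw format is a plain data frame, separating the molecule strings from the target annotations takes only a few lines. The sketch below uses a small hand-made stand-in frame instead of a real download; the column names `target_a` and `target_b` are made up, and the helper `split_smiles_and_targets` is hypothetical. It assumes that every column other than "smiles" holds a target value, which may not be true for every dataset.

```python
import pandas as pd


def split_smiles_and_targets(df: pd.DataFrame) -> tuple[pd.Series, pd.DataFrame]:
    """Separate the "smiles" column from the remaining columns.

    Assumption: all non-"smiles" columns are target annotations.
    """
    smiles = df["smiles"]
    targets = df.drop(columns=["smiles"])
    return smiles, targets


# Stand-in for the frame returned by load_smiles_dataset (values are made up)
df = pd.DataFrame({
    "smiles": ["CCO", "c1ccccc1", "CC(=O)O"],
    "target_a": [0, 1, 0],
    "target_b": [1, 1, 0],
})

smiles, targets = split_smiles_and_targets(df)
print(smiles.tolist())
print(targets.columns.tolist())
```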

Graph Datasets

You can also use the load_graph_dataset function to download the same dataset in the pre-processed graph representation. This function will return a list of dict objects which contain the full graph representation of the corresponding molecules.

from rich.pretty import pprint
from chem_mat_data import load_graph_dataset

graphs: list[dict] = load_graph_dataset('clintox')
example_graph = graphs[0]
pprint(example_graph)

For further information on the graph representation, visit the [Documentation](https://the16thpythonist.github.io/chem_mat_data/graph_representation/).
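
Before building a model, it can help to confirm that all graphs agree on the number of node features and targets. The following sketch reads only the `node_attributes` and `graph_labels` keys, which also appear in the training example in this README. The helper `infer_dimensions` is hypothetical, and the stand-in graphs assume these entries are NumPy arrays of shape (num_nodes, num_features) and (num_targets,) respectively.

```python
import numpy as np


def infer_dimensions(graphs: list[dict]) -> tuple[int, int]:
    """Return (num_node_features, num_targets) and check they are consistent."""
    first = graphs[0]
    num_node_features = first["node_attributes"].shape[1]
    num_targets = first["graph_labels"].shape[0]

    for index, graph in enumerate(graphs):
        if graph["node_attributes"].shape[1] != num_node_features:
            raise ValueError(f"graph {index}: unexpected node feature count")
        if graph["graph_labels"].shape[0] != num_targets:
            raise ValueError(f"graph {index}: unexpected target count")

    return num_node_features, num_targets


# Stand-in graphs with made-up values; real ones come from load_graph_dataset
graphs = [
    {"node_attributes": np.zeros((3, 4)), "graph_labels": np.array([0.0, 1.0])},
    {"node_attributes": np.zeros((5, 4)), "graph_labels": np.array([1.0, 1.0])},
]

in_channels, out_channels = infer_dimensions(graphs)
print(in_channels, out_channels)
```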

Training Graph Neural Networks

Finally, the following code snippet demonstrates how to set up a graph neural network (GNN) model with the PyTorch Geometric library and run a forward pass on a batch of graphs loaded from the chem_mat_data package.

from torch import Tensor
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader
from torch_geometric.nn.models import GIN
from rich.pretty import pprint

from chem_mat_data import load_graph_dataset, pyg_data_list_from_graphs

# Load the dataset of graphs
graphs: list[dict] = load_graph_dataset('clintox')
example_graph = graphs[0]
pprint(example_graph)

# Convert the graph dicts into PyG Data objects
data_list = pyg_data_list_from_graphs(graphs)
data_loader = DataLoader(data_list, batch_size=32, shuffle=True)

# Construct a GNN model
model = GIN(
    in_channels=example_graph['node_attributes'].shape[1],
    out_channels=example_graph['graph_labels'].shape[0],
    hidden_channels=32,
    num_layers=3,
)

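# Perform a forward pass with a batch of graphs.
# Calling the model directly (instead of model.forward) is the idiomatic
# PyTorch form and also runs any registered hooks.
data: Data = next(iter(data_loader))
out_pred: Tensor = model(
    x=data.x,
    edge_index=data.edge_index,
    batch=data.batch,
)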
pprint(out_pred)
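
As far as the PyTorch Geometric documentation describes it, the GIN model from torch_geometric.nn.models returns one output row per node, so `out_pred` above would have shape (num_nodes, out_channels) rather than one row per graph. For graph-level targets, the node outputs are usually aggregated per graph using the batch vector. The sketch below illustrates mean pooling in plain NumPy on made-up values. The helper `mean_pool` is hypothetical; on real tensors, PyTorch Geometric's own `global_mean_pool` serves the same purpose.

```python
import numpy as np


def mean_pool(node_out: np.ndarray, batch: np.ndarray) -> np.ndarray:
    """Average node-level outputs per graph.

    node_out: (num_nodes, num_outputs) array of per-node values
    batch:    (num_nodes,) integer array mapping each node to its graph index
    """
    num_graphs = int(batch.max()) + 1
    sums = np.zeros((num_graphs, node_out.shape[1]), dtype=float)
    # Unbuffered add, so repeated graph indices accumulate correctly
    np.add.at(sums, batch, node_out)
    counts = np.bincount(batch, minlength=num_graphs).reshape(-1, 1)
    return sums / counts


# Three nodes: the first two belong to graph 0, the last one to graph 1
node_out = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]])
batch = np.array([0, 0, 1])

pooled = mean_pool(node_out, batch)
print(pooled.shape)
```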

🤝 Credits

We thank the following packages, institutions and individuals for their significant impact on this package.

  • PyComex is a micro framework which simplifies the setup, processing and management of computational experiments. It is also used to auto-generate the command line interface that can be used to interact with these experiments.
