⚗️ ChemMatData

The chem_mat_data package provides easy access to a large range of property prediction datasets from chemistry and materials science. Its aim is to provide these datasets in a unified format suitable for machine learning applications, in particular for training graph neural networks (GNNs).

To this end, chem_mat_data offers simple, single-line command line (CLI) and programming (API) interfaces to download datasets in either raw or processed (graph) format.

Features:

Getting ready to train a PyTorch Geometric model can be as easy as this:

from chem_mat_data import load_graph_dataset, pyg_data_list_from_graphs
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader

# Load the dataset of graphs
graphs: list[dict] = load_graph_dataset('clintox')

# Convert the graph dicts into PyG Data objects
data_list: list[Data] = pyg_data_list_from_graphs(graphs)
data_loader: DataLoader = DataLoader(data_list, batch_size=32, shuffle=True)

# Network training...

📦 Pip Installation

Install the latest stable release using pip from the Python Package Index (PyPI):

pip install chem_mat_database

Or install the latest development version directly from the GitHub repository:

pip install git+https://github.com/the16thpythonist/chem_mat_data.git

⌨️ Command Line Interface (CLI)

The package provides the cmdata command line interface (CLI) to interact with the remote database.

To see the list of all available commands, simply use the --help flag:

cmdata --help

Listing Available Datasets

To see which datasets are available for download from the remote file share server, use the list command:

cmdata list

This will print a table of all the datasets that are currently available for download from the database. Each row of the table represents one dataset and contains the name of the dataset, the number of molecules in the dataset, and the number of target properties.

Downloading Datasets

To download one of the listed datasets, for example clintox, use the download command:

cmdata download "clintox"

This will download the clintox.csv dataset file to your current working directory.

You can also specify the path to which the dataset should be downloaded as follows:

cmdata download --path="/tmp" "clintox"
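Once downloaded, the CSV file can be read with standard tooling. The following is a minimal sketch using only the Python standard library. It writes a tiny stand-in file so that it runs without a download; the assumption that the file has a "smiles" column and the column name "target" are illustrative and not confirmed by the package documentation.

```python
import csv
import tempfile
from pathlib import Path


def read_dataset_csv(path: Path) -> list[dict[str, str]]:
    """Read a dataset CSV into a list of row dicts keyed by column name."""
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Stand-in file so the sketch runs without a download. In practice the
# path would point at the clintox.csv written by `cmdata download`.
# ASSUMPTION: a "smiles" column exists; "target" is a hypothetical column name.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "clintox.csv"
    csv_path.write_text("smiles,target\nCCO,0\nc1ccccc1,1\n", encoding="utf-8")
    rows = read_dataset_csv(csv_path)

# All values are read as strings; convert target columns as needed.
smiles = [row["smiles"] for row in rows]
print(len(rows), smiles)
```

Replace the stand-in path with the location passed to --path to inspect a real download.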

🚀 Quickstart

Alternatively, the chem_mat_data functionality can be used programmatically as part of Python code. The package provides each dataset in either raw or processed (graph) format. For further information on the distinction, visit the [Documentation](https://the16thpythonist.github.io/chem_mat_data/api_datasets/).

Raw Datasets

You can use the load_smiles_dataset function to download the raw dataset format. This function will return the dataset as a pandas.DataFrame object which contains a “smiles” column along with the specific target value annotations as separate data frame columns.

import pandas as pd
from chem_mat_data import load_smiles_dataset

df: pd.DataFrame = load_smiles_dataset('clintox')
print(df.head())
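Because the targets are stored as separate columns next to "smiles", they can be selected by exclusion. The sketch below uses a small stand-in data frame instead of a real download; the column names "target_a" and "target_b" and the binary labels are assumptions made for illustration.

```python
import pandas as pd

# Stand-in for the frame returned by load_smiles_dataset.
# ASSUMPTION: "target_a" and "target_b" are hypothetical column names
# and the labels are binary.
df = pd.DataFrame(
    {
        "smiles": ["CCO", "c1ccccc1", "CC(=O)O", "CCN"],
        "target_a": [0, 1, 0, 1],
        "target_b": [1, 1, 0, 0],
    }
)

# Treat every column except "smiles" as a target annotation.
target_columns = [column for column in df.columns if column != "smiles"]

# Fraction of positive labels per target, useful for spotting class imbalance.
positive_rate = df[target_columns].mean()

print(target_columns)
print(positive_rate.to_dict())
```

The same selection should carry over to a real dataset as long as "smiles" is the only non-target column, which is worth checking with df.columns first.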

Graph Datasets

You can also use the load_graph_dataset function to download the same dataset in the pre-processed graph representation. This function will return a list of dict objects which contain the full graph representation of the corresponding molecules.

from rich.pretty import pprint
from chem_mat_data import load_graph_dataset

graphs: list[dict] = load_graph_dataset('clintox')
example_graph = graphs[0]
pprint(example_graph)

For further information on the graph representation, visit the [Documentation](https://the16thpythonist.github.io/chem_mat_data/graph_representation/).

Training Graph Neural Networks

Finally, the following code snippet demonstrates how to prepare for training a graph neural network (GNN) with the PyTorch Geometric library: it loads a dataset from the chem_mat_data package, constructs a model, and runs a forward pass on one batch of graphs.

from torch import Tensor
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader
from torch_geometric.nn.models import GIN
from rich.pretty import pprint

from chem_mat_data import load_graph_dataset, pyg_data_list_from_graphs

# Load the dataset of graphs
graphs: list[dict] = load_graph_dataset('clintox')
example_graph = graphs[0]
pprint(example_graph)

# Convert the graph dicts into PyG Data objects
data_list = pyg_data_list_from_graphs(graphs)
data_loader = DataLoader(data_list, batch_size=32, shuffle=True)

# Construct a GNN model
model = GIN(
    in_channels=example_graph['node_attributes'].shape[1],
    out_channels=example_graph['graph_labels'].shape[0],
    hidden_channels=32,
    num_layers=3,
)

# Perform a forward pass with one batch of graphs.
# Calling the module itself (rather than .forward) also runs registered hooks.
data: Data = next(iter(data_loader))
out_pred: Tensor = model(
    x=data.x,
    edge_index=data.edge_index,
    batch=data.batch,
)
pprint(out_pred)
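Before an actual training loop, the list of graphs is usually divided into a training and a held-out subset. The following is a minimal sketch of a seeded random split using only the standard library. The helper name train_test_split and the "graph_index" key are hypothetical and not part of chem_mat_data; plain dicts stand in for the real graph dicts.

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def train_test_split(
    items: Sequence[T], test_fraction: float = 0.2, seed: int = 0
) -> tuple[list[T], list[T]]:
    """Randomly split items into (train, test) lists, reproducibly per seed."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    indices = list(range(len(items)))
    # A dedicated Random instance avoids touching the global random state.
    random.Random(seed).shuffle(indices)
    num_test = max(1, round(len(items) * test_fraction))
    test_indices = indices[:num_test]
    train_indices = indices[num_test:]
    return [items[i] for i in train_indices], [items[i] for i in test_indices]


# Stand-in for the list returned by load_graph_dataset.
# ASSUMPTION: "graph_index" is a hypothetical key used only for this demo.
graphs = [{"graph_index": i} for i in range(10)]
train_graphs, test_graphs = train_test_split(graphs, test_fraction=0.2, seed=42)
print(len(train_graphs), len(test_graphs))
```

Each subset can then be converted with pyg_data_list_from_graphs and wrapped in its own DataLoader, as in the snippet above.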

🤝 Credits

We thank the following packages, institutions and individuals for their significant impact on this package.

  • PyComex is a micro framework which simplifies the setup, processing and management of computational experiments. It is also used to auto-generate the command line interface that can be used to interact with these experiments.
