⚗️ ChemMatData

The chem_mat_data package provides easy access to a large range of property prediction datasets from chemistry and materials science. Its aim is to provide these datasets in a unified format suitable for machine learning applications, and specifically for training graph neural networks (GNNs).

To this end, chem_mat_data provides simple, single-line command line (CLI) and programming (API) interfaces for downloading datasets in either raw or processed (graph) format.

Getting ready to train a PyTorch Geometric model can be as easy as this:

from chem_mat_data import load_graph_dataset, pyg_data_list_from_graphs
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader

# Load the dataset of graphs
graphs: list[dict] = load_graph_dataset('clintox')

# Convert the graph dicts into PyG Data objects
data_list: list[Data] = pyg_data_list_from_graphs(graphs)
data_loader: DataLoader = DataLoader(data_list, batch_size=32, shuffle=True)

# Network training...

📦 Pip Installation

Install the latest stable release using pip from the Python Package Index (PyPI):

pip install chem_mat_database

Or install the latest development version directly from the GitHub repository:

pip install git+https://github.com/the16thpythonist/chem_mat_data.git

⌨️ Command Line Interface (CLI)

The package provides the cmdata command line interface (CLI) to interact with the remote database.

To see the list of all available commands, simply use the --help flag:

cmdata --help

Listing Available Datasets

To see which datasets are available for download from the remote file share server, use the list command:

cmdata list

This will print a table of all the datasets that are currently available for download from the database. Each row of the table represents one dataset and contains the name of the dataset, with the number of molecules and the number of target properties as additional columns.

Downloading Datasets

To download a dataset, pass its name to the download command:

cmdata download "clintox"

This will download the clintox.csv dataset file to your current working directory.

You can also specify the path to which the dataset should be downloaded as follows:

cmdata download --path="/tmp" "clintox"
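
Once the file has been downloaded, it can be inspected with ordinary tooling. The following is a minimal sketch using only the Python standard library. It assumes that the command above produced /tmp/clintox.csv and that the file starts with a header row; the helper name peek_csv is hypothetical and not part of chem_mat_data.

```python
import csv
from pathlib import Path


def peek_csv(path: str, num_rows: int = 3) -> tuple[list[str], list[dict]]:
    """Return the column names and the first ``num_rows`` rows of a CSV file.

    Hypothetical helper for illustration; not part of chem_mat_data.
    """
    with open(path, newline="", encoding="utf-8") as file:
        reader = csv.DictReader(file)
        rows = []
        for row in reader:
            if len(rows) >= num_rows:
                break
            rows.append(row)
        # fieldnames is None for an empty file, so fall back to an empty list
        columns = list(reader.fieldnames or [])
    return columns, rows


# Assumed location: the result of `cmdata download --path="/tmp" "clintox"`
csv_path = Path("/tmp") / "clintox.csv"
if csv_path.exists():
    columns, rows = peek_csv(str(csv_path))
    print(columns)
    for row in rows:
        print(row)
```

All values are returned as strings here; for numeric work, loading the file with pandas is usually more convenient.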

🚀 Quickstart

Alternatively, the chem_mat_data functionality can be used programmatically as part of Python code. The package provides each dataset either in raw or processed/graph format (for further information on the distinction, visit the [Documentation](https://the16thpythonist.github.io/chem_mat_data/api_datasets/)).

Raw Datasets

You can use the load_smiles_dataset function to download a dataset in its raw format. This function returns the dataset as a pandas.DataFrame object which contains a "smiles" column, with the target value annotations stored in separate data frame columns.

import pandas as pd
from chem_mat_data import load_smiles_dataset

df: pd.DataFrame = load_smiles_dataset('clintox')
print(df.head())
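
For model training it is often convenient to separate the SMILES strings from the target columns. The following is a minimal sketch that assumes only what is stated above: a "smiles" column plus one column per target. The toy data frame, the column names target_a and target_b and the helper split_smiles_and_targets are made up for illustration and are not part of chem_mat_data; a real dataset may contain additional columns that need separate handling.

```python
import pandas as pd


def split_smiles_and_targets(
    df: pd.DataFrame, smiles_column: str = "smiles"
) -> tuple[list[str], pd.DataFrame]:
    """Split a raw dataset frame into SMILES strings and the remaining columns."""
    smiles = df[smiles_column].tolist()
    targets = df.drop(columns=[smiles_column])
    return smiles, targets


# Toy stand-in for the frame returned by load_smiles_dataset (values are made up)
toy_df = pd.DataFrame(
    {
        "smiles": ["CCO", "c1ccccc1", "CC(=O)O"],
        "target_a": [0, 1, 0],
        "target_b": [1, 1, 0],
    }
)
smiles, targets = split_smiles_and_targets(toy_df)
print(smiles)
print(targets.shape)
```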

Graph Datasets

You can also use the load_graph_dataset function to download the same dataset in the pre-processed graph representation. This function will return a list of dict objects which contain the full graph representation of the corresponding molecules.

from rich.pretty import pprint
from chem_mat_data import load_graph_dataset

graphs: list[dict] = load_graph_dataset('clintox')
example_graph = graphs[0]
pprint(example_graph)

For further information on the graph representation, visit the [Documentation](https://the16thpythonist.github.io/chem_mat_data/graph_representation/).
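
A quick way to get an overview of such a graph dict is to list the shape of every array it contains. The sketch below is duck-typed on a shape attribute and makes no assumption about which keys are present. The toy graph uses the two keys that appear in the training example of this README, node_attributes and graph_labels, with made-up values; summarize_graph is a hypothetical helper, not a chem_mat_data function.

```python
import numpy as np


def summarize_graph(graph: dict) -> dict:
    """Map each key to the shape of its value, or to the value's type name."""
    summary = {}
    for key, value in graph.items():
        shape = getattr(value, "shape", None)
        summary[key] = tuple(shape) if shape is not None else type(value).__name__
    return summary


# Toy graph dict with made-up values: 3 nodes with 4 features each, 1 label
toy_graph = {
    "node_attributes": np.zeros((3, 4), dtype=float),
    "graph_labels": np.array([1.0]),
}
for key, shape in summarize_graph(toy_graph).items():
    print(key, shape)
```

The same helper can be applied to an element of the list returned by load_graph_dataset to see which arrays a real graph carries.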

Training Graph Neural Networks

Finally, the following code snippet demonstrates how to train a graph neural network (GNN) model using the PyTorch Geometric library with the dataset loaded from the chem_mat_data package.

from torch import Tensor
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader
from torch_geometric.nn.models import GIN
from rich.pretty import pprint

from chem_mat_data import load_graph_dataset, pyg_data_list_from_graphs

# Load the dataset of graphs
graphs: list[dict] = load_graph_dataset('clintox')
example_graph = graphs[0]
pprint(example_graph)

# Convert the graph dicts into PyG Data objects
data_list = pyg_data_list_from_graphs(graphs)
data_loader = DataLoader(data_list, batch_size=32, shuffle=True)

# Construct a GNN model
model = GIN(
    in_channels=example_graph['node_attributes'].shape[1],
    out_channels=example_graph['graph_labels'].shape[0],
    hidden_channels=32,
    num_layers=3,
)

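# Perform a forward pass with a batch of graphs. Calling the model directly
# (instead of model.forward) also runs any registered hooks.
# Note: GIN returns one output row per node; a pooling step would be needed
# to obtain one prediction per graph.
data: Data = next(iter(data_loader))
out_pred: Tensor = model(
    x=data.x,
    edge_index=data.edge_index,
    batch=data.batch,
)
pprint(out_pred)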
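
Before an actual training run, part of the data is usually held out for evaluation. The following is a minimal sketch of a seeded random split over a list of graph dicts, using only the standard library. The helper split_graphs and its parameters are hypothetical and not part of chem_mat_data, and the toy list stands in for the output of load_graph_dataset.

```python
import random


def split_graphs(
    graphs: list, test_fraction: float = 0.2, seed: int = 0
) -> tuple[list, list]:
    """Shuffle a copy of ``graphs`` and split it into (train, test) lists."""
    if not 0.0 <= test_fraction <= 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(graphs)  # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    num_test = int(round(len(shuffled) * test_fraction))
    return shuffled[num_test:], shuffled[:num_test]


# Toy stand-in for the list returned by load_graph_dataset
toy_graphs = [{"index": i} for i in range(10)]
train_graphs, test_graphs = split_graphs(toy_graphs, test_fraction=0.2, seed=42)
print(len(train_graphs), len(test_graphs))
```

Each of the two lists can then be converted with pyg_data_list_from_graphs and wrapped in its own DataLoader, as in the snippet above. A random split is the simplest choice; for molecular data, other strategies such as scaffold splits are also common.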

🤝 Credits

We thank the following packages, institutions and individuals for their significant impact on this package.

  • PyComex is a micro framework which simplifies the setup, processing and management of computational experiments. It is also used to auto-generate the command line interface that can be used to interact with these experiments.
