Skip to main content

DBGym: platform for relational database benchmark RDBench

Project description

DBGym

Testing Status Linting Status

pip install dbgym	# Install DBGym

DBGym is a platform designed to facilitate ML research and application on databases. With less than 5 lines of code, you can point to your database, write the predictive query you want, and DBGym will output the predictions along with your database.

In the current release, DBGym focuses on relational databases (databases with multiple tables). In practice, it simply means that you would need to prepare your data into a directory with CSV or Parquet files, each representing a table. Then, you would need to prepare the column names in each file according to a standard naming convention:

For example, you may have the following files in your database directory

user.csv	item.csv	trans.csv

Then, the column names for each file should look like

user.csv:	_user, x1, x2, ...
item.csv:	_item, x1, x2, ...
trans.csv:	_trans, _user, _item, x1, x2, ...

It is worth noticing that the name of files and columns should preferably without ., and the column name begin with _ indicates a key value instead of feature.

Quick Start

With less than 5 lines of code, you can set up and carryout experiments using DBGym.

from dbgym.run import run
from dbgym.config import get_config

config = get_config()
stats = run(config)

Customize Configuration

For users who want to customize configuration, i.e. carry out experiment using a specific model on a specific dataset, we provide two ways to customize configuration: using code and using yaml file, respectively.

Code for Configuration

For example, we can set the target dataset using the following code:

from dbgym.run import run
from dbgym.config import get_config

config = get_config()
config.merge_from_list(['dataset.name', 'rdb1-ather'])
config.merge_from_list(['dataset.query', 'entry_examination.Cholesterol'])
stats = run(config)

File for Configuration

For example, we can set the model using the following code:

from dbgym.run import run
from dbgym.config import get_config

config = get_config()
config.merge_from_file('config_mlp.yaml')
stats = run(config)

And the yaml config_mlp.yaml is as follow:

# Example Experiment Configuration for MLP

train:
  epoch: 200

model:
  name: 'MLP'
  hidden_dim: 128
  layer: 4

optim:
  lr: 0.001

Customize Dataset

For users who want to customize dataset, i.e. carry out experiment on a given dataset, we provide the following code to customize the dataset configuration.

from dbgym.run import run
from dbgym.config import get_config

config = get_config()
config.merge_from_list(['dataset.dir', 'your_dataset_path'])
config.merge_from_list(['dataset.name', 'your_dataset_name'])
config.merge_from_list(['dataset.query', 'your_target_file.your_target_column'])
stats = run(config)

Where the directory of the dataset should be as follow:

tree your_dataset_path
your_dataset_path
├── your_dataset_name
│   ├── your_target_file
│   ├── other_file
│   └── ...
├── other_dataset
└── ...

Customize Model

For users who want to customize model, i.e. carry out experiment with a novel model on our proposed dataset, we provide the following code to customize the your own model and contribute to our reposity.

Customize your own model in dbgym/contribute/your_model.py:

from dbgym.register import register

class YourModel(nn.Module):
    ...

your_model_type = 'tabular_model' or 'graph_model'
register(your_model_type, 'your_model_name', YourModel)
from dbgym.run import run
from dbgym.config import get_config

config = get_config()
config.merge_from_list(['model.name', 'your_model_name'])
stats = run(config)

Highlights

DBGym has enabled RDBench, a user-friendly toolkit for benchmarking ML methods on relational databases. Please refer to our paper for more details: RDBench: ML Benchmark for Relational Databases, arXiv, Zizhao Zhang*, Yi Yang*, Lutong Zou*, He Wen*, Tao Feng, Jiaxuan You.


Figure 1: An overview of our proposed benchmark RDBench.

1. Unified Task Definition for Various Data Formats.

  • To meet the requirements of diverse users, DBGym provides 3 kinds of data: tabular data, homogeneous graphs and heterogeneous graphs.
  • For all these data formats, we propose a unified task definition, enabling results comparison between models for different formats of data.

2. Hierarchical Datasets with Comprehensive Experiments

  • DBGym provides 11 datasets with a range of scales, domains, and relationships.
  • These datasets are categorized into three groups based on their relationship complexity.
  • Extensive experiments with 10 baselines are carried out on these datasets.

3. Easy-to-use Interfaces with Robust Results

  • Highly modularized pipeline.
  • Reproducible experiment configuration.
  • Scalable experiment management.
  • Flexible user customization.
  • Results reported are averaged over the same dataset and same task type (classification or regression).

Logging

Visualisation

To see the visualisation of training process, go into the output directory and run

tensorboard --logdir .

Then visit localhost:6006, or any other port you specified.

Installation

DBGym is available for Python 3.10 and above.

pip install dbgym

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dbgym-0.1.1.tar.gz (881.5 kB view hashes)

Uploaded Source

Built Distribution

dbgym-0.1.1-py3-none-any.whl (26.1 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page