JoinBoost: In-Database Tree-Models over Many Tables
JoinBoost is a Python library that helps you train tree models (decision trees, gradient-boosted trees, and random forests).
Why JoinBoost?
JoinBoost's algorithms follow LightGBM. However, JoinBoost trains models differently:
- Inside the database: JoinBoost translates ML algorithms into SQL queries and executes them directly in your database. This means:
- Safety: your data never leaves the database.
- Transformation: you can perform OLAP and data transformation directly in SQL.
- Scalability: in-database ML is natively out-of-core, and JoinBoost can be connected to distributed databases.
- Many tables: JoinBoost applies factorized learning with optimized algorithms. It therefore trains a model over the join result of many tables without ever materializing the join, which yields large performance improvements and is very convenient.
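To build intuition for why the join never needs to be materialized, the following sketch (plain Python with made-up toy data, not JoinBoost's actual code) computes a grouped sum over the join of a fact table and a dimension table in two ways: naively, by enumerating the join, and factorized, by pre-aggregating the fact table per join key and only then combining with the small dimension table. Tree-model training reduces to aggregates of exactly this shape.

```python
# Toy tables: sales(item_nbr, total_sales) and items(item_nbr -> family).
sales = [(1, 10.0), (1, 5.0), (2, 7.0), (3, 2.0)]
items = {1: "GROCERY", 2: "DAIRY", 3: "GROCERY"}

# Naive approach: materialize the join, then aggregate per family.
joined = [(items[i], y) for (i, y) in sales]
naive = {}
for fam, y in joined:
    naive[fam] = naive.get(fam, 0.0) + y

# Factorized approach: first aggregate the fact table per join key,
# then combine the partial sums with the (much smaller) dimension table.
partial = {}
for i, y in sales:
    partial[i] = partial.get(i, 0.0) + y
factorized = {}
for i, s in partial.items():
    fam = items[i]
    factorized[fam] = factorized.get(fam, 0.0) + s

assert naive == factorized  # same answer, join never enumerated row by row
print(factorized)  # {'GROCERY': 17.0, 'DAIRY': 7.0}
```

The factorized version touches each fact-table row once and each dimension row once, instead of producing one joined row per fact-table row; on real star schemas this asymmetry is where the speedup comes from.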
Start JoinBoost
The easiest way to install JoinBoost is using pip:
pip install joinboost
JoinBoost's APIs are similar to scikit-learn, XGBoost, and LightGBM. The main difference is that a JoinBoost dataset is specified by a database connector and a join graph schema. Below, we specify a join graph over two tables, sales and items:
import duckdb
from joinboost.joingraph import JoinGraph
from joinboost.app import DecisionTree
# Connect to a DuckDB database file
con = duckdb.connect(database='duckdb')

# Build the join graph: sales holds the target column, items holds features
dataset = JoinGraph(con)
dataset.add_relation("sales", [], y='total_sales')
dataset.add_relation("items", ["family", "class", "perishable"])
dataset.add_join("sales", "items", ["item_nbr"], ["item_nbr"])

# Train a decision tree over the (virtual) join result
reg = DecisionTree(learning_rate=1, max_leaves=8)
reg.fit(dataset)
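Conceptually, the join graph above describes the training table that a conventional workflow would build with an explicit SQL join: features from items, the target from sales. The sketch below shows that joined table using sqlite3 (from the standard library, for portability) with made-up toy rows; this is precisely the intermediate result that JoinBoost avoids materializing.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (item_nbr INTEGER, total_sales REAL)")
con.execute("CREATE TABLE items (item_nbr INTEGER, family TEXT, "
            "class INTEGER, perishable INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 10.0), (2, 7.5)])
con.executemany("INSERT INTO items VALUES (?, ?, ?, ?)",
                [(1, "GROCERY", 1045, 0), (2, "DAIRY", 2712, 1)])

# The conceptual training table: one row per sale, features joined in
rows = con.execute("""
    SELECT i.family, i.class, i.perishable, s.total_sales
    FROM sales s JOIN items i ON s.item_nbr = i.item_nbr
    ORDER BY s.item_nbr
""").fetchall()
print(rows)  # [('GROCERY', 1045, 0, 10.0), ('DAIRY', 2712, 1, 7.5)]
```

With many dimension tables and a large fact table, this joined table can be far wider and larger than its inputs; JoinBoost instead pushes the training aggregates down to the base tables.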
Please check out this notebook for a demo.
Reproducibility
The technical report for JoinBoost can be found under the /technical directory.
We note that some optimizations discussed in the paper (e.g., inter-query parallelism, DP) are still under development in the main codebase. To reproduce the experimental results from the paper, we include the prototype code for JoinBoost under the /proto folder, which includes all the optimizations. We also include Jupyter notebooks to help you use this code to train models over Favorita.
The Favorita dataset is too large to store on GitHub. Please download the files from https://www.dropbox.com/sh/ymwn98pvederw6x/AAC-z6R_rKvU40KZDCyitjsda?dl=0 and uncompress them.
Hashes for joinboost-0.0.151-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | ad59b99e25a1f5ffa7257fd736281708d44e05a0659fc5a656bae66cb134a975
MD5 | ac13a9f3100329dbf14b74672fff2aed
BLAKE2b-256 | 233d70cb277f727b8683c189078347ff9f9e0944aa1a7ddfc9d465db83546d0f