JoinBoost: In-Database Tree-Models over Many Tables

JoinBoost is a Python library to help you train tree-models (decision trees, gradient boosting, random forests).

Note that many features of JoinBoost are still under development. If you are interested in using JoinBoost, we are happy to provide direct support. You can contact us through GitHub issues or by email at zh2408@columbia.edu.

Why JoinBoost?

JoinBoost's algorithms follow LightGBM. However, JoinBoost trains models:

  1. Inside the database: JoinBoost translates ML algorithms into SQL queries and executes these queries directly in your database. This means:
    • Safety: Data never leaves the database.
    • Transformation: You can do OLAP and data transformation directly in SQL.
    • Scalability: In-database ML is natively out-of-core, and JoinBoost can be connected to distributed databases.
  2. Over many tables: JoinBoost applies Factorized Learning with optimized algorithms. As a result, JoinBoost trains a model over the join result of many tables without materializing the join. This avoids the cost of computing and storing the full join, and you do not need to flatten your tables before training.
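To make the idea of training over a join without materializing it more concrete, here is a minimal sketch of the underlying trick: aggregates such as COUNT and SUM can be computed per join key first and then combined through the join. It uses Python's built-in sqlite3 module and a made-up toy dataset. The table contents and the queries are illustrative assumptions, not the SQL that JoinBoost actually generates.

```python
import sqlite3

# Toy data (hypothetical): a large fact table "sales" and a small
# dimension table "items", joined on item_nbr.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE items (item_nbr INTEGER PRIMARY KEY, family TEXT);
    CREATE TABLE sales (item_nbr INTEGER, total_sales REAL);
    INSERT INTO items VALUES (1, 'DAIRY'), (2, 'DAIRY'), (3, 'BREAD');
    INSERT INTO sales VALUES (1, 10.0), (1, 20.0), (2, 30.0), (3, 5.0), (3, 7.0);
""")

# Naive plan: aggregate over every row of the join result.
naive = con.execute("""
    SELECT i.family, COUNT(*) AS cnt, SUM(f.total_sales) AS total
    FROM sales AS f JOIN items AS i ON f.item_nbr = i.item_nbr
    GROUP BY i.family
    ORDER BY i.family
""").fetchall()

# Factorized plan: pre-aggregate sales per join key, then join the
# (much smaller) aggregated result with items and combine the partial sums.
factorized = con.execute("""
    WITH sales_agg AS (
        SELECT item_nbr, COUNT(*) AS cnt, SUM(total_sales) AS total
        FROM sales
        GROUP BY item_nbr
    )
    SELECT i.family, SUM(a.cnt) AS cnt, SUM(a.total) AS total
    FROM sales_agg AS a JOIN items AS i ON a.item_nbr = i.item_nbr
    GROUP BY i.family
    ORDER BY i.family
""").fetchall()

print(naive)
print(factorized)
```

Both queries return the same per-family counts and sums, but the second one only ever joins one row per item_nbr. Counts and sums of the target are the kind of statistics a tree learner needs to score candidate splits, which is why this rewriting matters for training.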

Start JoinBoost

The easiest way to install JoinBoost is using pip:

pip install joinboost

JoinBoost APIs are similar to those of scikit-learn, XGBoost and LightGBM. The main difference is that a JoinBoost dataset is specified by a database connector and a join graph schema. Below, we specify a join graph over two tables, sales and items:

import duckdb
from joinboost.joingraph import JoinGraph
from joinboost.app import DecisionTree

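# DuckDB connector: opens (or creates) a database file named 'duckdb'
con = duckdb.connect(database='duckdb')

# Describe the dataset as a join graph instead of a single flat table
dataset = JoinGraph(con)
# sales provides the target column total_sales and no features
dataset.add_relation("sales", [], y = 'total_sales')
# items provides three features
dataset.add_relation("items", ["family","class","perishable"])
# sales and items are joined on item_nbr
dataset.add_join("sales", "items", ["item_nbr"], ["item_nbr"])

# Train a single decision tree with at most 8 leaves
reg = DecisionTree(learning_rate=1, max_leaves=8)
reg.fit(dataset)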
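To give a feel for how tree training can be phrased as SQL, the sketch below picks the best one-vs-rest split on the family feature of a regression tree using only per-value counts and sums of the target. It uses Python's built-in sqlite3 module with made-up data. The queries, the gain helper and the toy rows are assumptions for illustration and are not what reg.fit runs internally.

```python
import sqlite3

# Hypothetical toy tables with the same names as in the example above.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE items (item_nbr INTEGER PRIMARY KEY, family TEXT);
    CREATE TABLE sales (item_nbr INTEGER, total_sales REAL);
    INSERT INTO items VALUES (1, 'DAIRY'), (2, 'DAIRY'), (3, 'BREAD'), (4, 'PRODUCE');
    INSERT INTO sales VALUES (1, 10.0), (1, 20.0), (2, 30.0),
                             (3, 5.0), (3, 7.0), (4, 100.0), (4, 90.0);
""")

# One aggregation query yields, per feature value, the count and the sum
# of the target. These are sufficient statistics for squared-error splits.
rows = con.execute("""
    SELECT i.family, COUNT(*), SUM(f.total_sales)
    FROM sales AS f JOIN items AS i ON f.item_nbr = i.item_nbr
    GROUP BY i.family
""").fetchall()

total_cnt = sum(cnt for _, cnt, _ in rows)
total_sum = sum(s for _, _, s in rows)


def gain(cnt, s):
    """Reduction in squared error when splitting (family = v) from the rest."""
    rest_cnt, rest_sum = total_cnt - cnt, total_sum - s
    if cnt == 0 or rest_cnt == 0:
        return 0.0
    return s * s / cnt + rest_sum * rest_sum / rest_cnt - total_sum * total_sum / total_cnt


# Score every candidate split and keep the best one.
best_family, best_gain = max(
    ((family, gain(cnt, s)) for family, cnt, s in rows),
    key=lambda pair: pair[1],
)
print(best_family, best_gain)
```

On this toy data the best split isolates PRODUCE, whose sales are far above the rest. The point of the sketch is that the database only returns one small row per feature value, so the raw data never has to leave it.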

Please check out this notebook for a demo.

For development, you can open the repository in Gitpod: https://gitpod.io/new#https://github.com/zachary62/JoinBoost

Docs

Documentation is currently under development. To build the docs locally, install Sphinx and run

make html

in the docs folder. The docs will be generated in docs/build/html.

Reproducibility

The technical report of JoinBoost can be found in the /technical directory.

Some optimizations discussed in the paper (e.g., inter-query parallelism, DP) are still under development in the main code base. To reproduce the experimental results from the paper, we include the prototype code for JoinBoost in the /proto folder, which includes all the optimizations. We also include a Jupyter notebook to help you use this code to train models over Favorita.

The Favorita dataset is too large to store on GitHub. Please download the files from https://www.dropbox.com/sh/ymwn98pvederw6x/AAC-z6R_rKvU40KZDCyitjsda?dl=0 and uncompress them.

The modified version of DuckDB that supports column swap is available at https://anonymous.4open.science/r/duckdb-D056.

