
Build random forests using CUDA GPUs.

Description

Build random forests for large data sets using CUDA. This is the GPU-enabled version of brif. The same program is available on CRAN for R users.

Build from source

Prerequisites

An NVIDIA graphics/compute card must be present, and the CUDA Toolkit must be installed.

On Windows, the Microsoft Visual Studio Build Tools for C++ must be installed. On Linux and macOS, a C++ toolchain (e.g., gcc) is required.
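A quick way to verify the CUDA prerequisite is to check that the CUDA compiler is reachable; this is a minimal sketch, assuming the Toolkit's bin directory was added to PATH during installation:

import shutil
import subprocess

# Look for the CUDA compiler on PATH (installed with the CUDA Toolkit).
if shutil.which("nvcc") is None:
    print("nvcc not found; is the CUDA Toolkit installed and on PATH?")
else:
    print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)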

The Python build package is required; it can be installed via

pip install build

The pandas and numpy packages are also required; they can be installed via

pip install pandas numpy

Build and install on Windows

Clone (or download as zip and extract) this project to a local directory.

From the Windows search bar, find and run (as administrator) the "x64 Native Tools Command Prompt for VS 2022". In the command window that opens, cd into the project root directory and run

mkdir build
cd build
cmake ../

If successful, the file cubrif.sln (among other files) will be generated. Then run

MSBuild.exe cubrif.sln /p:Configuration=Release

If successful, several files will be created in the Release subfolder. The important ones are cubrif.lib, cubrif.dll, and cubrif_main.exe: cubrif.lib is used when building the Python package, cubrif.dll is needed at runtime, and cubrif_main.exe is a standalone executable.

Copy cubrif.lib to the project root directory:

copy Release\cubrif.lib ..\

Now go back to the project root and build the Python package:

cd ..
python -m build

If successful, the package, e.g., cubrif-1.4.0.tar.gz, will be created in the dist subfolder.

Install the package by

pip install dist/cubrif-1.4.0.tar.gz

To use the package, cubrif.dll must be visible to Python, for example:

import os
os.add_dll_directory("C:/path/to/project/build/Release")
from cubrif import cubrif
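Since os.add_dll_directory() exists only on Windows, a script that may also run elsewhere can guard the call; the directory below is a placeholder for wherever cubrif.dll was built:

import os

# Placeholder path: point this at the folder containing cubrif.dll.
dll_dir = r"C:\path\to\project\build\Release"
if hasattr(os, "add_dll_directory") and os.path.isdir(dll_dir):
    os.add_dll_directory(dll_dir)

from cubrif import cubrif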

Build and install on Ubuntu

The build process is similar, but use make instead of MSBuild.exe, and the shared library generated will be libcubrif.so instead of cubrif.dll.

mkdir build
cd build
cmake ../
make
cp libcubrif.so ../
cd ..
python3 -m build
pip install dist/cubrif-1.4.0.tar.gz

In the above step, if "python3 -m build" does not work, use the equivalent command

python3 setup.py sdist bdist_wheel

To use the package, libcubrif.so must be visible to Python's dynamic loader (os.add_dll_directory() is available on Windows only, so it does not apply here). Either copy libcubrif.so to /usr/lib, or add its directory to the loader's search path. For example,

sudo cp libcubrif.so /usr/lib
sudo ldconfig

or, in the shell before launching Python,

export LD_LIBRARY_PATH=/path/to/project:$LD_LIBRARY_PATH
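To check that the loader can now locate the library, ctypes offers a quick sanity test; this check is illustrative and not part of the cubrif API:

import ctypes.util

# Prints a path such as /usr/lib/libcubrif.so once the library is visible,
# or None if the loader cannot find it.
print(ctypes.util.find_library("cubrif"))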

Usage Examples

from cubrif import cubrif
import pandas as pd

# Create a cubrif object with default parameters.
bf = cubrif.cubrif()  

# Display the current parameter values. 
bf.get_param()  

# To change certain parameter values, e.g.:
bf.set_param({'ntrees':10, 'nthreads':2, 'GPU':1})  

# Or simply:
bf.ntrees = 50

# Load input data frame. Data must be a pandas data frame with appropriate headers.
df = pd.read_csv("auto.csv")

# Train the model
bf.fit(df, 'origin')  # specify the target column name

# Or equivalently
bf.fit(df, 7)  # specify the target column index

# Make predictions 
# The target variable column must be excluded, and all other columns should appear in the same order as in training
# Here, predict the first 10 rows of df
pred_labels = bf.predict(df.iloc[0:10, 0:7], type='class')  # return a list containing the predicted class labels
pred_scores = bf.predict(df.iloc[0:10, 0:7], type='score')  # return a data frame containing predicted probabilities by class

# Note: for a regression problem (i.e., when the response variable is numeric type), the predict function will always return a list containing the predicted values
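Putting the pieces together, here is a minimal end-to-end sketch that trains on part of the data and scores the rest. The split sizes and the accuracy computation are illustrative and not part of the cubrif API; auto.csv and the 'origin' target are carried over from the example above.

from cubrif import cubrif
import pandas as pd

df = pd.read_csv("auto.csv")
train = df.iloc[:300]   # illustrative split: first 300 rows for training
test = df.iloc[300:]    # remaining rows held out for testing

bf = cubrif.cubrif()
bf.ntrees = 100
bf.fit(train, 'origin')

# Exclude the target column; the remaining columns keep their training order.
X_test = test.drop(columns=['origin'])
pred = bf.predict(X_test, type='class')

# Illustrative accuracy check; predicted labels may need casting to match
# the dtype of the original column.
acc = (pd.Series(pred, index=test.index).astype(str) == test['origin'].astype(str)).mean()
print(f"test accuracy: {acc:.3f}")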

Parameters

tmp_preddata a character string specifying a filename to save the temporary scoring data. Default is "tmp_brif_preddata.txt".

n_numeric_cuts an integer value indicating the maximum number of split points to generate for each numeric variable.

n_integer_cuts an integer value indicating the maximum number of split points to generate for each integer variable.

max_integer_classes an integer value. If the target variable is integer and has more than max_integer_classes unique values in the training data, then the target variable will be grouped into max_integer_classes bins. If the target variable is numeric, then min(max_integer_classes, number of unique values) bins will be created on the target variable, and the regression problem will be solved as a classification problem.

max_depth an integer specifying the maximum depth of each tree. Maximum is 40.

min_node_size an integer specifying the minimum number of training cases a leaf node must contain.

ntrees an integer specifying the number of trees in the forest.

ps an integer indicating the number of predictors to sample at each node split. Default is 0, meaning to use sqrt(p), where p is the number of predictors in the input.

max_factor_levels an integer. If any factor variable has more than max_factor_levels levels, the program stops and prompts the user to increase the value of this parameter if such a high-cardinality factor is indeed intended.

seed a positive integer, random number generator seed.

nthreads an integer specifying the number of threads used by the program. This parameter takes effect only on systems supporting OpenMP.

blocksize an integer specifying the CUDA thread block size. Must be a multiple of 64, and no more than 1024.

GPU an integer (0, 1 or 2). 0: do not use the GPU (for small datasets, e.g., fewer than 100,000 rows, using the GPU is slower). 1: always use the GPU. 2: use the GPU to evaluate splits only when the node size is greater than or equal to n_lb_GPU.

n_lb_GPU an integer specifying the threshold number of rows in the training data to use GPU for training. This parameter takes effect only when GPU = 2.

vote_method an integer (0 or 1) specifying the voting method in prediction. 0: each leaf contributes the raw count and an average is taken on the sum over all leaves; 1: each leaf contributes an intra-node fraction which is then averaged over all leaves with equal weight.

na_numeric a numeric value, substitute for 'nan' in numeric variables.

na_integer an integer value, substitute for 'nan' in integer variables.

na_factor a character string, substitute for missing values in factor variables.

type a character string indicating the return content of the predict function. For a classification problem, "score" means the by-class probabilities and "class" means the class labels (i.e., the target variable levels). For regression, the predicted values are returned. This is a parameter for the predict function, not an attribute of the brif object.
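As a concrete example, the GPU-related parameters above can be set together via set_param(); the values below are illustrative, not recommendations:

from cubrif import cubrif

bf = cubrif.cubrif()
bf.set_param({
    'ntrees': 100,       # number of trees in the forest
    'max_depth': 20,     # maximum tree depth (cap is 40)
    'blocksize': 128,    # CUDA thread block size: a multiple of 64, at most 1024
    'GPU': 2,            # use the GPU only for sufficiently large nodes
    'n_lb_GPU': 100000,  # size threshold for GPU split evaluation when GPU = 2
    'seed': 42,          # random number generator seed
})
bf.get_param()           # verify the new settings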

