Code for Kaggle Data Science Competitions.

Kaggler

Kaggler is a Python package for lightweight online machine learning algorithms and utility functions for ETL and data analysis. It is distributed under version 3 of the GNU General Public License.

Its online learning algorithms are inspired by Kaggle user tinrtgu's code. It uses a sparse input format that handles large sparse data efficiently. The core code is written in Cython for speed.

Installation

Dependencies

The required Python packages are listed in requirements.txt:

cython
h5py
numpy/scipy
pandas
scikit-learn
ml_metrics

Using pip

The package is available on PyPI and can be installed with pip:

(sudo) pip install -U Kaggler

If installation fails because it cannot find MurmurHash3.h, please add . to LD_LIBRARY_PATH as described here.

From source code

To install from source code:

python setup.py build_ext --inplace
sudo python setup.py install

Data I/O

Kaggler supports CSV (.csv), LibSVM (.sps), and HDF5 (.h5) file formats:

# CSV format: target,feature1,feature2,...
1,1,0,0,1,0.5
0,0,1,0,0,5

# LibSVM format: target feature-index1:feature-value1 feature-index2:feature-value2
1 1:1 4:1 5:0.5
0 2:1 5:1

# HDF5
- issparse: binary flag indicating whether it stores sparse data or not.
- target: stores a target variable as a numpy.array
- shape: available only if issparse == 1. shape of scipy.sparse.csr_matrix
- indices: available only if issparse == 1. indices of scipy.sparse.csr_matrix
- indptr: available only if issparse == 1. indptr of scipy.sparse.csr_matrix
- data: dense feature matrix if issparse == 0 else data of scipy.sparse.csr_matrix
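The sparse HDF5 fields above mirror the three arrays of a scipy.sparse.csr_matrix. As a rough illustration of what each one holds, the sketch below builds them in plain Python for the two CSV rows shown earlier. `dense_to_csr` is a hypothetical helper written for this example, not part of Kaggler.

```python
def dense_to_csr(rows):
    """Return (data, indices, indptr, shape) for a list of dense rows."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # non-zero values, row by row
                indices.append(col)   # 0-based column of each value
        indptr.append(len(data))      # offset where the next row starts
    n_cols = len(rows[0]) if rows else 0
    return data, indices, indptr, (len(rows), n_cols)


# Feature columns of the two CSV example rows (target column removed).
rows = [[1, 0, 0, 1, 0.5],
        [0, 1, 0, 0, 5]]
data, indices, indptr, shape = dense_to_csr(rows)
# data    == [1, 1, 0.5, 1, 5]
# indices == [0, 3, 4, 1, 4]
# indptr  == [0, 3, 5]
# shape   == (2, 5)
```

Row i occupies data[indptr[i]:indptr[i + 1]], which is why indptr has one more entry than there are rows.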

from kaggler.data_io import load_data, save_data

X, y = load_data('train.csv')	# use the first column as a target variable
X, y = load_data('train.h5')	# load the feature matrix and target vector from a HDF5 file.
X, y = load_data('train.sps')	# load the feature matrix and target vector from LibSVM file.

save_data(X, y, 'train.csv')
save_data(X, y, 'train.h5')
save_data(X, y, 'train.sps')
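To make the relationship between the CSV and LibSVM layouts described above concrete, here is a small plain-Python sketch that rewrites one CSV row as a LibSVM row. `csv_row_to_libsvm` is a hypothetical helper for illustration only; it assumes the first CSV column is the target and that LibSVM feature indices start at 1, as in the examples above.

```python
def csv_row_to_libsvm(line):
    """Convert 'target,f1,f2,...' into 'target idx:value ...' (1-based idx)."""
    target, *features = line.strip().split(',')
    pairs = []
    for idx, raw in enumerate(features, start=1):
        value = float(raw)
        if value != 0.0:              # zeros are simply omitted
            pairs.append('{}:{:g}'.format(idx, value))
    return ' '.join([target] + pairs)


print(csv_row_to_libsvm('1,1,0,0,1,0.5'))   # 1 1:1 4:1 5:0.5
```

Only the non-zero entries are written, which is what keeps the LibSVM file small when most features are zero.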

Feature Engineering

One-hot, label, and target encoding

import numpy as np
import pandas as pd
from kaggler.preprocessing import OneHotEncoder, LabelEncoder, TargetEncoder

trn = pd.read_csv('train.csv')
target_col = trn.columns[-1]
cat_cols = [col for col in trn.columns if trn[col].dtype == object]

ohe = OneHotEncoder(min_obs=100) # group all categories with fewer than 100 occurrences
lbe = LabelEncoder(min_obs=100)  # group all categories with fewer than 100 occurrences
te = TargetEncoder()             # replace each category with the average target value for the category

X_trn = ohe.fit_transform(trn[cat_cols])	# X_trn is a scipy sparse matrix
trn.loc[:, cat_cols] = lbe.fit_transform(trn[cat_cols])
trn.loc[:, cat_cols] = te.fit_transform(trn[cat_cols], trn[target_col])

tst = pd.read_csv('test.csv')
X_tst = ohe.transform(tst[cat_cols])
tst.loc[:, cat_cols] = lbe.transform(tst[cat_cols])
tst.loc[:, cat_cols] = te.transform(tst[cat_cols])
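The comments above describe two ideas: grouping rare categories under min_obs, and replacing a category with its average target. The sketch below shows both in plain Python for a single column. All function names are hypothetical, and the choices made here (rare and unseen categories share label 0, unseen categories get the global mean) are assumptions for illustration, not necessarily what Kaggler's encoders do.

```python
from collections import Counter, defaultdict


def fit_label_encoder(values, min_obs=1):
    """Map categories seen at least min_obs times to 1..K."""
    counts = Counter(values)
    frequent = sorted(c for c, k in counts.items() if k >= min_obs)
    return {c: i for i, c in enumerate(frequent, start=1)}


def label_encode(values, mapping):
    # Rare and unseen categories fall back to the shared label 0.
    return [mapping.get(v, 0) for v in values]


def fit_target_encoder(values, targets):
    """Return (per-category mean target, global mean target)."""
    sums, counts = defaultdict(float), Counter()
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    means = {v: sums[v] / counts[v] for v in counts}
    return means, sum(targets) / len(targets)


def target_encode(values, means, default):
    # Unseen categories get the global mean target.
    return [means.get(v, default) for v in values]


colors = ['red', 'red', 'blue', 'red', 'green', 'blue']
clicks = [1, 0, 1, 1, 0, 0]

mapping = fit_label_encoder(colors, min_obs=2)        # 'green' is too rare
labels = label_encode(colors + ['purple'], mapping)   # [2, 2, 1, 2, 0, 1, 0]

means, global_mean = fit_target_encoder(colors, clicks)
encoded = target_encode(['red', 'blue', 'purple'], means, global_mean)
```

Fitting on the training data and reusing the fitted mapping on the test data, as the fit_transform/transform calls above do, keeps the encoding consistent between the two sets.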

Ensemble

Netflix Blending

import numpy as np
from kaggler.ensemble import netflix
from kaggler.metrics import rmse

# Load the predictions of input models for ensemble
p1 = np.loadtxt('model1_prediction.txt')
p2 = np.loadtxt('model2_prediction.txt')
p3 = np.loadtxt('model3_prediction.txt')

# Calculate RMSEs of model predictions and all-zero prediction.
# At a competition, RMSEs (or RMSLEs) of submissions can be used.
y = np.loadtxt('target.txt')
e0 = rmse(y, np.zeros_like(y))
e1 = rmse(y, p1)
e2 = rmse(y, p2)
e3 = rmse(y, p3)

p, w = netflix([e1, e2, e3], [p1, p2, p3], e0, l=0.0001) # l is an optional regularization parameter.
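The call above needs only the RMSE of each model and of an all-zero prediction, not the targets themselves. One way this can work is sketched below: since ||p - y||^2 = ||p||^2 + ||y||^2 - 2 p.y, the dot product p.y can be recovered from RMSEs, which is all a ridge regression of y on the predictions requires. The names `blend_from_rmse` and `solve`, and the l * n scaling of the regularizer, are assumptions made for this sketch rather than Kaggler's actual implementation.

```python
import math


def rmse(y, p):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, p)) / len(y))


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination (A must be non-singular)."""
    m = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(m):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][m] / M[i][i] for i in range(m)]


def blend_from_rmse(errors, preds, e0, l=0.0001):
    """Ridge-regression blend using RMSEs in place of the target vector."""
    n, m = len(preds[0]), len(preds)
    # n * e0^2 equals ||y||^2 and n * e^2 equals ||p - y||^2, so this is p.y.
    pty = [0.5 * (sum(v * v for v in p) + n * e0 ** 2 - n * e ** 2)
           for p, e in zip(preds, errors)]
    # Gram matrix of the predictions with l * n added on the diagonal.
    gram = [[sum(a * b for a, b in zip(preds[i], preds[j]))
             + (l * n if i == j else 0.0) for j in range(m)]
            for i in range(m)]
    w = solve(gram, pty)
    blended = [sum(w[j] * preds[j][i] for j in range(m)) for i in range(n)]
    return blended, w


y = [1.0, 2.0, 3.0, 4.0]
p1 = [1.0, 2.0, 3.0, 5.0]
p2 = [0.0, 2.0, 3.0, 4.0]
e0 = rmse(y, [0.0] * len(y))
p, w = blend_from_rmse([rmse(y, p1), rmse(y, p2)], [p1, p2], e0, l=0.0)
```

With l=0 this is an ordinary least-squares fit, so the blend can never have a higher RMSE than any single input model on the data the RMSEs were measured on. A positive l shrinks the weights, which helps when the input models are highly correlated.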

Algorithms

The following algorithms are currently available:

Online learning algorithms

Stochastic Gradient Descent (SGD)
Follow-the-Regularized-Leader (FTRL)
Factorization Machine (FM)
Neural Networks (NN) - with a single (NN) or two (NN_H2) ReLU hidden layers
Decision Tree

Batch learning algorithm

Neural Networks (NN) - with a single hidden layer and L-BFGS optimization
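To give a feel for one of the online algorithms listed above, here is a minimal plain-Python sketch of the widely published FTRL-Proximal per-coordinate update for logistic regression. The class name `FTRLSketch` and its dictionary-based storage are assumptions for illustration; Kaggler's own FTRL is written in Cython and may differ in detail.

```python
import math


class FTRLSketch:
    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}   # per-coordinate adjusted gradient sums
        self.n = {}   # per-coordinate sums of squared gradients

    def _weight(self, i):
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:          # L1 keeps small coordinates at zero
            return 0.0
        sign = 1.0 if z > 0 else -1.0
        denom = (self.beta + math.sqrt(self.n.get(i, 0.0))) / self.alpha + self.l2
        return -(z - sign * self.l1) / denom

    def predict_one(self, x):
        """x maps feature index to value; returns P(y = 1)."""
        score = sum(self._weight(i) * v for i, v in x.items())
        score = max(min(score, 35.0), -35.0)   # avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-score))

    def update_one(self, x, error):
        """error is prediction minus target, i.e. the log-loss gradient."""
        for i, v in x.items():
            g = error * v
            n = self.n.get(i, 0.0)
            sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / self.alpha
            self.z[i] = self.z.get(i, 0.0) + g - sigma * self._weight(i)
            self.n[i] = n + g * g


model = FTRLSketch(alpha=0.1, beta=1.0, l1=0.0, l2=0.0)
for _ in range(50):
    for x, y in [({0: 1.0}, 1), ({1: 1.0}, 0)]:
        p = model.predict_one(x)
        model.update_one(x, p - y)
```

Each coordinate gets its own effective learning rate, which shrinks as that coordinate accumulates gradient. This is why FTRL often suits sparse data where some features appear far more often than others.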

Examples

from kaggler.online_model import SGD, FTRL, FM, NN

# SGD
clf = SGD(a=.01,                # learning rate
          l1=1e-6,              # L1 regularization parameter
          l2=1e-6,              # L2 regularization parameter
          n=2**20,              # number of hashed features
          epoch=10,             # number of epochs
          interaction=True)     # use feature interaction or not

# FTRL
clf = FTRL(a=.1,                # alpha in the per-coordinate rate
           b=1,                 # beta in the per-coordinate rate
           l1=1.,               # L1 regularization parameter
           l2=1.,               # L2 regularization parameter
           n=2**20,             # number of hashed features
           epoch=1,             # number of epochs
           interaction=True)    # use feature interaction or not

# FM
clf = FM(n=1e5,                 # number of features
         epoch=100,             # number of epochs
         dim=4,                 # size of factors for interactions
         a=.01)                 # learning rate

# NN
clf = NN(n=1e5,                 # number of features
         epoch=10,              # number of epochs
         h=16,                  # number of hidden units
         a=.1,                  # learning rate
         l2=1e-6)               # L2 regularization parameter

# online training and prediction directly with a libsvm file
for x, y in clf.read_sparse('train.sparse'):
    p = clf.predict_one(x)      # predict for an input
    clf.update_one(x, p - y)    # update the model with the prediction error

for x, _ in clf.read_sparse('test.sparse'):
    p = clf.predict_one(x)

# online training and prediction with a scipy sparse matrix
from kaggler.data_io import load_data

X, y = load_data('train.sps')

clf.fit(X, y)
p = clf.predict(X)
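The n ("number of hashed features") and interaction parameters above point to the hashing trick: feature names are hashed into a fixed-size weight vector, optionally together with pairwise crosses. The sketch below illustrates the idea with the same predict_one / update_one loop. `hash_features` and `HashedSGD` are hypothetical names, and zlib.crc32 is used here only as a reproducible stand-in for the MurmurHash3 function mentioned in the installation notes.

```python
import math
import zlib
from itertools import combinations


def hash_features(tokens, n=2 ** 20, interaction=False):
    """Map string tokens to indices in [0, n), optionally adding pairwise crosses."""
    keys = list(tokens)
    if interaction:
        keys += [a + '_x_' + b for a, b in combinations(sorted(tokens), 2)]
    # Colliding keys share an index; the set removes duplicates.
    return sorted({zlib.crc32(k.encode('utf-8')) % n for k in keys})


class HashedSGD:
    """Logistic regression over hashed binary features, trained one row at a time."""

    def __init__(self, a=0.01, n=2 ** 20):
        self.a = a
        self.w = [0.0] * n

    def predict_one(self, x):
        score = sum(self.w[i] for i in x)
        score = max(min(score, 35.0), -35.0)   # avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-score))

    def update_one(self, x, error):
        # Gradient of the log loss for a binary feature is just the error.
        for i in x:
            self.w[i] -= self.a * error


x = hash_features(['site=a', 'ad=7'], n=2 ** 10, interaction=True)
clf = HashedSGD(a=0.1, n=2 ** 10)
for _ in range(20):
    p = clf.predict_one(x)
    clf.update_one(x, p - 1)      # this row has target 1
```

Memory use is fixed by n regardless of how many distinct feature names appear. The cost is that unrelated features can collide, which becomes more likely as n gets smaller.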

Documentation

Package documentation is available here.

Download files

Download the file for your platform.

Source Distribution

Kaggler-0.6.8.tar.gz (792.0 kB), uploaded Apr 9, 2019, Source

Algorithm	Hash digest
SHA256	`72b4d465a36502f5af675f76ec52fe89347b91b109ac95d640cffe414972ad1c`
MD5	`6dbeb81f64c971fc48580eba1ae55b3b`
BLAKE2b-256	`15cc4f895e9dfdf50c8ea93c1187202634e5f68e367df9fc924c962cba804970`

Built Distributions

Kaggler-0.6.8-cp36-cp36m-macosx_10_7_x86_64.whl (585.3 kB), uploaded Apr 9, 2019, CPython 3.6m, macOS 10.7+ x86-64

Algorithm	Hash digest
SHA256	`11a231baef187f0612fc33f23d70b9e44c9d0a4ea52079f60d1c759760df75cc`
MD5	`2fabd47f7d306495940e2799cd47ebff`
BLAKE2b-256	`0a1b07c3b4523aaf8ecb51b617d700db363c0b6aee5cee66ef9564af70d12aa0`

Kaggler-0.6.8-cp27-cp27m-macosx_10_14_x86_64.whl (593.6 kB), uploaded Apr 9, 2019, CPython 2.7m, macOS 10.14+ x86-64

Algorithm	Hash digest
SHA256	`d36247aa756687ae1be71bd44a445e5ae960362c9be16db88f986f0efd576782`
MD5	`aa5cb8a308630e5230f209af783cedf7`
BLAKE2b-256	`5265c9215569ed4cedbd883a8be5322406cf710b91dd3b601fe86c8aadd17cb7`