Code for Kaggle and Offline Competitions.

These details have not been verified by PyPI

Project links

Homepage

Project description

nyaggle

nyaggle is a utility library for Kaggle and offline competitions, particularly focused on feature engineering and validation. See the documentation for details.

Feature Engineering
- K-Fold Target Encoding
- BERT Sentence Vectorization
Model Validation
- CV with OOF
- Adversarial Validation
Experiment
- Minimal experiment logging which can be combined with mlflow
- GBDT experiment wrapper
  - Output CV score, submission.csv, OOF, importance plot at once
Ensemble
- Blending

Installation

You can install nyaggle via pip:

$ pip install nyaggle

Examples

Feature Engineering

Target Encoding with K-Fold

import pandas as pd
import numpy as np

from sklearn.model_selection import KFold
from nyaggle.feature.category_encoder import TargetEncoder


train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
all_df = pd.concat([train, test]).copy()

# np.object was removed in NumPy 1.24; compare against the builtin object instead
cat_cols = [c for c in train.columns if train[c].dtype == object]
target_col = 'y'

kf = KFold(5)

# Target encoding with K-fold
te = TargetEncoder(split=kf.split(train))

# fit on the training data with fit/fit_transform, then apply transform to the test data
train.loc[:, cat_cols] = te.fit_transform(train[cat_cols], train[target_col])
test.loc[:, cat_cols] = te.transform(test[cat_cols])

# ... or just call fit_transform on the concatenated data
all_df.loc[:, cat_cols] = te.fit_transform(all_df[cat_cols], all_df[target_col])
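
To make the idea behind K-fold target encoding concrete, here is a minimal sketch in plain Python. It is not nyaggle's implementation: the function name kfold_target_encode, the contiguous unshuffled folds, and the fallback to the global mean for unseen categories are all assumptions made for illustration. Each row is encoded with the mean target of its category computed on the other folds only, so a row's own target never leaks into its feature.

```python
from collections import defaultdict


def kfold_target_encode(categories, targets, n_splits=5):
    """Encode each category with the mean target computed on the other folds.

    Hypothetical helper for illustration only; not part of nyaggle.
    """
    n = len(categories)
    global_mean = sum(targets) / n
    encoded = [0.0] * n

    for fold in range(n_splits):
        # Contiguous, unshuffled folds: rows [start, stop) are held out.
        start = fold * n // n_splits
        stop = (fold + 1) * n // n_splits

        # Accumulate per-category target sums and counts on the remaining rows.
        sums, counts = defaultdict(float), defaultdict(int)
        for i in range(n):
            if not start <= i < stop:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1

        # Held-out rows get the out-of-fold mean, or the global mean when the
        # category does not appear in the other folds.
        for i in range(start, stop):
            c = categories[i]
            encoded[i] = sums[c] / counts[c] if counts[c] else global_mean
    return encoded


cats = ["a", "b", "a", "b", "a", "b"]
ys = [1, 0, 1, 0, 0, 1]
encoded = kfold_target_encode(cats, ys, n_splits=3)
print(encoded)  # [0.5, 0.5, 0.5, 0.5, 1.0, 0.0]
```

In the last fold, the row with category "a" and target 0 is encoded as 1.0, because only the two earlier "a" rows (both with target 1) contribute to its value.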

Text Vectorization using BERT

You need to install PyTorch in your virtual environment to use BertSentenceVectorizer. MeCab and mecab-python3 are also required if you use the Japanese BERT model.

import pandas as pd
from nyaggle.feature.nlp import BertSentenceVectorizer


train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
all_df = pd.concat([train, test]).copy()

text_cols = ['body']
target_col = 'y'
group_col = 'user_id'


# extract BERT-based sentence vector
bv = BertSentenceVectorizer(text_columns=text_cols)

text_vector = bv.fit_transform(train)


# BERT + SVD, with cuda
bv = BertSentenceVectorizer(text_columns=text_cols, use_cuda=True, n_components=40)

text_vector_svd = bv.fit_transform(train)

# Japanese BERT
bv = BertSentenceVectorizer(text_columns=text_cols, lang='jp')

japanese_text_vector = bv.fit_transform(train)

Model Validation

cross_validate() provides a handy API to compute K-fold CV scores, out-of-fold (OOF) predictions and test predictions in a single call. You can pass LGBMClassifier/LGBMRegressor or any other scikit-learn model.

import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score

from nyaggle.validation import cross_validate

X, y = make_classification(n_samples=1024, n_features=20, class_sep=0.98, random_state=0)

models = [LGBMClassifier(n_estimators=300) for _ in range(5)]

pred_oof, pred_test, scores = cross_validate(models, X[:512, :], y[:512], X[512:, :], nfolds=5,
                                             eval=roc_auc_score)
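
The three return values can be easier to picture with a toy example. The sketch below is plain Python and does not reproduce nyaggle's internals: the mean-predicting "model", the helper names, the mean absolute error metric, and the choice to average test predictions across fold models are all assumptions made for illustration. It shows that each training row receives exactly one prediction, from the model that did not see it, while every fold model contributes to the test prediction.

```python
def mean_model_fit(y_train):
    """Toy 'model' that predicts the mean of its training targets (hypothetical)."""
    return sum(y_train) / len(y_train)


def oof_cross_validate(y, n_test, nfolds=3):
    """Return OOF predictions, averaged test predictions and per-fold MAE scores."""
    n = len(y)
    pred_oof = [0.0] * n
    pred_test = [0.0] * n_test
    scores = []

    for fold in range(nfolds):
        start, stop = fold * n // nfolds, (fold + 1) * n // nfolds
        # Train on every row outside the validation fold.
        fitted = mean_model_fit(y[:start] + y[stop:])

        # Each training row is predicted once, by the model that never saw it.
        for i in range(start, stop):
            pred_oof[i] = fitted
        # Test predictions are averaged over the fold models.
        for j in range(n_test):
            pred_test[j] += fitted / nfolds

        # Score this fold with mean absolute error on its validation rows.
        fold_error = sum(abs(y[i] - fitted) for i in range(start, stop))
        scores.append(fold_error / (stop - start))
    return pred_oof, pred_test, scores


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
pred_oof, pred_test, scores = oof_cross_validate(y, n_test=2, nfolds=3)
print(pred_oof)  # [4.5, 4.5, 3.5, 3.5, 2.5, 2.5]
print(scores)    # [3.0, 0.5, 3.0]
```

Because pred_oof is built only from models that did not train on the corresponding rows, it can be scored against y as an estimate of generalization, or reused as an input feature for a second-stage model.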

Project details

Release history

- 0.1.6: Jul 13, 2023
- 0.1.5: Oct 30, 2021
- 0.1.4: Sep 9, 2020
- 0.1.3: May 28, 2020
- 0.1.2: Mar 1, 2020
- 0.1.1: Feb 17, 2020
- 0.1.0: Feb 6, 2020
- 0.0.11: Jan 27, 2020
- 0.0.10: Jan 25, 2020
- 0.0.9: Jan 25, 2020
- 0.0.8: Jan 24, 2020
- 0.0.7: Jan 23, 2020
- 0.0.6: Jan 23, 2020
- 0.0.5: Jan 14, 2020
- 0.0.4: Jan 9, 2020 (this version)
- 0.0.3: Dec 31, 2019
- 0.0.2: Dec 28, 2019
- 0.0.1: Dec 24, 2019

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

nyaggle-0.0.4.tar.gz (3.8 kB)

Uploaded Jan 9, 2020 Source

Built Distribution

nyaggle-0.0.4-py3-none-any.whl (4.5 kB)

Uploaded Jan 9, 2020 Python 3

Hashes for nyaggle-0.0.4.tar.gz
Algorithm	Hash digest
SHA256	`7c865c46f9108532a0af822ede86ede128eacf814933931b0409faf7bc0529aa`
MD5	`9e4d4808d1bac656f4269609c47a6e66`
BLAKE2b-256	`7a6812c7d7661549b0d2c781e7b6fd2037e95fbe715f58f624e5ab60cef90e78`

Hashes for nyaggle-0.0.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b858672b0d998a16b62ffc4c0d5bdead7fbfa8de74645391515d3d6222cfacaf`
MD5	`405af399c8a9620a4222c6ec85069be9`
BLAKE2b-256	`3e263e49967c731e410f0b421b4d4d374350d54eb8c1a5346e694c279edbe81c`
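
If you download one of these files manually, you can compare its digest with the SHA256 value listed above. The following is a minimal sketch using only Python's standard hashlib module; the helper names sha256_of_file and matches_published_hash are hypothetical and chosen for this example.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published one, ignoring case and padding."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, assuming the source distribution was saved in the current directory, matches_published_hash("nyaggle-0.0.4.tar.gz", "7c865c46f9108532a0af822ede86ede128eacf814933931b0409faf7bc0529aa") should return True for an intact download.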