xgboost-label-encoding

xgboost-label-encoding provides small sklearn-style wrappers around xgboost.XGBClassifier for classification workflows where the target labels are strings or other non-numeric values.

XGBoost trains on numeric class labels. This package encodes y during fit, trains the underlying XGBoost classifier, and decodes predictions back to the original labels. It is intended to be used as a drop-in estimator in places where manually applying sklearn.preprocessing.LabelEncoder to the target would be awkward.
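For comparison, the manual round trip that the wrapper is meant to replace can be sketched with sklearn.preprocessing.LabelEncoder directly. This is only an illustration of the encode/decode idea, not the package's internal code; the predicted codes below are stand-ins for the output of a fitted classifier.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# String targets, as they might appear in a real dataset.
y_train = np.array(["Healthy", "HIV", "Healthy"])

# Encode: LabelEncoder sorts the distinct labels and maps them to 0..n_classes-1.
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_train)

# Stand-in for integer predictions coming out of a numeric-only classifier.
predicted_codes = np.array([0, 1, 1])

# Decode: map the integer codes back to the original labels.
predicted_labels = encoder.inverse_transform(predicted_codes)

print(list(encoder.classes_), y_encoded.tolist(), predicted_labels.tolist())
```

Keeping the encoder in sync with the model by hand is the bookkeeping that the wrapper takes over.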

Installation

pip install xgboost_label_encoding

The package requires Python 3.8+ and depends on xgboost<2.

For local development:

pip install -r requirements_dev.txt
pip install -e .
make test

Usage

Use XGBoostClassifierWithLabelEncoding in place of xgboost.XGBClassifier:

from xgboost_label_encoding import XGBoostClassifierWithLabelEncoding

clf = XGBoostClassifierWithLabelEncoding(
    n_estimators=100,
    class_weight="balanced",
)

clf.fit(X_train, y_train)  # y_train may contain labels like "Healthy" or "HIV"

labels = clf.predict(X_test)
probabilities = clf.predict_proba(X_test)
classes = clf.classes_

Most XGBoost classifier parameters are passed through unchanged. The wrapper adds these project-specific options:

  • class_weight: passed to sklearn.utils.class_weight.compute_sample_weight; if sample_weight is also supplied, the two weights are multiplied.
  • fail_if_nothing_learned: defaults to True; raises ValueError after fitting if all feature importances are zero.
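The two options above can be illustrated outside the wrapper. The sketch below calls compute_sample_weight the way the class_weight option is described, multiplies the result by a user-supplied weight vector, and adds a hypothetical helper, nothing_learned, that mirrors the all-zero feature importance check. The helper name and the toy arrays are assumptions for illustration; the wrapper's own implementation may differ.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

y = np.array(["Healthy", "Healthy", "Healthy", "HIV"])
user_weight = np.array([1.0, 2.0, 1.0, 1.0])  # stand-in for a sample_weight argument

# "balanced" gives each sample n_samples / (n_classes * count_of_its_class).
balanced = compute_sample_weight(class_weight="balanced", y=y)

# When both are supplied, the README says the two weights are multiplied.
combined = balanced * user_weight


def nothing_learned(feature_importances):
    """Hypothetical helper: True when every feature importance is zero."""
    return not np.any(np.asarray(feature_importances))


print(balanced, combined, nothing_learned([0.0, 0.0]))
```

Here the three "Healthy" samples each get weight 4 / (2 * 3) and the single "HIV" sample gets 4 / (2 * 1), so the minority class is upweighted before the user weights are applied.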

Cross-Validated Fitting

XGBoostClassifierWithLabelEncodingWithCV combines label encoding with cross-validation over XGBoost parameters:

from sklearn.model_selection import StratifiedKFold
from xgboost_label_encoding import XGBoostClassifierWithLabelEncodingWithCV

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

clf = XGBoostClassifierWithLabelEncodingWithCV(
    cv=cv,
    max_num_trees=200,
    early_stopping_patience=10,
    class_weight="balanced",
)

clf.fit(X_train, y_train)

During fit, the CV wrapper:

  • builds a small default grid of learning_rate and min_child_weight values unless param_grid is provided;
  • runs xgboost.cv with early stopping for each parameter set;
  • selects the best parameter set and number of boosting rounds;
  • fits the final classifier on the full training data.
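The selection logic in the steps above can be sketched without XGBoost itself. In this minimal sketch, build_grid, select_best, and toy_run_cv are hypothetical names, the grid values are not the package's defaults, and toy_run_cv is a synthetic stand-in for a call to xgboost.cv that returns one validation loss per boosting round.

```python
from itertools import product


def build_grid(learning_rates, min_child_weights):
    """Cartesian product of the two parameters, as a list of dicts."""
    return [
        {"learning_rate": lr, "min_child_weight": mcw}
        for lr, mcw in product(learning_rates, min_child_weights)
    ]


def select_best(param_grid, run_cv):
    """Pick the parameter set and round count with the lowest validation loss.

    run_cv(params) must return a list of per-round losses (lower is better).
    """
    best = None
    for params in param_grid:
        losses = run_cv(params)
        best_round = min(range(len(losses)), key=losses.__getitem__)
        candidate = (losses[best_round], params, best_round + 1)
        if best is None or candidate[0] < best[0]:
            best = candidate
    _, best_params, n_rounds = best
    return best_params, n_rounds


def toy_run_cv(params, max_rounds=20):
    """Synthetic loss curve for illustration only; not real XGBoost output."""
    lr, mcw = params["learning_rate"], params["min_child_weight"]
    return [abs(r * lr - 1.0) + 0.01 * mcw for r in range(1, max_rounds + 1)]


grid = build_grid([0.1, 0.3], [1, 5])
best_params, n_rounds = select_best(grid, toy_run_cv)
print(best_params, n_rounds)
```

The chosen parameters and round count would then be used for the final fit on the full training data, which is the last step in the list.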

If the provided CV splitter accepts a groups argument, groups can be passed to fit.
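As an illustration of what a group-aware splitter does, the sketch below uses sklearn's GroupKFold on toy data and checks that no group appears on both sides of a fold. The patient identifiers are made up, and the final comment assumes the wrapper's fit accepts the groups as a keyword named groups, which the text implies but does not spell out.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(6, 2)
y = np.array(["Healthy", "HIV", "Healthy", "HIV", "Healthy", "HIV"])
groups = np.array(["p1", "p1", "p2", "p2", "p3", "p3"])  # e.g. one ID per patient

cv = GroupKFold(n_splits=3)
folds = list(cv.split(X, y, groups=groups))

for train_idx, test_idx in folds:
    # Samples from the same group never straddle the train/test boundary.
    assert not set(groups[train_idx]) & set(groups[test_idx])

# Assumed usage with the CV wrapper (keyword name is an assumption):
# clf = XGBoostClassifierWithLabelEncodingWithCV(cv=cv)
# clf.fit(X, y, groups=groups)
```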

Behavior And Limitations

  • Training data must contain at least two classes.
  • predict returns original labels, not encoded integers.
  • predict_proba returns one probability column per class in clf.classes_.
  • For pandas DataFrame inputs, feature names containing [, ], or < are renamed internally before reaching XGBoost. feature_names_in_ still exposes the original feature names, and the same renaming is applied during predict and predict_proba.
  • XGBoostCV is also available as a standalone helper for numeric-label XGBoost classification with CV-selected hyperparameters and tree count.
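The feature-name handling described above might look roughly like the following. This is a sketch under assumptions: sanitize_feature_names is a hypothetical helper, and replacing each offending character with an underscore is one plausible scheme, not necessarily the one the package uses.

```python
import re

# Characters that the README says are renamed before reaching XGBoost.
FORBIDDEN = re.compile(r"[\[\]<]")


def sanitize_feature_names(names, replacement="_"):
    """Hypothetical helper: replace [, ], and < in each feature name."""
    return [FORBIDDEN.sub(replacement, str(name)) for name in names]


original = ["age", "IGHV[1-69]", "score<0.5"]
safe = sanitize_feature_names(original)

# Keep both versions: the original names stay user-facing, the safe names
# are what the underlying model sees at fit and predict time.
name_map = dict(zip(original, safe))
print(name_map)
```

A simple substitution like this can map two different original names to the same safe name, so a real implementation would likely need to guard against collisions.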

Development

Useful local commands:

make test
make lint
make docs
make dist

Changelog

0.0.1

  • First release on PyPI.
