C5.0 Decision Tree Classifier — a pure-Python, scikit-learn-compatible implementation
c5tree 🌳
C5.0 Decision Tree Classifier for Python — a pure-Python, scikit-learn-compatible implementation of Ross Quinlan's C5.0 algorithm.
Why C5.0?
scikit-learn's DecisionTreeClassifier uses CART (binary splits, Gini/entropy). C5.0 offers several advantages:
| Feature | CART (sklearn) | C5.0 (c5tree) |
|---|---|---|
| Split criterion | Gini / Entropy | Gain Ratio |
| Categorical splits | Binary only (categories must be encoded as numbers first) | Multi-way |
| Missing values | Requires imputation (before scikit-learn 1.3) | Native support |
| Pruning | Cost-complexity | Pessimistic Error Pruning |
| Tree size | Often larger | Often smaller, more interpretable |
Installation
pip install c5tree
Quick Start
from c5tree import C5Classifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = C5Classifier(pruning=True, cf=0.25)
clf.fit(X_train, y_train)
print(f"Accuracy: {clf.score(X_test, y_test):.3f}")
print(f"Tree depth: {clf.get_depth()}")
print(f"Leaves: {clf.get_n_leaves()}")
Key Features
Gain Ratio Splitting
Corrects ID3's bias toward features with many distinct values by normalising information gain by split information.
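To make the normalisation concrete, here is a minimal standard-library sketch of gain ratio for a multi-way categorical split. The helper names `entropy` and `gain_ratio` are hypothetical and are not part of the c5tree API; the package's internal implementation may differ.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gain_ratio(feature_values, labels):
    """Gain ratio of a multi-way split on one categorical feature."""
    n = len(labels)
    # Group the class labels by the feature value of each row.
    branches = {}
    for value, label in zip(feature_values, labels):
        branches.setdefault(value, []).append(label)
    # Information gain: parent entropy minus size-weighted child entropy.
    remainder = sum(len(b) / n * entropy(b) for b in branches.values())
    gain = entropy(labels) - remainder
    # Split information: entropy of the branch sizes themselves.
    split_info = -sum(len(b) / n * log2(len(b) / n) for b in branches.values())
    return gain / split_info if split_info > 0 else 0.0


labels = [0, 0, 1, 1]
row_id = ["a", "b", "c", "d"]  # a distinct value for every row
group = ["x", "x", "y", "y"]   # only two distinct values

print(gain_ratio(row_id, labels))  # 0.5
print(gain_ratio(group, labels))   # 1.0
```

Both features separate the classes perfectly, so their plain information gain is identical (1 bit). The ID-like feature pays a split-information penalty of 2 bits for its four branches, so gain ratio prefers the two-valued feature.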
Native Missing Value Handling
No imputation needed. Missing instances are distributed fractionally across branches, weighted by the proportion of known instances going each way.
import numpy as np
X_with_missing = X.copy().astype(float)
X_with_missing[0, 2] = np.nan # inject a missing value
clf.fit(X_with_missing, y) # works out of the box
clf.predict(X_with_missing) # also works
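The fractional distribution described above can be sketched in a few lines. This is an illustration under stated assumptions, not c5tree's actual code: the function `distribute` and the `(value, weight)` row representation are hypothetical, and a missing value is represented by `None`.

```python
def distribute(rows, threshold):
    """Send weighted rows down a `value <= threshold` split.

    Each row is a (value, weight) pair; value is None when missing.
    Rows with a missing value go down BOTH branches, with their weight
    scaled by the share of known weight that went each way.
    """
    left = [(v, w) for v, w in rows if v is not None and v <= threshold]
    right = [(v, w) for v, w in rows if v is not None and v > threshold]
    known = sum(w for _, w in left) + sum(w for _, w in right)
    if known == 0:
        raise ValueError("cannot split: every value is missing")
    p_left = sum(w for _, w in left) / known
    for v, w in rows:
        if v is None:
            left.append((v, w * p_left))
            right.append((v, w * (1 - p_left)))
    return left, right


rows = [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0), (5.0, 1.0), (None, 1.0)]
left, right = distribute(rows, threshold=3.0)
left_weight = sum(w for _, w in left)
right_weight = sum(w for _, w in right)
print(left_weight, right_weight)  # 3.75 1.25
```

Three of the four known rows go left, so the row with the missing value contributes 0.75 of its weight to the left branch and 0.25 to the right. The total weight (5.0) is preserved.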
Categorical Feature Support
Pass a pandas DataFrame and c5tree automatically detects object/category columns and applies multi-way splits:
import pandas as pd
df = pd.DataFrame({
    "outlook": ["sunny", "overcast", "rainy", "sunny", "rainy"],
    "humidity": [85, 65, 70, 95, 80],
    "play": [0, 1, 1, 0, 1],
})
X = df[["outlook", "humidity"]]
y = df["play"]
clf = C5Classifier(pruning=False).fit(X, y)
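For intuition, a multi-way split creates one branch per distinct category value, rather than a sequence of binary yes/no tests. The sketch below uses the same toy data with only the standard library; `multiway_partition` is a hypothetical helper, not a c5tree function.

```python
from collections import Counter, defaultdict

outlook = ["sunny", "overcast", "rainy", "sunny", "rainy"]
play = [0, 1, 1, 0, 1]


def multiway_partition(values, labels):
    """Return one class-count Counter per distinct category value."""
    branches = defaultdict(Counter)
    for value, label in zip(values, labels):
        branches[value][label] += 1
    return dict(branches)


branches = multiway_partition(outlook, play)
for value, counts in branches.items():
    print(value, dict(counts))
# sunny {0: 2}
# overcast {1: 1}
# rainy {1: 2}
```

A single three-way split on outlook already yields pure branches here, whereas a binary-split tree would need two successive tests to separate the same three groups.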
Pessimistic Error Pruning
After the tree is grown, a subtree is replaced by a leaf if the pessimistic error estimate of the subtree is no better than that of a single leaf. Pruning strength is controlled by the cf (confidence factor) parameter: lower values prune more aggressively.
# cf=0.25 → default, moderate pruning
# cf=0.05 → aggressive pruning, very small tree
# cf=0.50 → light pruning, larger tree
clf = C5Classifier(pruning=True, cf=0.05)
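To show how a confidence factor can drive the pruning decision, the sketch below uses the normal-approximation upper confidence bound that textbooks commonly use to explain C4.5-style pruning. This is an assumption for illustration: c5tree may compute its estimate differently (for example with exact binomial limits), and `pessimistic_error` is a hypothetical name.

```python
from math import sqrt
from statistics import NormalDist


def pessimistic_error(errors, n, cf=0.25):
    """Upper confidence bound on the true error rate of a node.

    errors: training errors at the node, n: training cases at the node.
    A smaller cf gives a larger z and therefore a more pessimistic bound.
    """
    z = NormalDist().inv_cdf(1 - cf)
    f = errors / n
    numerator = f + z * z / (2 * n) + z * sqrt(
        f / n - f * f / n + z * z / (4 * n * n)
    )
    return numerator / (1 + z * z / n)


# A subtree with three leaves, given as (errors, cases) pairs.
leaves = [(2, 6), (1, 2), (2, 6)]
total = sum(n for _, n in leaves)

# Subtree estimate: leaf estimates weighted by the cases they cover.
subtree_estimate = sum(n * pessimistic_error(e, n) for e, n in leaves) / total

# Estimate if the subtree were collapsed into one leaf (5 errors in 14).
leaf_estimate = pessimistic_error(5, total)

print(f"subtree: {subtree_estimate:.3f}, single leaf: {leaf_estimate:.3f}")
prune = leaf_estimate <= subtree_estimate
print("prune" if prune else "keep")
```

With cf=0.25 the collapsed leaf has the lower estimate (about 0.45 against about 0.51 for the subtree), so the subtree would be pruned. Small leaves are penalised most by the bound, which is why lowering cf tends to produce smaller trees.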
Human-Readable Tree
print(clf.text_report())
# [feature_2 <= 1.9000]
# left:
# → Predict: 0 [0: 1.00, 1: 0.00, 2: 0.00] (n=40.0)
# right:
# [feature_3 <= 1.7500]
# ...
Full sklearn Compatibility
Works in Pipelines, GridSearchCV, cross_val_score, and clone:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", C5Classifier()),
])
param_grid = {"clf__cf": [0.05, 0.25, 0.50], "clf__max_depth": [None, 5, 10]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| max_depth | int or None | None | Maximum tree depth |
| min_samples_split | int | 2 | Minimum samples to split a node |
| min_samples_leaf | int | 1 | Minimum samples at a leaf |
| pruning | bool | True | Enable pessimistic error pruning |
| cf | float | 0.25 | Confidence factor for pruning (0.05–0.50) |
| min_gain_ratio | float | 0.0 | Minimum gain ratio to make a split |
Running Tests
pip install "c5tree[dev]"  # quotes keep shells such as zsh from expanding the brackets
pytest tests/ -v --cov=c5tree
Background
C5.0 is the successor to C4.5 and ID3, developed by Ross Quinlan. The algorithm was open-sourced in 2011 under the GPL licence. This package is a clean pure-Python reimplementation, making C5.0 accessible to the Python data-science ecosystem for the first time as a proper scikit-learn-compatible estimator.
Reference: Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
Contributing
Contributions are very welcome! Please open an issue before submitting a PR.
Licence
GNU General Public License v3.0 — see LICENSE.
Download files
File details
Details for the file c5tree-0.1.0.tar.gz.
File metadata
- Download URL: c5tree-0.1.0.tar.gz
- Upload date:
- Size: 15.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 17b8626e453c86589e4d12feef25261d3974b6732284ce478e16ac35e40c302c |
| MD5 | 293a4b80625c99ffb1eb73ba2a440a21 |
| BLAKE2b-256 | c09463652554f2b3eba63a64bd585b4155fd63bb794737368d49db4d3383b28f |
File details
Details for the file c5tree-0.1.0-py3-none-any.whl.
File metadata
- Download URL: c5tree-0.1.0-py3-none-any.whl
- Upload date:
- Size: 12.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | da2ebd4fee7b359e5492d06eae607e5160eada1927e9376be54ad0d29790ecb3 |
| MD5 | 911303139c10abad881473292e35d9d1 |
| BLAKE2b-256 | 481270a1eb603546c7208dac1bc250ad00fe6469cb85f0ce723e46dd517b9ae1 |