Semi-Supervised Classification and Regression Tree algorithm
SemiCART: Semi-Supervised Decision Tree Algorithm
SemiCART is a semi-supervised decision tree algorithm that enhances the traditional Classification and Regression Tree (CART) algorithm by incorporating semi-supervised learning principles. Published in the International Journal of Machine Learning and Cybernetics, our approach addresses a critical limitation of standard CART algorithms by leveraging unlabeled data in the training process.
🚀 Quick Links
- Installation - Get started with SemiCART
- Quick Start - Simple example to get you started
- Benchmark Results - See performance comparisons
- How It Works - Learn about the algorithm
- Examples - Detailed usage examples
📋 Table of Contents
- Overview
- Key Features
- When to Use SemiCART
- Installation
- Quick Start
- How It Works
- Performance Visualization
- Benchmark Results
- Parameters
- Command-Line Interface
- Benchmarking
- Examples
- Advantages
- Requirements
- Troubleshooting
- Citation
- Contributing
- Community
- License
Overview
Decision trees like CART form the foundation of modern boosting methodologies such as GBM, XGBoost, and LightGBM. However, standard CART algorithms can't learn from unlabeled data. SemiCART introduces "Distance-based Weighting," which leverages principles from graph-based semi-supervised learning to:
- Calculate relevance of training records relative to test data
- Remove irrelevant records to accelerate training
- Improve overall performance through modified Gini index calculations
In our evaluations across thirteen datasets from various domains, SemiCART consistently outperformed standard CART; selected results are listed under Benchmark Results.
Key Features
- Distance-based Weighting: Assigns weights to training instances based on their similarity to test instances, focusing the model on more relevant training data.
- Modified Gini Index: Incorporates instance weights into the splitting criteria, improving the decision tree's structure.
- scikit-learn Compatible: Fully compatible with the scikit-learn API, making it easy to integrate into existing ML pipelines.
- Multiple Distance Metrics: Supports a wide range of distance metrics (euclidean, manhattan, cosine, etc.).
- Comprehensive Benchmarking: Includes a benchmarking module for performance evaluation.
- Cost-Effective Learning: Efficiently utilizes both labeled and unlabeled data, reducing the need for expensive data labeling.
When to Use SemiCART
SemiCART is particularly effective in scenarios where:
- You have limited labeled data but abundant unlabeled data
- There's a significant cost associated with data labeling
- You're working with datasets where traditional decision trees show high variance
- Your data comes from domains like medical diagnostics, fraud detection, or customer segmentation
- You require models with good interpretability, unlike black-box models
- You want to incorporate the structure of unlabeled data into your classification model
SemiCART's advantage increases with higher ratios of unlabeled to labeled data, making it ideal for semi-supervised learning tasks.
Installation
From PyPI (Recommended)
pip install semicart
From Source (Latest Development Version)
git clone https://github.com/WeightedAI/semicart.git
cd semicart
pip install -e .
Quick Start
from semicart import SemiCART
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
# Load and prepare data
X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train the Semi-CART model
model = SemiCART(k_neighbors=3, distance_metric='euclidean')
model.fit(X_train, y_train, X_test)
# Make predictions
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)
# Evaluate
print(f"Accuracy: {accuracy_score(y_test, predictions):.4f}")
# Output: Accuracy: 0.9778
How It Works
Distance-based Weighting
SemiCART introduces a novel approach to incorporate test data into the training phase, inspired by graph-based semi-supervised learning techniques:
- For each test instance, distances to all training instances are calculated
- The k-nearest training instances are identified for each test instance
- Weights of these nearest training instances are incremented
- Training instances with zero weight (not selected as neighbors) are removed
- This focuses the model on the most relevant training data relative to the test set
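The steps above can be sketched in a few lines of NumPy and SciPy. This is an illustrative sketch, not SemiCART's actual implementation: the helper name `distance_based_weights` is hypothetical, and it assumes weights start at zero so that training rows never selected as a neighbor can be dropped, as the list describes. The package's own `initial_weight` handling may differ.

```python
import numpy as np
from scipy.spatial.distance import cdist


def distance_based_weights(X_train, X_test, k_neighbors=1,
                           metric="euclidean", weight_increment=1.0):
    """Hypothetical helper: count how often each training row is among
    the k nearest training rows of some test row."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    # distances[i, j] = distance from test row i to training row j
    distances = cdist(X_test, X_train, metric=metric)
    # Indices of the k closest training rows for every test row
    nearest = np.argsort(distances, axis=1)[:, :k_neighbors]
    # Assumption: weights start at zero (see the lead-in above)
    weights = np.zeros(len(X_train))
    # np.add.at accumulates correctly when an index appears several times
    np.add.at(weights, nearest.ravel(), weight_increment)
    return weights


X_train = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [9.0, 9.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.2, 0.1], [0.9, 0.1], [5.2, 4.8]])

weights = distance_based_weights(X_train, X_test, k_neighbors=1)
keep = weights > 0  # training rows never chosen as a neighbor are removed
X_kept, y_kept, w_kept = X_train[keep], y_train[keep], weights[keep]

print(weights)       # [1. 1. 1. 0.]
print(X_kept.shape)  # (3, 2)
```

In this toy data the training row at (9, 9) is not the nearest neighbor of any test row, so it receives no weight and is filtered out before tree construction.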
Modified Gini Index
SemiCART replaces the standard class proportions in the Gini index with weight-based proportions:
Modified Gini = 1 - Σ(w_i/S)²
Where:
- w_i = sum of weights of instances in class i
- S = total sum of weights in the subset
This modified splitting criterion ensures that the resulting decision tree better captures the underlying relationships between labeled and unlabeled data.
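As a small worked example of the formula, the sketch below computes the weighted Gini impurity of a single node. The function name `weighted_gini` is hypothetical and is not part of the SemiCART API; it only illustrates how instance weights replace raw class counts.

```python
import numpy as np


def weighted_gini(labels, weights):
    """Hypothetical helper: 1 - sum_i (w_i / S)^2 for one node."""
    labels = np.asarray(labels)
    weights = np.asarray(weights, dtype=float)
    total = weights.sum()  # S: total weight in the subset
    if total == 0:
        return 0.0  # an empty or zero-weight node is treated as pure
    # w_i: summed weight of the instances belonging to class i
    class_weight = np.array(
        [weights[labels == c].sum() for c in np.unique(labels)]
    )
    return 1.0 - np.sum((class_weight / total) ** 2)


labels = np.array([0, 0, 1, 1])

# Equal weights reproduce the standard Gini index: 1 - (0.5^2 + 0.5^2)
uniform = weighted_gini(labels, [1, 1, 1, 1])
# Heavier class-0 instances: 1 - (0.75^2 + 0.25^2)
skewed = weighted_gini(labels, [3, 3, 1, 1])

print(uniform)  # 0.5
print(skewed)   # 0.375
```

With uniform weights the result matches the ordinary Gini index, so the modification only changes the split criterion when some instances carry more weight than others.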
Performance Visualization
SemiCART consistently outperforms traditional CART across various evaluation metrics:
[Figures: Accuracy Comparison, AUC Comparison, F1 Score Comparison]
These visualizations demonstrate SemiCART's superior performance across multiple datasets, particularly when leveraging unlabeled data effectively.
Benchmark Results: CART vs SemiCART
Our extensive benchmarking across multiple datasets shows that SemiCART consistently outperforms traditional CART in classification tasks:
Accuracy Improvements
| Dataset | Test Size | k | Best Distance Metric | CART | SemiCART | Improvement |
|---|---|---|---|---|---|---|
| banknote | 0.1 | 2 | hamming | 0.9928 | 1.0000 | +0.0072 |
| banknote | 0.7 | 3 | yule | 0.9594 | 0.9875 | +0.0281 |
| fertility | 0.1 | 1 | jaccard | 0.7000 | 0.9000 | +0.2000 |
| fertility | 0.3 | 5 | jensenshannon | 0.6333 | 0.8333 | +0.2000 |
| wdbc | 0.1 | 3 | sqeuclidean | 0.9298 | 1.0000 | +0.0702 |
| wdbc | 0.3 | 7 | cosine | 0.9006 | 0.9825 | +0.0819 |
| glass | 0.1 | 18 | yule | 0.6364 | 0.8636 | +0.2273 |
| glass | 0.2 | 18 | sqeuclidean | 0.7209 | 0.8837 | +0.1628 |
| transfusion | 0.1 | 6 | chebyshev | 0.7067 | 0.7733 | +0.0667 |
AUC Improvements
| Dataset | Test Size | k | Best Distance Metric | CART | SemiCART | Improvement |
|---|---|---|---|---|---|---|
| fertility | 0.2 | 13 | jaccard | 0.4722 | 0.9444 | +0.4722 |
| fertility | 0.5 | 11 | jensenshannon | 0.4545 | 0.7273 | +0.2727 |
| wdbc | 0.1 | 3 | sqeuclidean | 0.9147 | 1.0000 | +0.0853 |
| wdbc | 0.3 | 7 | cosine | 0.8892 | 0.9797 | +0.0905 |
| glass | 0.1 | 3 | yule | 0.8137 | 0.9386 | +0.1249 |
| glass | 0.7 | 12 | hamming | 0.7189 | 0.8346 | +0.1157 |
Key observations:
- Many of the largest improvements in the tables above occur at smaller test sizes (0.1 to 0.3)
- Different distance metrics work best for different datasets
- Significant improvements even on datasets with complex decision boundaries
- Some datasets show dramatic improvements in AUC (up to +0.4722)
Parameters
- max_depth: Maximum depth of the tree (default=None)
- min_samples_split: Minimum samples required to split a node (default=2)
- k_neighbors: Number of nearest neighbors to consider for weight assignment (default=1)
- distance_metric: Distance metric for similarity calculation (default='euclidean')
  - Supported values: 'euclidean', 'manhattan', 'cosine', 'braycurtis', 'canberra', 'chebyshev', 'cityblock', 'correlation', 'dice', 'hamming', 'jaccard', 'jensenshannon', 'minkowski', 'sqeuclidean', 'yule'
- initial_weight: Initial weight for each training instance (default=1.0)
- weight_increment: Weight increment for nearest neighbors (default=1.0)
- random_state: Random seed for reproducibility (default=None)
- log_level: Logging level (default=logging.INFO)
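To see why the choice of distance_metric matters, the sketch below finds the nearest training point to one test point under three metrics. It calls SciPy's `cdist` directly for illustration; how SemiCART computes distances internally is not shown here, and the toy data is made up.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two training points: one far away but in the same direction as the
# test point, one close by but pointing in a different direction.
X_train = np.array([[10.0, 10.0], [2.0, 0.0]])
X_test = np.array([[1.0, 1.0]])

nearest = {}
for metric in ("euclidean", "chebyshev", "cosine"):
    distances = cdist(X_test, X_train, metric=metric)
    # Index of the closest training row for the single test row
    nearest[metric] = int(np.argmin(distances[0]))

print(nearest)  # {'euclidean': 1, 'chebyshev': 1, 'cosine': 0}
```

Magnitude-based metrics pick the nearby point, while cosine distance ignores magnitude and picks the point with the same direction, so the set of training instances that receive weight can change with the metric.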
Command-Line Interface
SemiCART includes a convenient command-line interface for quick experimentation:
# Run with default parameters on Iris dataset
semicart
# Run with custom parameters
semicart --dataset wine --test-size 0.4 --k-neighbors 5 --distance-metric manhattan
# Get help on available options
semicart --help
Benchmarking
SemiCART includes a comprehensive benchmarking module for evaluating performance:
from semicart.benchmark import run_default_benchmark
# Run a default benchmark on common datasets
runner = run_default_benchmark()
# Or create a custom benchmark
from semicart.benchmark import BenchmarkRunner
runner = BenchmarkRunner(output_dir='my_results')
runner.run_comparison(
    dataset_names=['iris', 'wine'],
    test_sizes=[0.3, 0.5],
    k_neighbors_values=[1, 3, 5],
    distance_metrics=['euclidean', 'manhattan']
)
Examples
Check out the examples directory for more detailed usage examples:
- simple_example.py: Basic comparison with standard CART
- distance_metrics_comparison.py: Comparing different distance metrics
- advanced_usage.py: More advanced options and configurations
Basic Comparison with scikit-learn's DecisionTreeClassifier
from semicart import SemiCART
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
# Load and prepare data
X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train standard CART
cart = DecisionTreeClassifier(random_state=42)
cart.fit(X_train, y_train)
cart_pred = cart.predict(X_test)
cart_acc = accuracy_score(y_test, cart_pred)
# Train SemiCART
semicart = SemiCART(k_neighbors=5, distance_metric='euclidean', random_state=42)
semicart.fit(X_train, y_train, X_test)
semicart_pred = semicart.predict(X_test)
semicart_acc = accuracy_score(y_test, semicart_pred)
print(f"CART Accuracy: {cart_acc:.4f}")
print(f"SemiCART Accuracy: {semicart_acc:.4f}")
print(f"Improvement: {semicart_acc - cart_acc:.4f}")
Advantages
- Improved Accuracy: SemiCART consistently outperforms CART on a wide range of datasets
- Utilizes Unlabeled Data: Leverages unlabeled instances to enhance the learning process
- Cost-Effective: Reduces the need for extensive data labeling
- Flexibility: Works with various distance metrics to adapt to different data distributions
- Interpretability: Maintains the interpretability of decision trees
- Integration: Easily integrates into existing ML pipelines through scikit-learn compatibility
- Domain-Agnostic: Performs well across various domains and data types
Requirements
SemiCART requires the following dependencies:
- Python ≥ 3.7
- NumPy ≥ 1.19.0
- scikit-learn ≥ 0.24.0
- SciPy ≥ 1.6.0
- pandas ≥ 1.0.0
Compatible with all major operating systems (Windows, macOS, Linux).
Troubleshooting
Common Issues
ImportError: No module named 'semicart'
- Make sure you've installed the package with pip install semicart
- Verify your Python environment is activated if using virtual environments
AttributeError when using SemiCART with custom datasets
- Ensure your data is properly formatted (numerical, no NaN values)
- Check that feature scaling is applied for distance-based metrics
Poor performance on specific datasets
- Try different distance metrics (results vary by dataset characteristics)
- Adjust the k_neighbors parameter (values of 3-7 often work well)
- Ensure proper feature scaling is applied
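The data checks mentioned above can be done before calling fit. The sketch below is one possible approach: the helper name `prepare_features` is hypothetical, and fitting the scaler on the training split only is a common convention rather than a SemiCART requirement.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def prepare_features(X_train, X_test):
    """Hypothetical helper: reject NaN/inf values, then standardize."""
    # Conversion to float raises early if a column is non-numerical
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    for name, X in (("X_train", X_train), ("X_test", X_test)):
        if not np.isfinite(X).all():
            raise ValueError(f"{name} contains NaN or infinite values")
    # Fit the scaler on the training split and apply it to both splits
    scaler = StandardScaler().fit(X_train)
    return scaler.transform(X_train), scaler.transform(X_test)


X_train = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
X_test = [[2.0, 200.0]]
X_train_scaled, X_test_scaled = prepare_features(X_train, X_test)

print(X_train_scaled.mean(axis=0))  # both columns centered at zero
print(X_test_scaled)                # the training mean maps to zeros
```

After scaling, both features contribute on a comparable scale to distance calculations, so the second column no longer dominates because of its larger raw values.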
For more help, please open an issue on our GitHub repository.
Citation
If you use SemiCART in your research, please cite the following paper:
@article{abedinia2024semicart,
title={Building Semi-Supervised Decision Trees with Semi-CART Algorithm},
author={Abedinia, Aydin and Seydi, Vahid},
journal={International Journal of Machine Learning and Cybernetics},
volume={15},
pages={4493--4510},
year={2024},
publisher={Springer},
doi={10.1007/s13042-024-02161-z}
}
Contributing
Contributions to SemiCART are welcome! Please check our contributing guidelines for more details.
Development Setup
- Fork the repository on GitHub
- Clone your fork locally
- Create a virtual environment and install development dependencies:
pip install -e ".[dev]"
- Create a branch for your changes
- Make your changes and add tests
- Run tests locally:
pytest
- Submit a pull request
Community
- GitHub Issues: For bug reports and feature requests
- Discussions: For usage questions and discussions
- Pull Requests: For contributing code and documentation
Join our community of data scientists and machine learning practitioners to improve SemiCART and expand its capabilities!
License
SemiCART is released under the MIT License. See the LICENSE file for details.