Enterprise-ready distributed and federated Random Forests with orchestration, strategy search, and optional differential privacy
Distributed Random Forest
Distributed Random Forest is a Python package for federated and distributed tree ensembles. It is designed for people who need more than "train a few local forests and concatenate the trees":
- realistic client partitioning, including non-IID splits
- multiple aggregation strategies instead of one hard-coded merge rule
- parallel client training backends
- structured reports for benchmarking and audits
- runnable examples, docs, tests, CI, and PyPI packaging
This README is used on both GitHub and PyPI. The top half is package-user focused so the PyPI page answers "why should I install this?" before it dives into repo internals.
Why This Implementation Is Different
Most distributed RF repositories are really experiment scripts. This one is a reusable package with a benchmarkable orchestration layer.
| Area | Typical paper-style repo | This implementation |
|---|---|---|
| Distributed workflow | manual scripts | FederatedRandomForest orchestration API |
| Client heterogeneity | mostly uniform splits | uniform, stratified, feature, sized, dirichlet, label_skew |
| Tree aggregation | one or two ranking rules | classic paper rules plus balanced, proportional, threshold, and auto search |
| Execution | sequential only | sequential, thread, and process backends |
| Reporting | print statements | JSON run reports with partition, client, and strategy summaries |
| Privacy | often omitted | built-in DP RF support for experimentation and comparison |
| Packaging | code snapshot | PyPI package, CLI, docs site, CI, release workflow |
What Is Special About This Package
The main differentiator is that the package separates three concerns cleanly:
- models: Local RF and DP-RF training.
- federation: Tree ranking, voting, and aggregation.
- distributed: Partitioning, parallel client training, strategy search, and report export.
That makes it useful for both:
- researchers comparing aggregation strategies under non-IID data
- engineers who want a callable library instead of notebook-only code
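As a rough picture of what "parallel client training" in the distributed layer means, the sketch below trains one local forest per client on a thread pool. It uses only scikit-learn and the standard library; the helper `train_client`, the shard layout, and the forest sizes are illustrative assumptions and say nothing about how this package implements its backends.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def train_client(job):
    """Fit one local forest on a single client's shard (hypothetical helper)."""
    X_part, y_part, seed = job
    forest = RandomForestClassifier(n_estimators=10, random_state=seed)
    return forest.fit(X_part, y_part)


X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)

# Illustrative uniform split: shuffle the row indices and cut them into 4 shards.
shards = np.array_split(np.random.default_rng(0).permutation(len(X)), 4)
jobs = [(X[idx], y[idx], seed) for seed, idx in enumerate(shards)]

# Thread-style backend: one worker per client, results returned in client order.
with ThreadPoolExecutor(max_workers=4) as pool:
    client_models = list(pool.map(train_client, jobs))

print(len(client_models), "client forests trained")
```

A process-based variant would swap in `ProcessPoolExecutor`; on platforms that spawn worker processes, that version needs the usual `if __name__ == "__main__":` guard around the pool.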
Good Use Cases
- Network intrusion detection where traffic distributions differ by site.
- Fraud or risk scoring across branches, regions, or subsidiaries.
- Edge/IoT classification where each site owns a small, skewed local dataset.
- Privacy-sensitive health or security workflows that need federated baselines.
- Benchmarking how aggregation strategies behave under controlled heterogeneity.
Install
pip install distributed-random-forest
From source:
git clone https://github.com/Bowenislandsong/distributed_random_forrest
cd distributed_random_forrest
python -m pip install -e ".[dev,docs]"
Quick Examples
1. End-to-end federated training
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

from distributed_random_forest import FederatedRandomForest

# Synthetic 3-class dataset with a stratified 80/20 train/test split.
X, y = make_classification(
    n_samples=1200,
    n_features=20,
    n_classes=3,
    n_informative=10,
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y,
)

# Four clients on a Dirichlet (non-IID) split, trained on the thread backend;
# "auto" asks the package to search over aggregation strategies.
model = FederatedRandomForest(
    n_clients=4,
    rf_params={"n_estimators": 24, "random_state": 42, "voting": "weighted"},
    partition_strategy="dirichlet",
    partition_kwargs={"alpha": 0.8},
    aggregation_strategy="auto",
    execution_backend="thread",
    max_workers=4,
    random_state=42,
)
model.fit(X_train, y_train)
metrics = model.evaluate(X_test, y_test)

# Inspect the selected strategy and the held-out metrics.
print(model.selected_strategy)
print(metrics)
2. Quick CLI smoke test
drf-quickstart --clients 4 --partition-strategy dirichlet --backend thread
3. Differential privacy baseline
from distributed_random_forest import FederatedRandomForest

model = FederatedRandomForest(
    n_clients=5,
    rf_params={"n_estimators": 20, "random_state": 13},
    partition_strategy="stratified",
    aggregation_strategy="top_k_global_balanced_accuracy",
    use_differential_privacy=True,
    epsilon=10.0,
    random_state=13,
)
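For intuition about what the `epsilon` argument trades off, here is a generic Laplace-mechanism sketch on per-class counts. This README does not say how `DPRandomForest` spends its privacy budget, so treat the function `laplace_counts` and the count values as illustrative assumptions, not as the package's mechanism.

```python
import numpy as np


def laplace_counts(counts, epsilon, rng):
    """Add Laplace noise of scale 1/epsilon to a vector of class counts.

    Adding or removing one record changes a single count by 1, so the
    L1 sensitivity of the count vector is 1 and the noise scale is 1/epsilon.
    """
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon, size=len(counts))
    return np.asarray(counts, dtype=float) + noise


rng = np.random.default_rng(13)
true_counts = [40, 25, 35]  # made-up class counts for illustration

# Smaller epsilon means stronger privacy and noisier released counts.
for epsilon in (0.1, 1.0, 10.0):
    noisy = laplace_counts(true_counts, epsilon, rng)
    print(f"epsilon={epsilon}: {np.round(noisy, 2)}")
```

At `epsilon=10.0` the expected absolute noise per count is 0.1, while at `epsilon=0.1` it is 10, which is the privacy/utility tradeoff in miniature.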
More runnable examples are in the examples/ directory.
Performance Snapshot
The table below comes from a local single-run benchmark on a synthetic multiclass dataset with 6,000 samples, 40 features, and 4 classes. It is meant to show the relative behavior of this implementation, not to claim a universal leaderboard. You can reproduce it with examples/performance_benchmark.py.
| Scenario | Accuracy | Balanced Acc. | Weighted Acc. | F1 | Time (s) | Strategy |
|---|---|---|---|---|---|---|
| Centralized RF | 0.8642 | 0.8641 | 0.7467 | 0.8640 | 0.35 | n/a |
| Federated uniform | 0.7842 | 0.7840 | 0.6148 | 0.7833 | 1.44 | proportional_weighted_accuracy |
| Federated dirichlet | 0.7642 | 0.7642 | 0.5840 | 0.7599 | 1.42 | proportional_weighted_accuracy |
| Federated dirichlet + DP | 0.5125 | 0.5129 | 0.2629 | 0.4950 | 0.74 | top_k_global_balanced_accuracy |
What this shows:
- in this run, the federated configurations without DP retain roughly 88% to 91% of centralized accuracy (0.7642 and 0.7842 against 0.8642)
- non-IID partitions are supported as first-class workflows, not afterthought scripts
- DP support is available, with the expected privacy/utility tradeoff clearly visible
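The share of centralized accuracy that each scenario keeps can be read straight off the table. The short calculation below uses only the Accuracy column above; the variable names are illustrative.

```python
# Accuracy values copied from the benchmark table above.
centralized = 0.8642
federated = {
    "Federated uniform": 0.7842,
    "Federated dirichlet": 0.7642,
    "Federated dirichlet + DP": 0.5125,
}

# Fraction of centralized accuracy retained by each federated scenario.
retained = {name: acc / centralized for name, acc in federated.items()}

for name, ratio in retained.items():
    print(f"{name}: {ratio:.1%} of centralized accuracy")
# Federated uniform: 90.7% of centralized accuracy
# Federated dirichlet: 88.4% of centralized accuracy
# Federated dirichlet + DP: 59.3% of centralized accuracy
```

Because the table reports a single run, these ratios describe that run only.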
Supported Distributed RF Patterns
Partitioning
- uniform
- stratified
- feature
- sized
- dirichlet
- label_skew
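To make the `dirichlet` option concrete, here is a minimal NumPy sketch of the common Dirichlet label-partitioning recipe: for each class, draw client proportions from a Dirichlet distribution and split that class's samples accordingly. The function `dirichlet_partition` is a hypothetical stand-in written for this example; the package's own partitioner may differ in details.

```python
import numpy as np


def dirichlet_partition(y, n_clients, alpha, seed=0):
    """Split sample indices across clients using per-class Dirichlet shares."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(n_clients)]
    for label in np.unique(y):
        # Shuffle the indices of this class.
        idx = rng.permutation(np.flatnonzero(y == label))
        # Share of this class that each client receives.
        proportions = rng.dirichlet(alpha * np.ones(n_clients))
        # Turn the shares into cut points over the shuffled indices.
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, chunk in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(chunk.tolist())
    return [np.array(sorted(ix), dtype=int) for ix in client_indices]


y = np.repeat([0, 1, 2], 100)  # toy labels: 100 samples per class
parts = dirichlet_partition(y, n_clients=4, alpha=0.8)

# Per-client class histogram; smaller alpha gives more skewed clients.
for client, ix in enumerate(parts):
    print(client, np.bincount(y[ix], minlength=3))
```

Every sample lands on exactly one client, and lowering `alpha` concentrates each class on fewer clients.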
Aggregation
- rf_s_dts_a
- rf_s_dts_wa
- rf_s_dts_a_all
- rf_s_dts_wa_all
- top_k_global_balanced_accuracy
- top_k_global_f1
- proportional_weighted_accuracy
- proportional_balanced_accuracy
- threshold_weighted_accuracy
- automatic strategy search through FederatedRandomForest(aggregation_strategy="auto")
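The general idea behind ranking-based aggregation can be sketched with plain scikit-learn: pool the trees from every client forest, score each tree on a shared validation set, keep the best `k`, and let them vote. This is an illustration of the technique under assumed names (`tree_predict`, `top_k`, three equal shards); it is not the code behind `top_k_global_balanced_accuracy` or any other strategy listed above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=900, n_features=12, n_informative=6, n_classes=3, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Three hypothetical clients, each fitting a small forest on its own shard.
shards = np.array_split(np.arange(len(X_train)), 3)
forests = [
    RandomForestClassifier(n_estimators=8, random_state=i).fit(X_train[s], y_train[s])
    for i, s in enumerate(shards)
]


def tree_predict(forest, tree, X):
    """Map a sub-tree's encoded output back to the forest's class labels."""
    return forest.classes_[tree.predict(X).astype(int)]


# Pool all trees and score each one on the shared validation set.
pairs = [(forest, tree) for forest in forests for tree in forest.estimators_]
scores = [balanced_accuracy_score(y_val, tree_predict(f, t, X_val)) for f, t in pairs]

# Keep the top-k trees by balanced accuracy.
top_k = 10
keep = np.argsort(scores)[::-1][:top_k]
selected = [pairs[i] for i in keep]

# Majority vote of the selected trees (reusing the validation set here is
# optimistic; a separate test set would give an unbiased estimate).
votes = np.stack([tree_predict(f, t, X_val) for f, t in selected])
majority = np.array([np.bincount(col, minlength=3).argmax() for col in votes.T])
global_score = balanced_accuracy_score(y_val, majority)
print(f"kept {len(selected)} of {len(pairs)} trees, balanced accuracy {global_score:.3f}")
```

Swapping the scoring function (for example to F1) or the selection rule (a threshold instead of a fixed `k`) gives the other families of strategies in spirit.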
What You Get In The Package
- RandomForest and DPRandomForest for local models
- ClientRF and DPClientRF for client-scoped training and evaluation
- FederatedAggregator for explicit tree-selection experiments
- FederatedRandomForest for end-to-end orchestration
- partitioning utilities and JSON run report export
- a CLI, examples, tests, docs, and release automation
Documentation
- Docs site: bowenislandsong.github.io/distributed_random_forrest
- Docs source: docs/
Development
make lint
make test
make docs
make build
CI/CD
The repository includes workflows for:
- linting and tests on pushes and pull requests
- package build validation
- GitHub Pages deployment
- PyPI publishing on GitHub releases
Project Structure
distributed_random_forest/
    distributed/    # orchestration and partitioning
    experiments/    # experiment pipelines
    federation/     # aggregation and voting
    models/         # local RF implementations
docs/               # GitHub Pages / MkDocs site
examples/           # runnable use cases and benchmark scripts
tests/              # regression and end-to-end coverage