Enterprise-ready distributed and federated Random Forests with orchestration, strategy search, and optional differential privacy


Distributed Random Forest

Distributed Random Forest is a Python package for federated and distributed tree ensembles. It is designed for people who need more than "train a few local forests and concatenate the trees":

realistic client partitioning, including non-IID splits
multiple aggregation strategies instead of one hard-coded merge rule
parallel client training backends
structured reports for benchmarking and audits
runnable examples, docs, tests, CI, and PyPI packaging

This README is used on both GitHub and PyPI. The top half is package-user focused so the PyPI page answers "why should I install this?" before it dives into repo internals.

Why This Implementation Is Different

Many distributed RF repositories are experiment scripts rather than reusable libraries. This one is an installable package with an orchestration layer that can be benchmarked.

Area	Typical paper-style repo	This implementation
Distributed workflow	manual scripts	`FederatedRandomForest` orchestration API
Client heterogeneity	mostly uniform splits	`uniform`, `stratified`, `feature`, `sized`, `dirichlet`, `label_skew`
Tree aggregation	one or two ranking rules	classic paper rules plus balanced, proportional, threshold, and auto search
Execution	sequential only	sequential, thread, and process backends
Reporting	print statements	JSON run reports with partition, client, and strategy summaries
Privacy	often omitted	built-in DP RF support for experimentation and comparison
Packaging	code snapshot	PyPI package, CLI, docs site, CI, release workflow
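
The sequential, thread, and process backends in the table map naturally onto Python's `concurrent.futures` executors. The following is a minimal sketch of that idea, not the package's internals: `train_client` and `run_clients` are hypothetical names, and the "training" step is a stand-in that only counts labels.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def train_client(client_labels):
    """Stand-in for local training: return the client's label counts."""
    return Counter(client_labels)


def run_clients(partitions, backend="sequential", max_workers=None):
    """Run train_client over every partition with the chosen backend."""
    if backend == "sequential":
        return [train_client(part) for part in partitions]
    executors = {"thread": ThreadPoolExecutor, "process": ProcessPoolExecutor}
    if backend not in executors:
        raise ValueError(f"unknown backend: {backend!r}")
    with executors[backend](max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with clients.
        return list(pool.map(train_client, partitions))


if __name__ == "__main__":
    partitions = [[0, 0, 1], [1, 1, 2], [2, 2, 2, 0]]
    print(run_clients(partitions, backend="thread", max_workers=2))
```

A process pool needs the worker function to be importable at module level, which is why the demo call sits behind the `__main__` guard.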

What Is Special About This Package

The main differentiator is that the package separates three concerns cleanly:

`models`: local RF and DP-RF training.
`federation`: tree ranking, voting, and aggregation.
`distributed`: partitioning, parallel client training, strategy search, and report export.

That makes it useful for both:

researchers comparing aggregation strategies under non-IID data
engineers who want a callable library instead of notebook-only code

Good Use Cases

Network intrusion detection where traffic distributions differ by site.
Fraud or risk scoring across branches, regions, or subsidiaries.
Edge/IoT classification where each site owns a small, skewed local dataset.
Privacy-sensitive health or security workflows that need federated baselines.
Benchmarking how aggregation strategies behave under controlled heterogeneity.

Install

pip install distributed-random-forest

From source:

git clone https://github.com/Bowenislandsong/distributed_random_forrest
cd distributed_random_forrest
python -m pip install -e ".[dev,docs]"

Quick Examples

1. End-to-end federated training

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

from distributed_random_forest import FederatedRandomForest

X, y = make_classification(
    n_samples=1200,
    n_features=20,
    n_classes=3,
    n_informative=10,
    random_state=42,
)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y,
)

model = FederatedRandomForest(
    n_clients=4,
    rf_params={"n_estimators": 24, "random_state": 42, "voting": "weighted"},
    partition_strategy="dirichlet",
    partition_kwargs={"alpha": 0.8},
    aggregation_strategy="auto",
    execution_backend="thread",
    max_workers=4,
    random_state=42,
)

model.fit(X_train, y_train)
metrics = model.evaluate(X_test, y_test)
print(model.selected_strategy)
print(metrics)

2. Quick CLI smoke test

drf-quickstart --clients 4 --partition-strategy dirichlet --backend thread

3. Differential privacy baseline

from distributed_random_forest import FederatedRandomForest

model = FederatedRandomForest(
    n_clients=5,
    rf_params={"n_estimators": 20, "random_state": 13},
    partition_strategy="stratified",
    aggregation_strategy="top_k_global_balanced_accuracy",
    use_differential_privacy=True,
    epsilon=10.0,
    random_state=13,
)
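
For readers new to differentially private forests, one common building block is to add Laplace noise to the class counts stored in each leaf before they are used for prediction. The sketch below shows that mechanism in isolation. It is an illustrative assumption, not a description of how `DPRandomForest` spends its `epsilon` budget; `laplace_noise` and `noisy_leaf_counts` are hypothetical helpers and the counts are made up.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as a difference of exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def noisy_leaf_counts(counts, epsilon, rng, sensitivity=1.0):
    """Add Laplace noise with scale sensitivity/epsilon to each class count."""
    scale = sensitivity / epsilon
    return [c + laplace_noise(scale, rng) for c in counts]


rng = random.Random(13)
counts = [40, 7, 3]  # class counts in one leaf (made-up numbers)
for eps in (0.1, 1.0, 10.0):
    noisy = noisy_leaf_counts(counts, eps, rng)
    print(eps, [round(v, 2) for v in noisy])
```

Smaller `epsilon` values give a larger noise scale, which is the source of the privacy/utility tradeoff.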

More runnable examples, including the benchmark script, are in the examples/ directory of the repository.

Performance Snapshot

The table below comes from a local single-run benchmark on a synthetic multiclass dataset with 6,000 samples, 40 features, and 4 classes. It is meant to show the relative behavior of this implementation, not to claim a universal leaderboard. You can reproduce it with examples/performance_benchmark.py.

Scenario	Accuracy	Balanced Acc.	Weighted Acc.	F1	Time (s)	Strategy
Centralized RF	0.8642	0.8641	0.7467	0.8640	0.35	n/a
Federated uniform	0.7842	0.7840	0.6148	0.7833	1.44	`proportional_weighted_accuracy`
Federated dirichlet	0.7642	0.7642	0.5840	0.7599	1.42	`proportional_weighted_accuracy`
Federated dirichlet + DP	0.5125	0.5129	0.2629	0.4950	0.74	`top_k_global_balanced_accuracy`

What this shows:

in this run, federated training retains roughly 88 to 91 percent of centralized accuracy (0.7642 and 0.7842 versus 0.8642), depending on the split
non-IID partitions are supported as first-class workflows, not afterthought scripts
DP support is available, and the privacy/utility tradeoff is visible: accuracy on the dirichlet split drops from 0.7642 to 0.5125 with DP enabled
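
To make the comparison concrete, the share of centralized accuracy that each scenario retains can be computed directly from the Accuracy column of the table above. This is plain arithmetic on the reported single-run numbers, not an additional measurement.

```python
# Accuracy values copied from the performance table above.
accuracy = {
    "Centralized RF": 0.8642,
    "Federated uniform": 0.7842,
    "Federated dirichlet": 0.7642,
    "Federated dirichlet + DP": 0.5125,
}

baseline = accuracy["Centralized RF"]
retained = {name: acc / baseline for name, acc in accuracy.items()}

for name, share in retained.items():
    print(f"{name}: {share:.1%} of centralized accuracy")
```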

Supported Distributed RF Patterns

Partitioning

uniform
stratified
feature
sized
dirichlet
label_skew
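
As an illustration of what a non-IID split involves, the sketch below assigns each class's samples to clients using shares drawn from a Dirichlet distribution, which is the usual meaning of a `dirichlet` partition with an `alpha` parameter. It is a standalone approximation under that assumption, not the package's partitioner; `dirichlet_partition` is a hypothetical helper that uses only the standard library.

```python
import random
from collections import defaultdict


def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices across clients with per-class Dirichlet shares."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    clients = [[] for _ in range(n_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Normalized Gamma(alpha, 1) draws form a Dirichlet(alpha) sample.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_clients)]
        total = sum(draws)
        start = 0
        cumulative = 0.0
        for client_id, draw in enumerate(draws):
            cumulative += draw / total
            if client_id == n_clients - 1:
                end = len(indices)  # last client takes the remainder
            else:
                end = round(cumulative * len(indices))
            clients[client_id].extend(indices[start:end])
            start = end
    return clients


labels = [i % 3 for i in range(300)]
parts = dirichlet_partition(labels, n_clients=4, alpha=0.8, seed=42)
print([len(p) for p in parts])
```

Lower `alpha` values concentrate each class on fewer clients, while large values approach an even split.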

Aggregation

rf_s_dts_a
rf_s_dts_wa
rf_s_dts_a_all
rf_s_dts_wa_all
top_k_global_balanced_accuracy
top_k_global_f1
proportional_weighted_accuracy
proportional_balanced_accuracy
threshold_weighted_accuracy
automatic strategy search through FederatedRandomForest(aggregation_strategy="auto")
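
The strategy names suggest two broad families: rank all client trees by a validation score and keep the best `k` globally, or give each client a quota of trees. The sketch below illustrates both ideas on plain tuples. It rests on that reading of the names: `select_top_k` and `select_proportional` are hypothetical helpers, the scores are made up, and the package's own strategies may differ in detail.

```python
def select_top_k(trees, k):
    """Keep the k highest-scoring trees regardless of which client owns them."""
    return sorted(trees, key=lambda t: t[2], reverse=True)[:k]


def select_proportional(trees, client_sizes, k):
    """Give each client a tree quota proportional to its sample count."""
    total = sum(client_sizes.values())
    selected = []
    for client, size in client_sizes.items():
        quota = max(1, round(k * size / total))
        own = [t for t in trees if t[0] == client]
        selected.extend(select_top_k(own, quota))
    return selected


# (client_id, tree_id, validation_score); all values are made up.
trees = [
    ("a", 0, 0.81), ("a", 1, 0.74), ("a", 2, 0.69),
    ("b", 0, 0.62), ("b", 1, 0.58),
    ("c", 0, 0.77), ("c", 1, 0.71),
]
print(select_top_k(trees, k=3))
print(select_proportional(trees, {"a": 600, "b": 200, "c": 400}, k=4))
```

Because each quota is rounded and floored at one tree, the proportional variant can return slightly more or fewer than `k` trees. A global top-k can also drop a weak client entirely, which is one reason to compare several strategies under skewed splits.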

What You Get In The Package

RandomForest and DPRandomForest for local models
ClientRF and DPClientRF for client-scoped training and evaluation
FederatedAggregator for explicit tree-selection experiments
FederatedRandomForest for end-to-end orchestration
partitioning utilities and JSON run report export
a CLI, examples, tests, docs, and release automation
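
To give a feel for what a JSON run report with partition, client, and strategy summaries could contain, here is a small standalone sketch. The field names, the `RunReport` and `ClientSummary` classes, and all numbers are hypothetical and chosen for illustration; the package's actual report schema may look different.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ClientSummary:
    client_id: int
    n_samples: int
    accuracy: float


@dataclass
class RunReport:
    partition_strategy: str
    selected_strategy: str
    clients: list = field(default_factory=list)


report = RunReport(
    partition_strategy="dirichlet",
    selected_strategy="proportional_weighted_accuracy",
    clients=[ClientSummary(0, 310, 0.74), ClientSummary(1, 650, 0.79)],
)

# asdict() recurses into nested dataclasses, so the result is JSON-ready.
text = json.dumps(asdict(report), indent=2, sort_keys=True)
print(text)
```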

Documentation

Docs site: bowenislandsong.github.io/distributed_random_forrest
Docs source: docs/

Development

make lint
make test
make docs
make build

CI/CD

The repository includes workflows for:

linting and tests on pushes and pull requests
package build validation
GitHub Pages deployment
PyPI publishing on GitHub releases

Project Structure

distributed_random_forest/
  distributed/   # orchestration and partitioning
  experiments/   # experiment pipelines
  federation/    # aggregation and voting
  models/        # local RF implementations
docs/            # GitHub Pages / MkDocs site
examples/        # runnable use cases and benchmark scripts
tests/           # regression and end-to-end coverage

Release history

0.4.0

Apr 24, 2026

0.3.1

Mar 13, 2026

This version

0.3.0

Mar 13, 2026

0.2.0

Dec 2, 2025

0.1.0

Dec 2, 2025

0.0.3

Dec 3, 2025

0.0.2

Dec 2, 2025

Download files


Source Distribution

distributed_random_forest-0.3.0.tar.gz (48.7 kB view details)

Uploaded Mar 13, 2026 Source

Built Distribution


distributed_random_forest-0.3.0-py3-none-any.whl (43.3 kB view details)

Uploaded Mar 13, 2026 Python 3

File details

Details for the file distributed_random_forest-0.3.0.tar.gz.

File metadata

Download URL: distributed_random_forest-0.3.0.tar.gz
Upload date: Mar 13, 2026
Size: 48.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for distributed_random_forest-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`6f89ef953917c3b9a545ae6979c99d834d79040b171d6c61162226f4674d4961`
MD5	`7f91f09c9889093f9ec32c109b1d7b3d`
BLAKE2b-256	`4bc60329f48249933f4b70cfe542aa89303fcd4471c6472fdd4fe50af911792a`


File details

Details for the file distributed_random_forest-0.3.0-py3-none-any.whl.

File metadata

Download URL: distributed_random_forest-0.3.0-py3-none-any.whl
Upload date: Mar 13, 2026
Size: 43.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for distributed_random_forest-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b04b4960d5b0ec07d8d1bf3f13deee300845cee0d2eb255a326c06c5dff822c0`
MD5	`b3bdd885d439d75cc0eabca2bdab0253`
BLAKE2b-256	`ede41496e87bc8d31fb87c90d8b1ab84e751547fca9c9abdbed923720a8eceb1`

