Enterprise-ready distributed and federated Random Forests with orchestration, strategy search, and optional differential privacy

Distributed Random Forest

Federated and distributed Random Forest training: each client trains a local forest, selected decision trees are merged into a global model, and differential privacy (DP) can optionally be applied at the client. The design is inspired by research on Random Forest with Differential Privacy in Federated Learning; this codebase is general-purpose and ships as a reusable package (not just experiment scripts).

At a glance

  • Train RFs on many clients; aggregate with tree-ranking strategies (per-client, global, proportional, threshold, and more).
  • Gini or entropy splits; simple or weighted voting; parallel n_jobs for scoring and merged prediction.
  • DP random forests and federated DP vs non-DP comparisons.
  • Orchestration API (FederatedRandomForest), CLI (drf-quickstart), scripts for EXP 1–4-style pipelines, docs, and CI.
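
To make the idea of tree ranking concrete, here is a minimal standard-library sketch of the difference between picking the best trees per client and picking the best trees globally. It does not use the package's API: `ScoredTree`, `select_per_client`, and `select_global` are hypothetical names, and the scores are placeholder values standing in for per-tree validation scores.

```python
from typing import NamedTuple


class ScoredTree(NamedTuple):
    client_id: int
    tree_id: int
    score: float  # placeholder for a per-tree validation score


def select_per_client(trees: list[ScoredTree], k: int) -> list[ScoredTree]:
    """Keep the k best-scoring trees from every client."""
    by_client: dict[int, list[ScoredTree]] = {}
    for tree in trees:
        by_client.setdefault(tree.client_id, []).append(tree)
    selected = []
    for client_id in sorted(by_client):
        ranked = sorted(by_client[client_id], key=lambda t: t.score, reverse=True)
        selected.extend(ranked[:k])
    return selected


def select_global(trees: list[ScoredTree], k: int) -> list[ScoredTree]:
    """Keep the k best-scoring trees overall, regardless of client."""
    return sorted(trees, key=lambda t: t.score, reverse=True)[:k]


trees = [
    ScoredTree(0, 0, 0.91), ScoredTree(0, 1, 0.88), ScoredTree(0, 2, 0.85),
    ScoredTree(1, 0, 0.70), ScoredTree(1, 1, 0.65), ScoredTree(1, 2, 0.60),
]
per_client = select_per_client(trees, k=1)
global_top = select_global(trees, k=2)
print([(t.client_id, t.tree_id) for t in per_client])  # [(0, 0), (1, 0)]
print([(t.client_id, t.tree_id) for t in global_top])  # [(0, 0), (0, 1)]
```

Per-client selection keeps every site represented in the merged forest, while global selection can drop a weaker client entirely, as client 1 is dropped here.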

Creator

Maintained by Bowen Song (USC Viterbi): health AI, federated learning, explainable AI, and scalable ML systems.

Why this implementation is different

| Area | Typical paper-style repo | This implementation |
| --- | --- | --- |
| Distributed workflow | manual scripts | FederatedRandomForest orchestration API |
| Client heterogeneity | mostly uniform splits | uniform, stratified, feature, sized, dirichlet, label_skew |
| Tree aggregation | one or two rules | paper baselines plus balanced, proportional, threshold, and auto search |
| Execution | sequential only | sequential, thread, and process backends |
| Parallelism | ad hoc | n_jobs for aggregation scoring and merged RF (parity-tested vs sequential) |
| Reporting | print statements | JSON run reports with partition, client, and strategy summaries |
| Privacy | often omitted | built-in DP RF for experimentation and comparison |
| Packaging | code snapshot | PyPI, CLI, docs, CI, release workflow |
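
The sequential and thread backends in the table are expected to produce the same result. The sketch below illustrates that parity idea with the standard library only. `train_client` is a hypothetical stand-in for local training, not the package's code; the point is that per-client seeding plus an order-preserving `map` makes a threaded run match a sequential one.

```python
from concurrent.futures import ThreadPoolExecutor
import random


def train_client(job: tuple[int, list[int]]) -> dict:
    """Stand-in for local training: returns a deterministic per-client summary."""
    client_id, labels = job
    rng = random.Random(1000 + client_id)  # per-client seed keeps runs reproducible
    majority = max(set(labels), key=labels.count)
    return {"client": client_id, "majority": majority, "seed_draw": rng.random()}


jobs = [(0, [0, 0, 1]), (1, [1, 1, 2]), (2, [2, 2, 0])]

# Sequential backend: one client after another.
sequential = [train_client(job) for job in jobs]

# Thread backend: clients run concurrently; map() preserves input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    threaded = list(pool.map(train_client, jobs))

print(threaded == sequential)  # True
```

Giving each client its own seeded generator, rather than sharing one global generator across workers, is what keeps the results independent of scheduling order.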

Good use cases

  • Network/security analytics with site-specific data distributions.
  • Fraud or risk scoring across branches or regions.
  • Edge/IoT with small, skewed local datasets.
  • Health or privacy-sensitive federated baselines.
  • Benchmarking aggregation strategies under controlled heterogeneity.

Documentation

  • Online: GitHub Pages. If the site is not live, check that the repository's Pages source is set to “GitHub Actions”.
  • Local: pip install -e ".[docs]" and mkdocs serve.

Contents: concepts · patterns · pipeline · getting started · examples · repository · citing

Install

pip install distributed-random-forest

From source:

git clone https://github.com/Bowenislandsong/distributed_random_forest.git
cd distributed_random_forest
python -m pip install -e ".[dev,docs]"

Quick examples

1. End-to-end federated training

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

from distributed_random_forest import FederatedRandomForest

X, y = make_classification(
    n_samples=1200,
    n_features=20,
    n_classes=3,
    n_informative=10,
    random_state=42,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y,
)

model = FederatedRandomForest(
    n_clients=4,
    rf_params={"n_estimators": 24, "random_state": 42, "voting": "weighted"},
    partition_strategy="dirichlet",
    partition_kwargs={"alpha": 0.8},
    aggregation_strategy="auto",
    execution_backend="thread",
    max_workers=4,
    random_state=42,
)

model.fit(X_train, y_train)
metrics = model.evaluate(X_test, y_test)
print(model.selected_strategy)
print(metrics)

2. Single-site RandomForest

from distributed_random_forest import RandomForest
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

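# scikit-learn requires n_classes * n_clusters_per_class <= 2**n_informative,
# so three classes need more than the default n_informative=2.
X, y = make_classification(
    n_samples=1000, n_features=20, n_classes=3, n_informative=10, random_state=42
)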
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rf = RandomForest(
    n_estimators=100, criterion="gini", voting="simple", random_state=42
)
rf.fit(X_train, y_train)
print(f"Accuracy: {rf.score(X_test, y_test):.4f}")
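
The `voting` option above switches between simple and weighted voting. As a rough illustration of the difference, here is a standalone sketch that assumes weights are per-tree validation accuracies. That weighting scheme is an assumption for illustration; the package may weight trees differently. `simple_vote` and `weighted_vote` are hypothetical helper names.

```python
from collections import Counter, defaultdict


def simple_vote(predictions: list[int]) -> int:
    """Majority vote: every tree counts once."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions: list[int], weights: list[float]) -> int:
    """Each tree's vote counts in proportion to its weight."""
    totals: dict[int, float] = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


tree_predictions = [0, 0, 1, 1, 1]
tree_weights = [0.95, 0.90, 0.40, 0.35, 0.30]  # hypothetical per-tree accuracies

print(simple_vote(tree_predictions))                  # 1
print(weighted_vote(tree_predictions, tree_weights))  # 0
```

Three weak trees outvote two strong ones under simple voting, while weighted voting lets the two stronger trees decide.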

More snippets (federation, DP, EXP3) are on the examples page.

3. CLI

drf-quickstart --clients 4 --partition-strategy dirichlet --backend thread

4. Differential privacy (orchestrated)

from distributed_random_forest import FederatedRandomForest

model = FederatedRandomForest(
    n_clients=5,
    rf_params={"n_estimators": 20, "random_state": 13},
    partition_strategy="stratified",
    aggregation_strategy="top_k_global_balanced_accuracy",
    use_differential_privacy=True,
    epsilon=10.0,
    random_state=13,
)
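
For intuition about what `epsilon` controls, the sketch below applies the textbook Laplace mechanism to the class counts of a single leaf. This is a generic illustration, not the mechanism used by `DPRandomForest`, which this page does not describe. `laplace_noise`, `noisy_counts`, and the leaf counts are hypothetical.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_counts(
    counts: dict[int, int], epsilon: float, rng: random.Random
) -> dict[int, float]:
    """Add Laplace noise with scale sensitivity / epsilon to each class count."""
    sensitivity = 1.0  # adding or removing one record changes one count by 1
    scale = sensitivity / epsilon
    return {label: n + laplace_noise(scale, rng) for label, n in counts.items()}


leaf_counts = {0: 12, 1: 9, 2: 3}  # hypothetical class counts in one leaf
rng = random.Random(13)

for epsilon in (0.1, 10.0):
    noisy = noisy_counts(leaf_counts, epsilon, rng)
    predicted = max(noisy, key=noisy.get)  # label the leaf by its noisy majority
    print(epsilon, predicted, {k: round(v, 2) for k, v in noisy.items()})
```

A smaller `epsilon` means a larger noise scale, so the noisy majority label of a leaf is more likely to differ from the true one. That is the usual privacy and accuracy trade-off.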

Runnable scripts in the repo

Experiment drivers

python run_exp1_hparams.py
python run_exp2_clients.py
python run_exp3_federation.py
python run_exp4_dp_federation.py

UCI benchmark (Wisconsin breast cancer via scikit-learn), reporting test accuracy and prediction latency for central vs federated training:

python examples/benchmark_public_dataset.py
python examples/benchmark_public_dataset.py --quick

Performance snapshot

Reproduce with examples/performance_benchmark.py.

| Scenario | Accuracy | Balanced Acc. | Weighted Acc. | F1 | Time (s) | Strategy |
| --- | --- | --- | --- | --- | --- | --- |
| Centralized RF | 0.8642 | 0.8641 | 0.7467 | 0.8640 | 0.35 | n/a |
| Federated uniform | 0.7842 | 0.7840 | 0.6148 | 0.7833 | 1.44 | proportional_weighted_accuracy |
| Federated dirichlet | 0.7642 | 0.7642 | 0.5840 | 0.7599 | 1.42 | proportional_weighted_accuracy |
| Federated dirichlet + DP | 0.5125 | 0.5129 | 0.2629 | 0.4950 | 0.74 | top_k_global_balanced_accuracy |
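
The snapshot reports both accuracy and balanced accuracy. Assuming the package follows the standard definition (mean of per-class recall), this small sketch shows how the two can diverge on imbalanced labels. The helper functions and the toy labels are illustrative and are not taken from the benchmark.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare classes count as much as common ones."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10  # a classifier that always predicts the majority class

print(accuracy(y_true, y_pred))           # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```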

Supported distributed RF patterns

Partitioning: uniform, stratified, feature, sized, dirichlet, label_skew.
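
As an illustration of the dirichlet strategy, here is a standard-library sketch of the common recipe for label-skewed partitions: for each class, draw client proportions from a symmetric Dirichlet(alpha) and split that class's samples accordingly. `dirichlet_partition` is a hypothetical helper, and the package's own implementation may differ in detail.

```python
import random
from collections import defaultdict


def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Assign sample indices to clients with per-class Dirichlet(alpha) proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    clients = [[] for _ in range(n_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Normalised Gamma(alpha, 1) draws form a Dirichlet(alpha, ..., alpha) sample.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_clients)]
        total = sum(draws)
        start = 0
        cumulative = 0.0
        for client_id, draw in enumerate(draws):
            cumulative += draw / total
            # The last client takes the remainder so every index is assigned once.
            if client_id == n_clients - 1:
                end = len(indices)
            else:
                end = round(cumulative * len(indices))
            clients[client_id].extend(indices[start:end])
            start = end
    return clients


labels = [i % 3 for i in range(300)]  # toy labels: three balanced classes
parts = dirichlet_partition(labels, n_clients=4, alpha=0.8, seed=42)
print([len(p) for p in parts])
```

Smaller `alpha` values concentrate each class on fewer clients (stronger skew); larger values approach a uniform split.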

Aggregation (high level): rf_s_dts_a, rf_s_dts_wa, rf_s_dts_a_all, rf_s_dts_wa_all, top_k_global_balanced_accuracy, top_k_global_f1, proportional_weighted_accuracy, proportional_balanced_accuracy, threshold_weighted_accuracy, and aggregation_strategy="auto" in FederatedRandomForest.
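
The strategy names suggest two further selection rules, threshold and proportional. The sketch below shows one plausible reading of each: keep trees whose score clears a cutoff, and split a tree budget across clients in proportion to their sample counts. Both readings are assumptions made for illustration; the exact semantics of the package's strategies are not documented on this page, and `threshold_select` and `proportional_budget` are hypothetical names.

```python
def threshold_select(scores: dict[str, float], threshold: float) -> list[str]:
    """Keep every tree whose validation score reaches the threshold."""
    return [name for name, score in scores.items() if score >= threshold]


def proportional_budget(client_sizes: dict[str, int], total_trees: int) -> dict[str, int]:
    """Split a tree budget across clients in proportion to their sample counts."""
    n_total = sum(client_sizes.values())
    exact = {c: total_trees * n / n_total for c, n in client_sizes.items()}
    budget = {c: int(share) for c, share in exact.items()}  # floor of each share
    leftover = total_trees - sum(budget.values())
    # Largest-remainder rounding: leftover trees go to the biggest fractional parts.
    by_remainder = sorted(exact, key=lambda c: exact[c] - budget[c], reverse=True)
    for client in by_remainder[:leftover]:
        budget[client] += 1
    return budget


tree_scores = {"c0_t0": 0.82, "c0_t1": 0.64, "c1_t0": 0.77, "c1_t1": 0.58}
print(threshold_select(tree_scores, threshold=0.7))  # ['c0_t0', 'c1_t0']

sizes = {"site_a": 500, "site_b": 300, "site_c": 200}
print(proportional_budget(sizes, total_trees=24))
# {'site_a': 12, 'site_b': 7, 'site_c': 5}
```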

What you get in the package

  • RandomForest and DPRandomForest for local models
  • ClientRF and DPClientRF for client-scoped training
  • FederatedAggregator for explicit tree-selection experiments
  • FederatedRandomForest for end-to-end orchestration
  • partitioning utilities, JSON run report export, CLI, tests, and docs
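
To give a feel for what a JSON run report with partition, client, and strategy summaries might contain, here is a hedged sketch built with dataclasses and the `json` module. The field names and all values are hypothetical placeholders; the package's actual report schema is not shown on this page.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class ClientSummary:
    client_id: int
    n_samples: int
    local_accuracy: float


@dataclass
class RunReport:
    partition_strategy: str
    aggregation_strategy: str
    clients: list[ClientSummary] = field(default_factory=list)
    metrics: dict[str, float] = field(default_factory=dict)


# Placeholder values for illustration only, not measured results.
report = RunReport(
    partition_strategy="dirichlet",
    aggregation_strategy="auto",
    clients=[ClientSummary(0, 310, 0.78), ClientSummary(1, 650, 0.81)],
    metrics={"accuracy": 0.80},
)

# asdict() converts nested dataclasses too, so the result is JSON-serialisable.
payload = json.dumps(asdict(report), indent=2, sort_keys=True)
restored = json.loads(payload)
print(restored["clients"][1]["n_samples"])  # 650
```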

Development

make lint
make test
make docs
make build

If you do not use make, run the underlying commands directly (python -m pytest, python -m ruff check ., and so on).

CI/CD

Workflows cover lint, tests, package build, GitHub Pages (docs), and optional PyPI publishing on releases.

Project structure

distributed_random_forest/
  distributed/   # orchestration and partitioning
  experiments/   # experiment pipelines
  federation/    # aggregation and voting
  models/        # local RF implementations
docs/            # MkDocs site
examples/        # benchmarks and use cases
tests/           # regression and e2e coverage

Run tests

pytest tests/ -v
pytest tests/ --cov=distributed_random_forest

Cite

BibTeX and APA entries are on the License & citation page of the docs (or in docs/citing.md in the repo).

License

Apache License 2.0 — see LICENSE.
