Distributed/Federated Random Forest framework with Differential Privacy support

Distributed Random Forest with Differential Privacy

This repository implements a Distributed / Federated Random Forest (RF) framework inspired by:

"Random Forest with Differential Privacy in Federated Learning Framework for Network Attack Detection and Classification."

The implementation includes:

RF training on multiple distributed clients
Aggregation of decision trees into a global RF
Differential Privacy (DP) support
Extensive evaluation pipelines and hyperparameter selection

1. Core Ideas

Splitting Rules

We support the two classical RF impurity measures:

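Gini index: measures how often a randomly drawn sample in a node would be mislabeled if it were labeled according to the node's class distribution; tends to isolate the largest homogeneous class.
Entropy: measures the information content of a node's class distribution; splits are chosen to minimize within-node class diversity.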
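The two impurity measures can be illustrated with a small standalone sketch. It uses the textbook definitions and is not taken from this package's code; the helper names (`gini`, `entropy`, `split_impurity`) are hypothetical.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_impurity(left, right, measure):
    """Size-weighted impurity of the two children of a candidate split."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)

# Toy node with three classes and one candidate split
parent = ["a", "a", "a", "b", "b", "c"]
left, right = ["a", "a", "a"], ["b", "b", "c"]

print("parent:", gini(parent), entropy(parent))
print("after split:", split_impurity(left, right, gini),
      split_impurity(left, right, entropy))
```

A split is preferred when the weighted child impurity is lower than the parent's impurity, whichever measure is used.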

Ensemble Voting Methods

For local RF inference:

Simple Voting (SV): majority vote across decision trees.
Weighted Voting (WV): majority vote weighted by each DT's class-specific accuracy.
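The difference between the two voting rules can be sketched as follows. This is an illustration of the idea only, assuming each tree's per-class validation accuracy is available as a dictionary; `simple_vote`, `weighted_vote`, and the tie-breaking rule are hypothetical and may differ from the package's `voting.py`.

```python
from collections import defaultdict

def simple_vote(predictions):
    """Return the class predicted by the most trees (ties go to the smallest label)."""
    tally = defaultdict(float)
    for label in predictions:
        tally[label] += 1.0
    return max(sorted(tally), key=tally.get)

def weighted_vote(predictions, class_accuracy):
    """Each tree's vote counts as its validation accuracy on the class it predicts."""
    tally = defaultdict(float)
    for tree_idx, label in enumerate(predictions):
        tally[label] += class_accuracy[tree_idx].get(label, 0.0)
    return max(sorted(tally), key=tally.get)

# Three trees predict one sample; entry i holds tree i's per-class accuracy
predictions = [0, 0, 1]
class_accuracy = [
    {0: 0.30, 1: 0.90},
    {0: 0.40, 1: 0.80},
    {0: 0.90, 1: 0.95},
]

print(simple_vote(predictions))                    # two of three trees say class 0
print(weighted_vote(predictions, class_accuracy))  # class 1 wins: 0.95 > 0.30 + 0.40
```

In this toy case the two trees voting for class 0 are unreliable on that class, so the weighted rule overturns the plain majority.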

2. Federated Aggregation of Trees

After each client trains its own RF, decision trees (DTs) are merged into a global RF using four strategies:

Sorting DTs Within Each RF

RF_S_DTs_A — Sort DTs by validation accuracy within each client RF and select the top performers.
RF_S_DTs_WA — Same as above, but sort by weighted accuracy (WA).

Sorting DTs Across All Clients

RF_S_DTs_A_All — Collect all DTs from all clients, sort globally by accuracy, select best N.
RF_S_DTs_WA_All — Global sorting of all DTs by weighted accuracy.

These merging strategies allow the global RF to retain the strongest trees from heterogeneous local models.
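The two families of strategies differ only in where the ranking happens. The sketch below assumes each tree already has a validation score (accuracy for the `_A` variants, weighted accuracy for the `_WA` variants); the function names and the `(client_id, tree_index)` output format are hypothetical, not the package's API.

```python
def select_within_clients(client_scores, n_trees_per_client):
    """RF_S_DTs_A / RF_S_DTs_WA style: keep each client's top-k trees."""
    selected = []
    for client_id, scores in client_scores.items():
        ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
        selected.extend((client_id, t) for t in ranked[:n_trees_per_client])
    return selected

def select_across_clients(client_scores, n_total):
    """RF_S_DTs_A_All / RF_S_DTs_WA_All style: pool all trees, keep the global top-N."""
    pooled = [
        (score, client_id, t)
        for client_id, scores in client_scores.items()
        for t, score in enumerate(scores)
    ]
    pooled.sort(key=lambda item: item[0], reverse=True)
    return [(client_id, t) for _, client_id, t in pooled[:n_total]]

# Validation scores of three trees on each of two clients (made-up numbers)
scores = {"client_0": [0.91, 0.88, 0.95], "client_1": [0.70, 0.65, 0.72]}

print(select_within_clients(scores, 1))  # one tree from every client
print(select_across_clients(scores, 2))  # both trees come from client_0
```

Within-client selection guarantees every client contributes trees, while global selection can leave a weak client unrepresented, as the second call shows.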

3. Evaluation Metrics

Accuracy (A)

Overall DT accuracy on the validation set.

Weighted Accuracy (WA)

The product of a DT's overall accuracy and its mean per-class accuracy: WA = A × (mean per-class accuracy). Prioritizes trees that perform consistently across classes rather than only on the majority class.
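A small worked example of the formula, written as a standalone sketch (the helper names are hypothetical, and per-class accuracy is taken here to mean the fraction of each class's samples that are predicted correctly):

```python
def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_per_class_accuracy(y_true, y_pred):
    """Average, over classes, of the fraction of that class predicted correctly."""
    per_class = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        per_class.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(per_class) / len(per_class)

def weighted_accuracy(y_true, y_pred):
    """WA = accuracy * mean per-class accuracy."""
    return accuracy(y_true, y_pred) * mean_per_class_accuracy(y_true, y_pred)

# Imbalanced validation set: 8 samples of class 0, 2 of class 1
y_true = [0] * 8 + [1] * 2
majority_only = [0] * 10        # always predicts class 0
balanced = [0] * 6 + [1] * 4    # misses two class-0 samples, gets class 1 right

for name, y_pred in (("majority_only", majority_only), ("balanced", balanced)):
    print(name, accuracy(y_true, y_pred), weighted_accuracy(y_true, y_pred))
```

Both trees have the same plain accuracy (0.8), but the one that ignores the minority class scores a much lower WA, so it is ranked lower when sorting by WA.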

Other metrics

F1 Score (macro or weighted depending on experiment)
Client-to-global performance gap
DP degradation curves
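For the F1 score, the macro and weighted variants differ only in how per-class scores are averaged. A minimal sketch using the standard definitions (helper names are hypothetical; the experiments may compute these through a library instead):

```python
def f1_per_class(y_true, y_pred):
    """Return ({class: F1}, {class: support}) for every class seen in either list."""
    scores, support = {}, {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
        support[c] = tp + fn  # number of true samples of class c
    return scores, support

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores, _ = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    scores, support = f1_per_class(y_true, y_pred)
    return sum(scores[c] * support[c] for c in scores) / sum(support.values())

y_true = [0] * 8 + [1] * 2
y_pred = [0] * 6 + [1] * 4

print("macro:", macro_f1(y_true, y_pred))
print("weighted:", weighted_f1(y_true, y_pred))
```

On imbalanced data the macro variant is pulled down more by a weak minority class, which is why the choice matters for attack-detection datasets.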

4. Experimental Pipeline

EXP 1 — RF Hyperparameter Selection

Performed before federated splitting. Grid search over:

Number of trees (odd values from 1 to 99)
Splitting rule (gini, entropy)
Ensemble rule (SV, WV)

The best configuration is used for all remaining experiments.
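The search space described above can be enumerated as in the following sketch. The parameter names mirror the `RandomForest` constructor shown in the usage examples; the value `'weighted'` for the WV rule, and the `select_best` helper with its `evaluate` callable, are assumptions made for illustration.

```python
from itertools import product

n_estimators_grid = range(1, 100, 2)    # odd tree counts: 1, 3, ..., 99
criteria = ("gini", "entropy")
voting_rules = ("simple", "weighted")   # 'weighted' is an assumed name for WV

grid = [
    {"n_estimators": n, "criterion": c, "voting": v}
    for n, c, v in product(n_estimators_grid, criteria, voting_rules)
]
print(len(grid))  # 50 tree counts * 2 criteria * 2 voting rules

def select_best(grid, evaluate):
    """Return the configuration with the highest score from `evaluate(config)`."""
    return max(grid, key=evaluate)

# Toy scoring function standing in for "train an RF and measure validation accuracy"
best = select_best(grid, lambda cfg: -abs(cfg["n_estimators"] - 51))
print(best)
```

In a real run, `evaluate` would train a forest with the given configuration and return its validation score.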

EXP 2 — Independent RFs Per Client

Each client trains RFs independently using the best configuration from EXP 1.

Three data-partitioning strategies are evaluated:

EXP 2.1 — Feature-based Partitioning

Subsets created based on a specific feature criterion. Testing:

Only on the client's own subset
On the full global test set

EXP 2.2 — Uniform Random Partitioning

Clients receive equal amounts of random samples. Testing on the full test set.

EXP 2.3 — Random Partitioning with EXP 2.1 Sample Counts

Mimics the subset sizes from EXP 2.1 but randomizes the samples. Testing on the full test set.
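The two random partitioning schemes (EXP 2.2 and EXP 2.3) can be sketched on sample indices alone. The functions below are hypothetical stand-ins, not the package's `partition_uniform_random`, and the client sizes are made-up numbers.

```python
import random

def partition_by_counts(n_samples, counts, seed=0):
    """Shuffle sample indices and cut them into consecutive chunks of the given sizes."""
    if sum(counts) > n_samples:
        raise ValueError("requested more samples than available")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    parts, start = [], 0
    for count in counts:
        parts.append(indices[start:start + count])
        start += count
    return parts

def partition_uniform(n_samples, n_clients, seed=0):
    """EXP 2.2 style: equal-sized random partitions (leftover samples are dropped)."""
    return partition_by_counts(n_samples, [n_samples // n_clients] * n_clients, seed)

# EXP 2.2 style: 5 clients with 200 random samples each
uniform = partition_uniform(1000, 5)
# EXP 2.3 style: reuse the (here invented) subset sizes of a feature-based split
skewed = partition_by_counts(1000, [600, 300, 100])

print([len(p) for p in uniform])
print([len(p) for p in skewed])
```

Comparing EXP 2.1 against EXP 2.3 separates the effect of unequal client sizes from the effect of feature-driven (non-IID) data, since only the latter differs between them.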

EXP 3 — Global RF from Federated Aggregation

Independent client RFs are merged using the 4 strategies:

RF_S_DTs_A
RF_S_DTs_WA
RF_S_DTs_A_All
RF_S_DTs_WA_All

The global RF is evaluated on the full test set and compared to:

Independent RF performance
Best‐client performance
Baseline centralized RF (if provided)

EXP 4 — Federated RF with Differential Privacy

Each client trains a DP-Random Forest using per-client differential privacy.

Tested ε values:

0.1, 0.5, 1, 5

Pipeline:

Train DP-RF per client
Evaluate each DP-RF on the full test set
Merge using the best aggregation strategy determined in EXP 3
Compare:
- DP-client RF
- Federated DP Global RF
- Non-DP Global RF
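To give a feel for how ε affects a tree, the sketch below applies the Laplace mechanism to the class counts of a single leaf. It is an illustration only: the source does not say where noise is injected or how the budget is split across trees, so the per-leaf design, the sensitivity of 1, and the helper names are all assumptions.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def noisy_class_counts(counts, epsilon, rng):
    """Add Laplace(1/epsilon) noise to each class count and clip at zero."""
    scale = 1.0 / epsilon  # sensitivity 1: adding or removing a record changes one count by 1
    return {label: max(0.0, n + laplace_noise(scale, rng)) for label, n in counts.items()}

rng = random.Random(0)
leaf_counts = {"benign": 40, "attack": 10}  # made-up leaf statistics

for epsilon in (0.1, 0.5, 1, 5):
    noisy = noisy_class_counts(leaf_counts, epsilon, rng)
    predicted = max(noisy, key=noisy.get)
    print(epsilon, predicted, {k: round(v, 1) for k, v in noisy.items()})
```

Smaller ε means a larger noise scale, so the leaf's majority class is more likely to flip; this is the kind of effect the DP degradation curves measure.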

5. Summary of Enhancements in This Implementation

Clean modular design of RF, client trainers, and federated aggregator
Support for Gini, Entropy, SV, WV
Four global aggregation algorithms implemented
Weighted accuracy for tree ranking
Full experiment pipeline (EXP 1 → EXP 4) implemented in code
Differential privacy integrated at client training level
Extensible API for additional DP mechanisms (Gaussian, Laplace, tree-level clipping, etc.)

6. Getting Started

Installation

Install from PyPI

pip install distributed-random-forest

Install from source (development mode)

git clone https://github.com/Bowenislandsong/distributed_random_forest
cd distributed_random_forest
pip install -e .

For development with test dependencies:

pip install -e ".[dev]"

Run Experiments

python run_exp1_hparams.py    # Hyperparameter selection
python run_exp2_clients.py    # Independent client training
python run_exp3_federation.py # Federated aggregation
python run_exp4_dp_federation.py # DP federation

Run Tests

# Run all tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=distributed_random_forest

# Run specific test suites
pytest tests/test_tree_utils.py -v      # Unit tests for utilities
pytest tests/test_random_forest.py -v   # Unit tests for RF
pytest tests/test_dp_rf.py -v           # Unit tests for DP-RF
pytest tests/test_voting.py -v          # Unit tests for voting
pytest tests/test_aggregator.py -v      # Unit tests for aggregation
pytest tests/test_e2e.py -v             # End-to-end tests

7. Usage Examples

Basic Random Forest Training

from distributed_random_forest import RandomForest
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

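# Create sample data (n_informative is raised from its default of 2,
# which scikit-learn rejects for 3 classes with 2 clusters per class)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, n_classes=3, random_state=42)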
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train RF with Gini criterion and simple voting
rf = RandomForest(n_estimators=100, criterion='gini', voting='simple', random_state=42)
rf.fit(X_train, y_train)

# Evaluate
accuracy = rf.score(X_test, y_test)
print(f"Accuracy: {accuracy:.4f}")

Federated Learning with Multiple Clients

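from sklearn.model_selection import train_test_split

from distributed_random_forest import ClientRF, FederatedAggregator
from distributed_random_forest.experiments.exp2_clients import partition_uniform_random

# Hold out a validation set for ranking trees during aggregation
# (X_train, y_train, X_test, y_test come from the basic example above)
X_fit, X_val, y_fit, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Partition the remaining training data across 5 clients
partitions = partition_uniform_random(X_fit, y_fit, n_clients=5, random_state=42)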

# Train RF on each client
clients = []
for i, (X_client, y_client) in enumerate(partitions):
    client = ClientRF(client_id=i, rf_params={'n_estimators': 20, 'random_state': i})
    client.train(X_client, y_client)
    clients.append(client)

# Aggregate trees using RF_S_DTs_A strategy
aggregator = FederatedAggregator(strategy='rf_s_dts_a', n_trees_per_client=10)
aggregator.aggregate(clients, X_val, y_val)

# Build and evaluate global RF
global_rf = aggregator.build_global_rf(clients[0].rf._classes)
metrics = aggregator.evaluate(X_test, y_test)
print(f"Global RF Accuracy: {metrics['accuracy']:.4f}")

Differential Privacy Training

from distributed_random_forest import DPRandomForest, DPClientRF

# Train DP-RF with epsilon=1.0 (Laplace mechanism)
dp_rf = DPRandomForest(
    n_estimators=50,
    epsilon=1.0,
    dp_mechanism='laplace',
    random_state=42
)
dp_rf.fit(X_train, y_train)
print(f"Privacy budget: ε={dp_rf.get_privacy_budget()}")

# DP client for federated learning
# (X_client, y_client: one client's partition, as in the federated example above)
dp_client = DPClientRF(client_id=0, epsilon=0.5, rf_params={'n_estimators': 20})
dp_client.train(X_client, y_client)

Comparing Aggregation Strategies

from distributed_random_forest.experiments.exp3_global_rf import run_exp3_federated_aggregation

# clients, X_val/y_val and X_test/y_test are the objects used in the federated example above
results = run_exp3_federated_aggregation(
    client_rfs=clients,
    X_val=X_val,
    y_val=y_val,
    X_test=X_test,
    y_test=y_test,
    n_trees_per_client=10,
    verbose=True
)

print(f"Best strategy: {results['best_strategy']}")
print(f"Best accuracy: {results['best_accuracy']:.4f}")

8. Repository Structure

distributed_random_forest/
│
├── distributed_random_forest/  # Main package
│   ├── __init__.py             # Package exports (public API)
│   ├── data/                   # Raw and processed datasets
│   ├── models/
│   │   ├── random_forest.py    # Core RF implementation
│   │   ├── dp_rf.py            # Differentially private RF
│   │   └── tree_utils.py       # Utility functions for metrics
│   ├── federation/
│   │   ├── aggregator.py       # DT aggregation strategies (A, WA, All)
│   │   └── voting.py           # SV, WV methods
│   └── experiments/
│       ├── exp1_hparams.py     # Hyperparameter selection
│       ├── exp2_clients.py     # Independent client training
│       ├── exp3_global_rf.py   # Federated aggregation
│       └── exp4_dp_rf.py       # DP federation
├── tests/
│   ├── test_tree_utils.py      # Unit tests for utilities
│   ├── test_random_forest.py   # Unit tests for RF
│   ├── test_dp_rf.py           # Unit tests for DP-RF
│   ├── test_voting.py          # Unit tests for voting
│   ├── test_aggregator.py      # Unit tests for aggregation
│   └── test_e2e.py             # End-to-end tests
├── .github/workflows/
│   └── tests.yml               # CI/CD workflow
├── requirements.txt            # Python dependencies
└── README.md                   # You are here

How to Cite

If you use this project in your research, please cite it as:

BibTeX

@software{distributed_random_forest,
  author = {Bowenislandsong},
  title = {Distributed Random Forest with Differential Privacy},
  year = {2024},
  publisher = {GitHub},
  url = {https://github.com/Bowenislandsong/distributed_random_forest}
}

APA

Bowenislandsong. (2024). Distributed Random Forest with Differential Privacy [Computer software]. GitHub. https://github.com/Bowenislandsong/distributed_random_forest

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
