grove

Distributed ML training across MacBooks. Zero config.

pip install grove-ml

Mac A:

grove start train.py -n 2

Mac B:

grove join

Both machines discover each other automatically, sync gradients, and train together. No SSH, no IP addresses, no configuration files.

Grove discovers peers over AWDL (the protocol behind AirDrop), then upgrades to direct WiFi when both devices share a network. If WiFi isn't available (e.g. eduroam, or no network at all), everything stays on AWDL.

Quick start

Write a training script with a main() function:

# train.py
import grove
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

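def loss_fn(model, x, y):
    # Mean squared error between the model output and the target.
    return mx.mean((model(x) - y) ** 2)

def main():
    world = grove.init()

    model = nn.Linear(64, 64)
    optimizer = optim.SGD(learning_rate=0.01)
    # Build the loss-and-gradient function once, outside the loop.
    loss_and_grad = nn.value_and_grad(model, loss_fn)

    for step in range(100):
        # Random data stands in for a real batch.
        x = mx.random.normal((8, 64))
        y = mx.random.normal((8, 64))

        loss, grads = loss_and_grad(model, x, y)
        grads = grove.average_gradients(grads)  # all-reduce + average across devices
        optimizer.update(model, grads)
        mx.eval(model.state, optimizer.state)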

Single device:

grove run train.py

Multiple devices:

grove start train.py -n 2    # coordinator
grove join                    # worker (shows interactive picker)

Workers receive the training script from the coordinator automatically.

Algorithms

DiLoCo

Each device trains independently for H steps, then syncs pseudo-gradients with Nesterov momentum. Good default for most setups.

diloco = grove.diloco(model, H=500, outer_lr=0.7)

for step in range(total_steps):
    loss, grads = loss_and_grad(model, batch)
    optimizer.update(model, grads)
    mx.eval(model.state, optimizer.state)
    diloco.step(model)

Parameter       Default  Description
H               500      Inner steps between syncs
outer_lr        0.7      Outer optimizer learning rate
outer_momentum  0.9      Nesterov momentum
overlap         False    Async overlap (sync in background)
quantize        False    E3M0 4-bit pseudo-gradients
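To make the outer step concrete, here is a minimal single-process sketch of what a DiLoCo sync computes: each device's pseudo-gradient is the difference between the parameters at the last sync and its local parameters after H inner steps, the pseudo-gradients are averaged, and an outer SGD step with Nesterov momentum is applied. The names OuterNesterov, pseudo_gradient, and average are hypothetical and not part of grove; the momentum formula follows the common SGD-with-Nesterov convention, which is an assumption about how grove implements it.

```python
from dataclasses import dataclass, field


@dataclass
class OuterNesterov:
    """Illustrative outer optimizer for one flat parameter vector (not grove's API)."""
    lr: float = 0.7
    momentum: float = 0.9
    velocity: list = field(default_factory=list)

    def step(self, anchor, avg_pseudo_grad):
        if not self.velocity:
            self.velocity = [0.0] * len(anchor)
        new_params = []
        for i, (p, g) in enumerate(zip(anchor, avg_pseudo_grad)):
            v = self.momentum * self.velocity[i] + g
            self.velocity[i] = v
            # Nesterov look-ahead: step along the gradient plus the updated momentum.
            new_params.append(p - self.lr * (g + self.momentum * v))
        return new_params


def pseudo_gradient(anchor, local):
    # Parameters at the last sync minus parameters after H local steps.
    return [a - l for a, l in zip(anchor, local)]


def average(vectors):
    # Stand-in for the all-reduce + average that runs across devices.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


anchor = [1.0, -2.0]    # parameters shared by all devices at the last sync
local_a = [0.8, -1.5]   # device A after H inner steps
local_b = [0.6, -1.9]   # device B after H inner steps

deltas = [pseudo_gradient(anchor, w) for w in (local_a, local_b)]
avg = average(deltas)
outer = OuterNesterov(lr=0.7, momentum=0.9)
synced = outer.step(anchor, avg)   # every device adopts these parameters
```

With a larger H the devices communicate less often but drift further apart between syncs, which is the trade-off the H parameter controls.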

SparseLoCo

DiLoCo with top-k compression and error feedback. Each round sends only the largest-magnitude 1-3% of values (the defaults keep 64 of every 4096, about 1.6%); unsent values carry forward in an error buffer. ~32x less communication than dense DiLoCo.

sloco = grove.sparseloco(model, H=500, topk=64, chunk=4096)

for step in range(total_steps):
    loss, grads = loss_and_grad(model, batch)
    optimizer.update(model, grads)
    mx.eval(model.state, optimizer.state)
    sloco.step(model)

Parameter    Default  Description
H            30       Inner steps between syncs
outer_lr     1.0      Outer optimizer learning rate
topk         64       Values kept per chunk
chunk        4096     Chunk size for top-k selection
error_decay  0.95     Decay on error buffer
overlap      True     Async overlap (on by default)
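The chunked top-k selection and error feedback described above can be sketched in plain Python. This is an illustration under assumptions, not grove's implementation: the function names are hypothetical, the error buffer is assumed to be decayed before the new pseudo-gradient is added, and compression_ratio assumes each kept entry costs a 32-bit value plus a 32-bit index. That accounting happens to land on the quoted 32x for the defaults, but the actual wire format is not stated in this document.

```python
def topk_with_error_feedback(delta, error, k, chunk, decay=0.95):
    """Pick the k largest-magnitude entries per chunk; keep the rest as error.

    Returns (sparse, new_error), where sparse is a list of (index, value) pairs.
    """
    # Fold the decayed leftovers from earlier rounds into this round's values.
    acc = [decay * e + d for e, d in zip(error, delta)]
    sparse = []
    new_error = list(acc)
    for start in range(0, len(acc), chunk):
        idx = range(start, min(start + chunk, len(acc)))
        keep = sorted(idx, key=lambda i: abs(acc[i]), reverse=True)[:k]
        for i in sorted(keep):
            sparse.append((i, acc[i]))
            new_error[i] = 0.0   # sent values leave the error buffer
    return sparse, new_error


def compression_ratio(k, chunk, value_bits=32, index_bits=32):
    # Dense cost per chunk divided by sparse cost per chunk (assumed encoding).
    return (chunk * value_bits) / (k * (value_bits + index_bits))


delta = [0.1, -2.0, 0.3, 0.05, 1.5, -0.2, 0.0, 0.4]
error = [0.0] * len(delta)

# Round 1: one value per chunk of four is sent, the rest is remembered.
sent1, error = topk_with_error_feedback(delta, error, k=1, chunk=4)

# Round 2: no new signal, so the largest remembered values are sent instead.
sent2, error = topk_with_error_feedback([0.0] * len(delta), error, k=1, chunk=4)

ratio = compression_ratio(k=64, chunk=4096)
```

In round 2 the entries at indices 2 and 7 are sent even though the new pseudo-gradient is zero, which is the "carrying forward" behaviour: values that lose the top-k selection are delayed rather than dropped.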

DeMo

DCT-compressed per-step sync. Transforms gradients to frequency space and sends the most significant components. Syncs every step rather than every H steps. Better suited for fast local networks.

demo = grove.demo(model, lr=1e-3, topk=32)

for step in range(total_steps):
    loss, grads = loss_and_grad(model, batch)
    demo.step(model, grads)

Parameter  Default  Description
lr         1e-3     Learning rate
decay      0.999    EMA decay
topk       32       DCT components kept per chunk
chunk      64       Chunk size
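As a rough illustration of the frequency-space step, the sketch below applies an orthonormal DCT-II to one chunk, keeps the topk largest-magnitude coefficients, and reconstructs the chunk from them. It covers only the transform-and-select part; the EMA bookkeeping controlled by decay is left out. All function names are hypothetical, and the pure-Python DCT is for readability only, not a claim about how grove computes it.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a list of floats."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct(c):
    """Inverse of dct() (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out


def compress_chunk(x, topk):
    # Keep the topk coefficients with the largest magnitude as (index, value).
    coeffs = dct(x)
    keep = sorted(range(len(coeffs)), key=lambda k: abs(coeffs[k]), reverse=True)[:topk]
    return sorted((k, coeffs[k]) for k in keep)


def decompress_chunk(pairs, n):
    # Coefficients that were not sent are treated as zero.
    coeffs = [0.0] * n
    for k, v in pairs:
        coeffs[k] = v
    return idct(coeffs)


# A chunk made of a constant offset plus one cosine has only two
# non-zero DCT coefficients, so topk=2 reconstructs it almost exactly.
n = 8
signal = [2.0 + math.cos(math.pi * (i + 0.5) * 3 / n) for i in range(n)]
pairs = compress_chunk(signal, topk=2)
restored = decompress_chunk(pairs, n)
max_error = max(abs(a - b) for a, b in zip(signal, restored))
```

Real gradients are not this sparse in frequency space, so dropping coefficients loses information; the example only shows the mechanics of the round trip.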

API

Initialization

world = grove.init()
world.rank()   # this device's rank (0 = coordinator)
world.size()   # total number of devices
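A common use of the rank and size values is to give each device a different slice of the data. The helper below is a hypothetical sketch, not part of grove; it takes plain integers so it can run without a cluster, and in a training script the arguments would come from world.rank() and world.size().

```python
def shard(items, rank, size):
    """Return the strided slice of items that belongs to this rank."""
    if not 0 <= rank < size:
        raise ValueError("rank must satisfy 0 <= rank < size")
    # Rank r takes items r, r + size, r + 2 * size, ...
    return items[rank::size]


samples = list(range(10))
coordinator_part = shard(samples, rank=0, size=2)
worker_part = shard(samples, rank=1, size=2)
```

Strided slicing keeps the shards disjoint and within one item of each other in length.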

Collective operations

grove.average_gradients(grads)  # all-reduce + average
grove.all_sum(x)                # sum an MLX array across devices
grove.all_gather(x)             # gather an MLX array from all devices
grove.send(x, dst)              # send to a specific rank
grove.recv(shape, dtype, src)   # receive from a specific rank
grove.barrier()                 # wait for all devices
grove.report(loss)              # report loss to dashboard
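To show what "all-reduce + average" means for a gradient tree, here is a single-process simulation: it takes the gradient trees that each rank would hold and returns the tree every rank would end up with. The names tree_map and simulated_average_gradients are hypothetical, and plain floats stand in for MLX arrays; the real call operates across devices and takes only the local tree.

```python
def tree_map(fn, *trees):
    """Apply fn leaf-wise across nested dicts/lists that share one structure."""
    first = trees[0]
    if isinstance(first, dict):
        return {k: tree_map(fn, *(t[k] for t in trees)) for k in first}
    if isinstance(first, (list, tuple)):
        return type(first)(tree_map(fn, *items) for items in zip(*trees))
    return fn(*trees)


def simulated_average_gradients(per_rank_grads):
    # Sum each leaf over all ranks, then divide by the number of ranks.
    n = len(per_rank_grads)
    return tree_map(lambda *leaves: sum(leaves) / n, *per_rank_grads)


rank0 = {"weight": [1.0, 3.0], "bias": 2.0}
rank1 = {"weight": [3.0, 5.0], "bias": 4.0}
averaged = simulated_average_gradients([rank0, rank1])
```

Because every rank receives the same averaged tree, applying the same optimizer update on each device keeps the models identical, provided they started from the same parameters.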

Status

grove.rank          # int
grove.world_size    # int
grove.is_available() # True if world_size > 1

CLI

grove run <script>              Run on a single device
grove start <script> -n N       Start a cluster with N nodes
grove start <script> --name X   Start with a specific cluster name
grove join [name]               Join a cluster (interactive picker if no name)
grove status                    System info and nearby clusters

Add --logs to any command to see raw log output instead of the dashboard.

Environment variables

Variable       Effect
GROVE_NO_WIFI  Skip WiFi upgrade probe, use AWDL only
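The transport choice described earlier (AWDL by default, WiFi when both devices share a network, AWDL only when GROVE_NO_WIFI is set) amounts to a small decision rule. The function below is a hypothetical sketch of that rule, not grove's code, and it assumes that any non-empty value of the variable enables the override.

```python
import os


def choose_transport(shared_wifi, env=None):
    """Return "wifi" or "awdl" following the rule described in this README."""
    env = os.environ if env is None else env
    # Assumption: any non-empty value of GROVE_NO_WIFI skips the WiFi probe.
    if env.get("GROVE_NO_WIFI"):
        return "awdl"
    return "wifi" if shared_wifi else "awdl"


same_network = choose_transport(True, env={})
no_network = choose_transport(False, env={})
forced_awdl = choose_transport(True, env={"GROVE_NO_WIFI": "1"})
```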

Requirements

  • macOS with Apple Silicon (M1+)
  • Python 3.10+
  • MLX
  • Xcode command-line tools (for compiling the Swift helper on first run)

License

MIT
