higgsfield - multi node training without crying

Higgsfield is an open-source, fault-tolerant, highly scalable cluster management and machine learning framework designed for training models with billions to trillions of parameters, such as Large Language Models (LLMs).

Higgsfield serves as a cluster workload manager and machine learning framework with five primary functions:

  1. Allocating exclusive and non-exclusive access to compute resources (nodes) to users for their training tasks.
  2. Supporting the DeepSpeed ZeRO-3 API and PyTorch's Fully Sharded Data Parallel (FSDP) API, enabling efficient sharding for trillion-parameter models.
  3. Offering a framework for initiating, executing, and monitoring the training of large neural networks on allocated nodes.
  4. Managing resource contention by maintaining a queue for running experiments.
  5. Facilitating continuous integration of machine learning development through seamless integration with GitHub and GitHub Actions.

Higgsfield streamlines the process of training massive models and empowers developers with a versatile and robust toolset.
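To make the queueing idea in function 4 concrete, here is a minimal sketch of a first-in, first-out experiment queue. It is not Higgsfield's scheduler: the class `ExperimentQueue`, its methods, and the node names are hypothetical, and it assumes every experiment needs a fixed number of whole nodes.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ExperimentQueue:
    """Toy FIFO queue (hypothetical, not Higgsfield's implementation)."""

    free_nodes: set[str]
    pending: deque = field(default_factory=deque)
    running: dict[str, list[str]] = field(default_factory=dict)

    def submit(self, name: str, nodes_needed: int) -> None:
        """Queue an experiment and start it at once if nodes are free."""
        self.pending.append((name, nodes_needed))
        self._schedule()

    def finish(self, name: str) -> None:
        """Release the nodes of a finished experiment, then reschedule."""
        self.free_nodes.update(self.running.pop(name))
        self._schedule()

    def _schedule(self) -> None:
        # Strict FIFO: the experiment at the head of the queue blocks
        # everything behind it until enough nodes are free. A request for
        # more nodes than the cluster has would therefore block forever.
        while self.pending and self.pending[0][1] <= len(self.free_nodes):
            name, nodes_needed = self.pending.popleft()
            nodes = sorted(self.free_nodes)[:nodes_needed]
            self.free_nodes.difference_update(nodes)
            self.running[name] = nodes


queue = ExperimentQueue(free_nodes={"node-1", "node-2"})
queue.submit("alpaca", nodes_needed=2)    # starts immediately
queue.submit("ablation", nodes_needed=1)  # waits until "alpaca" finishes
queue.finish("alpaca")                    # frees nodes, "ablation" starts
```

Strict FIFO keeps the behavior easy to reason about; a real scheduler might also backfill small experiments or support priorities.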

Install

$ pip install higgsfield==0.0.3

Train example

That's all you have to do in order to train LLaMa in a distributed setting:

from higgsfield.llama import Llama70b
from higgsfield.loaders import LlamaLoader
from higgsfield.experiment import experiment

import torch.optim as optim
from alpaca import get_alpaca_data

# Register this function as the experiment named "alpaca".
@experiment("alpaca")
def train(params):
    # Load Llama 70B sharded with ZeRO stage 3, in bfloat16 precision.
    model = Llama70b(zero_stage=3, fast_attn=False, precision="bf16")

    optimizer = optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.0)

    dataset = get_alpaca_data(split="train")
    train_loader = LlamaLoader(dataset, max_words=2048)

    # Standard PyTorch training loop: the model call returns the loss.
    for batch in train_loader:
        optimizer.zero_grad()
        loss = model(batch)
        loss.backward()
        optimizer.step()

    # Upload the trained model under the name "alpaca-70b".
    model.push_to_hub('alpaca-70b')
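To see why the example asks for `zero_stage=3`, a rough back-of-the-envelope calculation helps. The sketch below assumes about 70 billion parameters and the commonly cited figure of roughly 16 bytes of model state per parameter for mixed-precision Adam-style training (2 bytes of half-precision weights, 2 bytes of gradients, 12 bytes of fp32 optimizer state). It ignores activations and temporary buffers, and the function name is hypothetical.

```python
def zero3_state_bytes_per_gpu(
    num_params: int, num_gpus: int, bytes_per_param: int = 16
) -> float:
    """Estimate model-state bytes per GPU when weights, gradients, and
    optimizer state are all sharded evenly across GPUs (ZeRO stage 3).

    The default of 16 bytes per parameter is an assumption for
    mixed-precision Adam; activations and buffers are not counted.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return num_params * bytes_per_param / num_gpus


ASSUMED_PARAMS = 70_000_000_000  # assumption: roughly 70B parameters

for gpus in (1, 8, 64):
    gigabytes = zero3_state_bytes_per_gpu(ASSUMED_PARAMS, gpus) / 1e9
    print(f"{gpus:>3} GPUs: {gigabytes:,.1f} GB of model state per GPU")
```

Under these assumptions the unsharded model state is about 1,120 GB, far beyond a single accelerator, while sharding across 64 GPUs brings it down to about 17.5 GB each.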

How is it all done?

  1. We install all the required tools on your server (Docker, your project's deploy keys, the higgsfield binary).
  2. Then we generate deploy and run workflows for your experiments.
  3. As soon as your code is pushed to GitHub, it is automatically deployed to your nodes.
  4. Then you access your experiments' run UI through GitHub, which launches the experiments and saves the checkpoints.

Design

We follow the standard PyTorch workflow, so you can incorporate anything beyond what we provide, such as DeepSpeed or Accelerate, or implement your own custom PyTorch sharding from scratch.

Environment hell

No more juggling different versions of PyTorch, NVIDIA drivers, and data processing libraries. You can easily orchestrate experiments and their environments, and document and track the specific versions and configurations of all dependencies to ensure reproducibility.
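As an illustration of what tracking dependency versions can look like, here is a small standard-library sketch that records the interpreter, the platform, and the installed versions of selected packages. It is not part of Higgsfield; the function name `snapshot_environment` and the package list are assumptions chosen for the example.

```python
import json
import platform
from importlib import metadata


def snapshot_environment(packages):
    """Record interpreter, platform, and package versions for a run.

    Packages that are not installed are recorded as None rather than
    raising, so the snapshot can always be written next to a checkpoint.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


# Hypothetical package list; adjust it to the project's dependencies.
snapshot = snapshot_environment(["torch", "deepspeed", "numpy"])
print(json.dumps(snapshot, indent=2))
```

Storing such a snapshot alongside each checkpoint makes it possible to tell later which library versions produced a given result.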

Config hell

No need to define 600 arguments for your experiment. No more YAML witchcraft. You can use whatever you want, whenever you want. We just introduce a simple interface to define your experiments. We have even taken it further: now you only need to design the way to interact.
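To show the general shape of a decorator-based experiment interface, here is a toy sketch. It is not Higgsfield's `experiment` decorator: `toy_experiment`, `run_experiment`, and the registry are hypothetical names, and the real library may pass parameters differently.

```python
from types import SimpleNamespace
from typing import Callable, Dict

# Hypothetical in-memory registry mapping experiment names to functions.
_EXPERIMENTS: Dict[str, Callable] = {}


def toy_experiment(name: str):
    """Register the decorated function under `name` (illustrative only)."""

    def register(fn: Callable) -> Callable:
        if name in _EXPERIMENTS:
            raise ValueError(f"experiment {name!r} is already registered")
        _EXPERIMENTS[name] = fn
        return fn

    return register


def run_experiment(name: str, **params):
    """Look up an experiment and call it with attribute-style params."""
    return _EXPERIMENTS[name](SimpleNamespace(**params))


@toy_experiment("alpaca")
def train(params):
    # A real experiment would build a model and train it here.
    return {"lr": params.lr, "epochs": params.epochs}


result = run_experiment("alpaca", lr=1e-5, epochs=3)
```

The experiment is an ordinary Python function, so parameters are whatever the caller passes in rather than a fixed argument schema.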

Compatibility

We need you to have nodes with:

  • Ubuntu
  • SSH access
  • Non-root user with sudo privileges (passwordless sudo is required)
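The requirements above can be checked before setup with a short preflight script. The sketch below is not part of Higgsfield and its helper names are hypothetical. It assumes the node exposes a standard `/etc/os-release` file and that `sudo -n true` fails when a password would be required. It only parses text and builds the SSH command, so nothing is executed.

```python
def parse_os_release(text: str) -> dict:
    """Parse the KEY=value lines of an /etc/os-release file."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip("\"'")
    return fields


def is_ubuntu(os_release_text: str) -> bool:
    """Return True if the os-release contents identify Ubuntu."""
    return parse_os_release(os_release_text).get("ID") == "ubuntu"


def passwordless_sudo_check(user: str, host: str) -> list:
    """Build an SSH command that succeeds only with passwordless sudo.

    BatchMode=yes stops ssh from prompting, and `sudo -n` makes sudo
    fail instead of asking for a password.
    """
    return ["ssh", "-o", "BatchMode=yes", f"{user}@{host}",
            "sudo", "-n", "true"]


sample = 'NAME="Ubuntu"\nVERSION_ID="22.04"\nID=ubuntu\n'
print(is_ubuntu(sample))
print(" ".join(passwordless_sudo_check("ubuntu", "203.0.113.10")))
```

The command list can be passed to `subprocess.run` on a machine with SSH access to the node; a zero exit code indicates that both SSH access and passwordless sudo work.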

Clouds we have tested on:

  • LambdaLabs
  • FluidStack

Feel free to open an issue if you have any problems with other clouds.

Getting started

Setup

Here you can find the quick start guide on how to set up your nodes and start training.

Tutorial

API for common tasks in Large Language Model training.

| Platform | Purpose | Estimated Response Time | Support Level |
| --- | --- | --- | --- |
| GitHub Issues | Bug reports, feature requests, install issues, usage issues, etc. | < 1 day | Higgsfield Team |
| Twitter | For staying up-to-date on new features. | Daily | Higgsfield Team |
| Website | Discussion, news. | < 2 days | Higgsfield Team |
