Microlib for the Falcon LLM

LLM Falcon model

llm_falcon_model allows you to run a part of a Falcon model as a standalone PyTorch module. This enables you to run the model in distributed mode, even on older GPUs with limited memory.

It contains only the code needed for inference. The only dependencies are torch, tokenizers, and llm_sepweight.

The original implementation is available here.

Use it when you cannot fit the whole Falcon model into memory. If you have multiple older GPUs with limited memory, you can run a different part of the Falcon model on each of them; if you then make the parts communicate (for example via socket_rpc), you can run the full model across multiple heterogeneous hosts. For example, with 4 old gaming PCs, each with a 3090 card (~6000$), you can run Falcon 40B in real time (5-6 tokens/s).

You can also use it when you want to run Falcon on a large number of inputs but have insufficient memory for the full model: run one part of the model over all inputs, serialize the intermediate results, and then continue with the next layers, as sketched below.
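A minimal sketch of this layer-wise batch processing, based on the init_part call from the quick example below. The spec string for the follow-on layers and the file paths are illustrative assumptions, not prescribed by the library:

import torch
import llm_falcon_model

tokenizer = llm_falcon_model.load_tokenizer()

# Pass 1: run the begin part and the first layers for every input,
# serializing each intermediate result to disk.
first_part = llm_falcon_model.init_part(
    model_name='40b',
    spec='b 0-12',
    device='cuda:0'
)
prompts = ["The world chess champion Magnus Carlsen", "The capital of France"]
for i, text in enumerate(prompts):
    batch = torch.tensor(tokenizer.encode(text).ids).unsqueeze(0)
    torch.save(first_part(batch), f'intermediate_{i}.pt')  # hypothetical path

# Pass 2: free the first part, load the next layers on the same device,
# and continue from the serialized results.
del first_part
next_part = llm_falcon_model.init_part(
    model_name='40b',
    spec='12-24',  # assumed spec for the next block of layers
    device='cuda:0'
)
for i in range(len(prompts)):
    x = next_part(torch.load(f'intermediate_{i}.pt'))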

Install with:

pip install llm_falcon_model


Overview

The most important methods of this microlib are:

  1. llm_falcon_model.load_tokenizer() - which loads an instance of the Tokenizer used by the Falcon models.
  2. llm_falcon_model.init_part(model_name, spec, device) - which creates a part of a Falcon model from a model name (7b, 40b, or 180b), a part specification (which layers you want to load; see the sepweight part spec), and a PyTorch device.
  3. llm_falcon_model.generate - which allows you to generate text based on a prompt (see the sketch after the quick example below).
  4. llm_falcon_model.score_batch - which allows you to score several possible continuations of a prompt (also sketched below).
  5. llm_falcon_model.run_part - which allows you to run a part of Falcon in distributed mode using socket_rpc.

Quick example

import torch
import llm_falcon_model

tokenizer = llm_falcon_model.load_tokenizer()

separated_weights_path = '<PATH TO SEPARATED WEIGHTS>'

model = llm_falcon_model.init_part(
    model_name='40b',
    spec='b 0-12',  # 'b' = the begin part, plus layers 0 to 12 (see the sepweight part spec)
    device='cuda:0'
)

input_text = "The world chess champion Magnus Carlsen"
input_ids = tokenizer.encode(input_text).ids
batch = torch.tensor(input_ids).unsqueeze(0)
x = model(batch)

# x now holds the intermediate result after layer 12, with shape:
# torch.Size([1, 7, 8192])
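For completeness, here is a hedged sketch of generate and score_batch. Their exact signatures are not shown above, so the argument names below (and the full-model spec string) are assumptions for illustration, not the documented API:

import llm_falcon_model

tokenizer = llm_falcon_model.load_tokenizer()
full_model = llm_falcon_model.init_part(
    model_name='7b',
    spec='b 0-32 e',  # assumed spec for a full 7b model (begin, all layers, end)
    device='cuda:0'
)

# Generate a continuation of a prompt (argument names are assumed).
text = llm_falcon_model.generate(
    full_model, tokenizer,
    prompt="The world chess champion Magnus Carlsen",
    max_new_tokens=20,
)

# Score candidate continuations of a prompt (argument names are assumed).
scores = llm_falcon_model.score_batch(
    full_model, tokenizer,
    prompt="The capital of France is",
    continuations=[" Paris", " London", " Berlin"],
)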
