Microlib for the Falcon LLM
Project description
LLM Falcon model
`llm_falcon_model` allows you to run a part of a Falcon model as a standalone PyTorch module. This lets you run the model in distributed mode, even on older GPUs with limited memory. It contains only the code needed for inference. The only dependencies are `torch`, `tokenizers` and `llm_sepweight`.
The original implementation is available here.
Use it when you cannot fit the whole Falcon model into memory. If you have multiple older GPUs with limited memory, you can run different parts of the Falcon model on each of them, and by making them communicate (for example via `socket_rpc`) you can run the full model across multiple heterogeneous hosts. For example, with 4 old gaming PCs, each with a 3090 card (~$6,000 in total), you can run Falcon 40B in real time (5-6 tokens/s).

You can also use it when you want to run Falcon on a large number of inputs and have insufficient memory for the full model: run the first layers over all inputs, serialize the intermediary results, and then continue with the next layers, as in the sketch below.
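A minimal sketch of that two-stage workflow, assuming Falcon 7B's 32 decoder layers, that the part specs shown here are valid (the `e` end marker and the exact range semantics should be checked against the sepweight part spec), and that a later part accepts the hidden states produced by an earlier one:

```python
import torch
import llm_falcon_model

tokenizer = llm_falcon_model.load_tokenizer()
prompts = ["First prompt", "Second prompt"]  # your actual inputs

# Stage 1: run the begin (embeddings) and the first half of the layers
# over every input, serializing each intermediary result to disk.
first_half = llm_falcon_model.init_part(model_name='7b', spec='b 0-16', device='cuda:0')
for i, prompt in enumerate(prompts):
    input_ids = tokenizer.encode(prompt).ids
    hidden = first_half(torch.tensor(input_ids).unsqueeze(0))
    torch.save(hidden, f'intermediate_{i}.pt')

# Stage 2 (after freeing the first part's memory): continue with the
# remaining layers. The '16-32 e' spec is an assumption, not a documented value.
del first_half
second_half = llm_falcon_model.init_part(model_name='7b', spec='16-32 e', device='cuda:0')
for i in range(len(prompts)):
    hidden = torch.load(f'intermediate_{i}.pt')
    logits = second_half(hidden)
```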
Install with:

```bash
pip install llm_falcon_model
```
Overview
The most important methods of this microlib are:

- `llm_falcon_model.load_tokenizer()` - loads an instance of the `Tokenizer` for the models.
- `llm_falcon_model.init_part(model_name, spec, device)` - creates a part of a Falcon model, given a model name (`7b`, `40b` or `180b`), a part specification (which layers you want to load, see the sepweight part spec) and a PyTorch device.
- `llm_falcon_model.generate` - generates text based on a prompt.
- `llm_falcon_model.score_batch` - scores a set of possible continuations of a prompt.
- `llm_falcon_model.run_part` - runs a part of Falcon in distributed mode using `socket_rpc` (see the sketch after this list).
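For example, splitting Falcon 40B across three hosts could use part specifications like the following (a sketch: the 60-layer count, the `e` end marker and the range semantics are assumptions to be checked against the sepweight part spec; serving each part to the others, e.g. via `run_part`, is not shown):

```python
import llm_falcon_model

# Hypothetical split of Falcon 40B (60 decoder layers) across three hosts:
# 'b' loads the input embeddings, 'e' is assumed to load the final norm/LM head.
specs = ['b 0-20', '20-40', '40-60 e']

# On host 0 you would initialize only the first slice; hosts 1 and 2
# would do the same with specs[1] and specs[2] on their own GPUs.
part = llm_falcon_model.init_part(model_name='40b', spec=specs[0], device='cuda:0')
```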
Quick example
```python
import torch
import llm_falcon_model

tokenizer = llm_falcon_model.load_tokenizer()
separated_weights_path = '<PATH TO SEPARATED WEIGHTS>'  # weights in llm_sepweight format
model = llm_falcon_model.init_part(
    model_name='40b',
    spec='b 0-12',  # load the begin (embeddings) and layers 0 to 12
    device='cuda:0'
)

input_text = "The world chess champion Magnus Carlsen"
input_ids = tokenizer.encode(input_text).ids
batch = torch.tensor(input_ids).unsqueeze(0)
x = model(batch)
# x is now the result after layer 12, shaped:
# torch.Size([1, 7, 8192])
```
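To continue past layer 12, you could initialize the remaining layers as a second part, possibly on another device or host, and feed `x` into it. A sketch under the same assumptions as above (the `12-60 e` spec string in particular is a guess at the format, not a documented value):

```python
# Run the rest of Falcon 40B, up to and including the end (final norm/LM head),
# on a second device. Adjust the range if '0-12' above is inclusive of layer 12.
rest = llm_falcon_model.init_part(
    model_name='40b',
    spec='12-60 e',
    device='cuda:1'
)
logits = rest(x.to('cuda:1'))
next_token_id = int(logits[0, -1].argmax())
print(tokenizer.decode([next_token_id]))
```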
Project details
Download files
Source Distribution: llm_falcon_model-0.7.0.tar.gz

Built Distribution: llm_falcon_model-0.7.0-py3-none-any.whl
File details
Details for the file llm_falcon_model-0.7.0.tar.gz.
File metadata
- Download URL: llm_falcon_model-0.7.0.tar.gz
- Upload date:
- Size: 795.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 9a7019a9feb82569d3290d33c2582a404a529483a0c26a5499b491c85e64c8f0
MD5 | 18e966344ac9c7a26b2e74fdb75b1d3b
BLAKE2b-256 | c1246560aeef959ad8477337a3fbc10dff2fd20eac42454305c8eb740042b3fe
File details
Details for the file llm_falcon_model-0.7.0-py3-none-any.whl.
File metadata
- Download URL: llm_falcon_model-0.7.0-py3-none-any.whl
- Upload date:
- Size: 809.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | a74362aa748f6d531d6b5948623ac2e46b3ccaf8bed0f61fa2ef6df0465b59d1
MD5 | 1f713ed622e73670708b20e611a98de4
BLAKE2b-256 | d008686b735e59d465e13c06967d903dd41907890c9db3f028e1f049f8f2a83b
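To check a downloaded file against the digests above, a minimal verification sketch (the filename is whatever you downloaded):

```python
import hashlib

# SHA256 digest of the wheel, as published above.
expected = 'a74362aa748f6d531d6b5948623ac2e46b3ccaf8bed0f61fa2ef6df0465b59d1'

with open('llm_falcon_model-0.7.0-py3-none-any.whl', 'rb') as f:
    actual = hashlib.sha256(f.read()).hexdigest()

assert actual == expected, 'hash mismatch - do not install this file'
```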