Skip to main content

Memory-based language modeling. Eco-friendly CPU-based LLMs

Project description

Olifant: Memory-based language modeling

This repository contains instructions and code to install, train and run memory-based LLMs.

Looking for an LLM that is relatively eco-friendly? Memory-based language models rely on CPUs. No GPUs or TPUs are required for training or inference. Training memory-based language models is costly in terms of RAM, but not in terms of time or computing resources. Running a memory-based language model in autoregressive GPT-style mode also costs RAM, but still relies on CPUs and is reasonably fast as well, depending on the selected approximation of k-nearest neighbor classification.

Olifant is our implementation of memory-based language modeling; it is the Dutch word for elephant. To quote Wikipedia, "Most contemporary ethologists view the elephant as one of the world's most intelligent animals. Elephants manifest a wide variety of behaviors, including those associated with grief, learning, mimicry, playing, altruism, tool use, compassion, cooperation, self-awareness, memory, and communication."

Installation

Download and install via pip:

% pip install olifant

Alternatively, for the latest development version, clone this repository and run pip install ..

This will automatically pull in and install the Python bindings to TiMBL, python3-timbl (wheels are available for Python versions 3.10, 3.11, and 3.12 on systems with glibc 2.28 or higher; on macOS, installation only works with Python 3.13 currently):

Olifant also relies on the command-line version of TiMBL memory-based classification engine for training.

Install TiMBL on Debian/Ubuntu systems with

$ sudo apt install timbl

On Alpine Linux:

# apk add timbl

On macOS with brew, invoke

% brew install timbl

Note: Windows is not supported

Usage

Tokenization

Training Olifant assumes that you have a tokenizer and a raw-text training set textfile. The tokenizer will have to be the same tokenizer used for testing. First, the text is tokenized using bert-base-cased (a standard LLM tokenizer from Hugging Face; we will need to use the same tokenizer in later steps). Edit tok.py if you want to use a different tokenizer.

% olifant-tok textfile

This creates a file textfile_tok which then needs to be converted to a fixed-width instance base to make it suitable training data for TiMBL. The example works with an input buffer of 16 tokens, which in current LLM terms is a very small input buffer. At inference time, however, single instances are incrementally stored in memory, becoming available for the next steps in inference in the internal "long-term" memory of the memory-based classifier.

% olifant-continuous-windowing textfile_tok 4 > textfile_tok.l4r0

This creates textfile_tok.l4r0, creating 4-token windowed instances with the next token as the label to be classified and all previous tokens as context. Empty lines in the original tokenized text signify the reset of the context window (padded with "_").

Training

Training can then be invoked by calling TiMBL. This can take a while and may consume high amounts of RAM.

% timbl -f textfile_tok.l4r0 -a0 +D -I textfile_tok.l4r0.ibase

The end result is textfile_tok.l4r0.ibase, an indexed and compressed instance base suitable for TiMBL classification. In LLM terms, this is the model file that you will need for your favorite LLM inference steps.

The option -a0 means that the training set is compressed losslessly, with compression rates around 10-30%. With -a1, a strong lossy compression is applied, yielding higher compression levels around 90-95%, and considerably faster but less accurate inference.

Fine-tuning

Memory-based language models are natural incremental learners, so any learned model can be complemented by additional fine-tuning from any new training set, creating a new ibase model. This requires a TiMBL invocation similar to the training command; it now includes a previously generated ibase model file as starting point. Assuming you have tokenized and windowed a new training set finetune_tok.l4r0:

% timbl -a0 +D -i textfile_tok.l4r0.ibase -f finetune_tok.l4r0 -I textfile-finetune_tok.l4r0.ibase

Choose your own naming conventions to keep track of trained and finetuned ibase model files. Any ibase file can be the starting point for further finetuning. This also offers a way to do stepwise training with segments of training data under limited RAM conditions.

Inference

Simple GPT-style text completion can be invoked by issuing

% olifant-timbl-llm --classifier textfile-finetune_tok.l4r0 --tokenizer bert-base-cased --timbl_args '-a4 +D' --verbosity 3

This call assumes the presence of textfile-finetune_tok.l4r0.ibase. The arguments passed to the TiMBL engine are '-a4 +D', invoking the so-called TRIBL2 k-NN approximation, a relatively fast approximation of k-nearest neighbor classification. See the TiMBL reference guide for all possible algorithmic variants (-a), the important k parameter (set to 1 by default), and many more options.

You can also run a Jupyter Notebook version:

% jupyter notebook timbl-llm.ipynb

Be sure to adjust the way you load your .ibase model file.

Inference, Hugging Face style

In this Jupyter Notebook you see how Olifant can be run Hugging Face style:

% jupyter notebook timbl-llm-hf.ipynb

An excerpt from this code shows how a TimblHuggingFaceModel is initialized:

    # Initialize the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer)

    # Initialize the Timbl classifier
    classifier = timbl.TimblClassifier(args.classifier, args.timbl_args)
    classifier.load()

    config = AutoConfig.from_pretrained("antalvdb/mblm-chatbot-instruction-prompts-igtree")
    tokenizer.add_special_tokens({'pad_token': '_'})
    tokenizer.pad_token = "_"

    # Initialize the TimblHuggingFaceModel
    model = TimblHuggingFaceModel(config, classifier, tokenizer)

Credits

TiMBL was created 25 years ago by a team that was once the Induction of Linguistic Knowledge group at Tilburg University, the Netherlands; members moved to the Computational Linguistics, Psycholinguistics and Sociolinguistics group at Antwerp University, Belgium, and the Centre for Language and Speech Technology at Radboud University, Nijmegen, the Netherlands. Core developer of TiMBL is Ko van der Sloot. Other contributors were Walter Daelemans, Antal van den Bosch, Jakub Zavrel, Peter Berck, Maarten van Gompel, and many more people credited fully in the TiMBL reference guide.

Memory-based language modeling was first described in

Van den Bosch, A. (2005). Scalable classification-based word prediction and confusible correction. Traitement Automatique des Langues, 46:2, pp. 39-63.

Olifant is a re-implementation of WOPR, a C++ version of a TiMBL-based word predictor developed by Peter Berck, funded under the NWO Vici project "Memory Models of Language" (2006-2011) awarded to Antal van den Bosch. Peter Berck wrote a PhD thesis on the topic. Later, work on memory-based word prediction was carried forwards by Wessel Stoop and Maarten van Gompel (Valkuil, Colibri Core). See this interactive publication on autocompletion and next-word prediction.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

olifant-0.2.0.tar.gz (29.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

olifant-0.2.0-py3-none-any.whl (29.6 kB view details)

Uploaded Python 3

File details

Details for the file olifant-0.2.0.tar.gz.

File metadata

  • Download URL: olifant-0.2.0.tar.gz
  • Upload date:
  • Size: 29.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.12

File hashes

Hashes for olifant-0.2.0.tar.gz
Algorithm Hash digest
SHA256 8430b6a95d9a9fa072ed1a8ae006ba4f78e3f825a4ac6f8dd888b29de28a1b3e
MD5 52b66485684081348661ae68d6706d63
BLAKE2b-256 6d5507d98fe7c0f8d546924f8edcdd73f712a3834b485937f534f021b568f33c

See more details on using hashes here.

File details

Details for the file olifant-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: olifant-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 29.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.12

File hashes

Hashes for olifant-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 c4f93ecb2886a2da09b2f7c9c79fa47f6ca38bf965ba59c890353cbf7abb860f
MD5 3aaee36f6512284157a4c3c68ae6abb5
BLAKE2b-256 33f1d435d6fc09f8bac07ba0a2aac0c929c8b5f1d9644c90f92af32b980e8c3d

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page