🤖 Train your LLMs with ease and fun.

llm_trainer in 5 Lines of Code

from llm_trainer import create_dataset, LLMTrainer

create_dataset(save_dir="data")   # Generate the default FineWeb dataset
model = ...                       # Define or load your model (GPT, xLSTM, Mamba...)
trainer = LLMTrainer(model)       # Initialize trainer with default settings
trainer.train(data_dir="data")    # Start training on the dataset

🔴 YouTube Video: Train LLMs in code, spelled out

Note: Explore the usage examples.

Installation

$ pip install llm-trainer

How to Prepare Data

Option 1: Use the Default FineWeb Dataset

from llm_trainer import create_dataset

create_dataset(save_dir="data",         # Where to save created dataset
               chunks_limit=1_500,      # Maximum number of files (chunks) with tokens to create
               chunk_size=int(1e6))     # Number of tokens per chunk

Option 2: Use your own data

  1. Your dataset should be structured as a JSON array, where each entry contains a "text" field. You can store your data in one or multiple JSON files.

Example JSON file:

[
   {"text": "Learn about LLMs: https://www.youtube.com/@_NickTech"},
   {"text": "Open-source python library to train LLMs: https://github.com/Skripkon/llm_trainer."},
   {"text": "My name is Nikolay Skripko. Hello from Russia (2025)."}
]
  2. Run the following code to convert your JSON files into a tokenized dataset:
from llm_trainer import create_dataset_from_json

create_dataset_from_json(save_dir="data",        # Where to save created dataset
                         json_dir="json_files",  # Path to your JSON files
                         chunks_limit=1_500,     # Maximum number of files (chunks) with tokens to create
                         chunk_size=int(1e6))    # Number of tokens per chunk
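
If your raw texts are not yet in the layout described above, a small helper can produce and check it. The following is a minimal sketch using only the standard library; the function names `write_json_dataset` and `validate_json_file` and the file name `part_000.json` are hypothetical and not part of llm_trainer.

```python
import json
from pathlib import Path


def write_json_dataset(texts, json_dir="json_files", filename="part_000.json"):
    """Write strings as a JSON array of {"text": ...} objects (hypothetical helper)."""
    out_dir = Path(json_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Skip empty or whitespace-only strings so every entry carries real text.
    records = [{"text": t} for t in texts if t and t.strip()]
    path = out_dir / filename
    with path.open("w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return path


def validate_json_file(path):
    """Return the number of entries; raise ValueError if the layout is wrong."""
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError(f"{path}: top-level value must be a JSON array")
    for i, entry in enumerate(data):
        if not isinstance(entry, dict) or not isinstance(entry.get("text"), str):
            raise ValueError(f"{path}: entry {i} has no string 'text' field")
    return len(data)
```

Calling `write_json_dataset(my_texts)` and then `validate_json_file(...)` on the returned path before running `create_dataset_from_json` catches malformed files early.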

Which Models Are Valid?

You can train ANY LLM that expects a tensor X with shape (batch_size, context_window) as input and returns logits during the forward pass.
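
To make this contract concrete, here is a minimal sketch of a PyTorch module that satisfies it. `TinyLM`, its layer sizes, and the vocabulary size of 1,000 are illustrative assumptions; the output shape (batch_size, context_window, vocab_size) is the usual convention for language-model logits and is assumed here rather than taken from the library.

```python
import torch
from torch import nn


class TinyLM(nn.Module):
    """Toy model (hypothetical): token ids (batch_size, context_window) -> logits."""

    def __init__(self, vocab_size: int, d_model: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x holds integer token ids with shape (batch_size, context_window).
        h = self.embed(x)      # (batch_size, context_window, d_model)
        return self.head(h)    # (batch_size, context_window, vocab_size)


vocab_size = 1_000                            # assumed value for illustration
model = TinyLM(vocab_size)
x = torch.randint(0, vocab_size, (4, 128))    # batch of 4 sequences, 128 tokens each
logits = model(x)
print(logits.shape)                           # torch.Size([4, 128, 1000])
```

This toy model returns the logits tensor itself rather than an object with a `logits` attribute, which presumably corresponds to the `model_returns_logits` option of `LLMTrainer`.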

How To Start Training?

You need to create an LLMTrainer object and call .train() on it. Read about its parameters below:

LLMTrainer() Parameters

model:        torch.nn.Module = None,                      # The neural network model to train
optimizer:    torch.optim.Optimizer = None,                # Optimizer responsible for updating model weights
scheduler:    torch.optim.lr_scheduler.LRScheduler = None, # Learning rate scheduler for dynamic adjustment
tokenizer:    PreTrainedTokenizer | AutoTokenizer = None,  # Tokenizer for generating text (used if verbose > 0 during training)
model_returns_logits: bool = False                         # True if model(X) returns logits directly; False if it returns an object with a `logits` attribute

Only the model is required. The other parameters are optional and fall back to default values when omitted.

LLMTrainer.train() Parameters

| Parameter | Type | Description | Default value |
| --- | --- | --- | --- |
| max_steps | int | The maximum number of training steps | 5,000 |
| save_each_n_steps | int | The interval of steps at which to save model checkpoints | 1,000 |
| print_logs_each_n_steps | int | The interval of steps at which to print training logs | 1 |
| BATCH_SIZE | int | The total batch size for training | 256 |
| MINI_BATCH_SIZE | int | The mini-batch size for gradient accumulation | 16 |
| context_window | int | The context window size for the data loader | 128 |
| data_dir | str | The directory containing the training data | "data" |
| logging_file | Union[str, None] | The file path for logging training metrics | "logs_training.csv" |
| generate_each_n_steps | int | The interval of steps at which to generate and print text samples | 200 |
| prompt | str | Beginning of the sentence that the model will continue | "Once upon a time" |
| save_dir | str | The directory to save model checkpoints | "checkpoints" |

Every parameter has a default value, so you can start training by calling .train() on your LLMTrainer object with no arguments.
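
To get a feel for how the batch-related defaults interact, the rough calculation below may help. It is a sketch under stated assumptions, not the library's own logic: it assumes BATCH_SIZE counts sequences per optimizer step, MINI_BATCH_SIZE counts sequences per forward/backward pass, and every sequence holds context_window tokens. The function `training_budget` is hypothetical.

```python
def training_budget(max_steps=5_000, batch_size=256, mini_batch_size=16,
                    context_window=128):
    """Estimate accumulation passes and token counts (assumptions in the text above)."""
    if batch_size % mini_batch_size != 0:
        raise ValueError("batch_size should be a multiple of mini_batch_size")
    # Forward/backward passes accumulated before each weight update.
    accumulation_steps = batch_size // mini_batch_size
    # Tokens processed per optimizer step and over the whole run.
    tokens_per_step = batch_size * context_window
    total_tokens = tokens_per_step * max_steps
    return {
        "accumulation_steps": accumulation_steps,
        "tokens_per_step": tokens_per_step,
        "total_tokens": total_tokens,
    }


budget = training_budget()
print(budget)
```

With the default values this gives 16 accumulation passes per step, 32,768 tokens per step, and 163,840,000 tokens over 5,000 steps, which is a useful number to compare against the size of the dataset you created.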

To contribute (instructions for Linux)

  1. Fork the repository.
  2. Set up environment:
python3 -m venv .venv
source .venv/bin/activate
pip install poetry
poetry install
  3. Make changes.
  4. Apply the linter:
$ pip install pylint==3.3.5
$ pylint $(git ls-files '*.py')
  5. Run tests locally:
pip install pytest
poetry run pytest
  6. Commit and push your changes.
  7. Create a pull request from your fork to the main repository.
