🤖 Train your LLMs with ease and fun.
llm_trainer in 5 Lines of Code
from llm_trainer import create_dataset, LLMTrainer
create_dataset(save_dir="data") # Generate the default FineWeb dataset
model = ... # Define or load your model (GPT, xLSTM, Mamba...)
trainer = LLMTrainer(model) # Initialize trainer with default settings
trainer.train(data_dir="data") # Start training on the dataset
🔴 YouTube Video: Train LLMs in code, spelled out
[!NOTE] Explore usage examples
Installation
$ pip install llm-trainer
How to Prepare Data
Option 1: Use the Default FineWeb Dataset
from llm_trainer import create_dataset
create_dataset(
    save_dir="data",      # Where to save the created dataset
    chunks_limit=1_500,   # Maximum number of token files (chunks) to create
    chunk_size=int(1e6),  # Number of tokens per chunk
)
Option 2: Use your own data
- Your dataset should be structured as a JSON array, where each entry contains a "text" field. You can store your data in one or multiple JSON files.
Example JSON file:
[
{"text": "Learn about LLMs: https://www.youtube.com/@_NickTech"},
{"text": "Open-source python library to train LLMs: https://github.com/Skripkon/llm_trainer."},
{"text": "My name is Nikolay Skripko. Hello from Russia (2025)."}
]
- Run the following code to convert your JSON files into a tokenized dataset:
from llm_trainer import create_dataset_from_json
create_dataset_from_json(
    save_dir="data",        # Where to save the created dataset
    json_dir="json_files",  # Path to your JSON files
    chunks_limit=1_500,     # Maximum number of token files (chunks) to create
    chunk_size=int(1e6),    # Number of tokens per chunk
)
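If your raw texts live in Python rather than in JSON files, the layout described above can be produced with the standard library. The following is a minimal sketch: the helper names `write_json_shards` and `validate_json_file`, the `shard_size` parameter, and the `part_00000.json` naming scheme are illustrative assumptions, not part of llm_trainer.

```python
import json
from pathlib import Path


def write_json_shards(texts, json_dir, shard_size=1000):
    """Write texts as JSON arrays of {"text": ...} records, shard_size records per file.

    Hypothetical helper: the file naming is an arbitrary choice for this sketch.
    """
    out_dir = Path(json_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for start in range(0, len(texts), shard_size):
        records = [{"text": t} for t in texts[start:start + shard_size]]
        path = out_dir / f"part_{start // shard_size:05d}.json"
        # ensure_ascii=False keeps non-ASCII text human-readable in the file
        path.write_text(json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8")
        paths.append(path)
    return paths


def validate_json_file(path):
    """Return the number of records, or raise ValueError if the layout is wrong."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, list):
        raise ValueError(f"{path}: top level must be a JSON array")
    for i, entry in enumerate(data):
        if not isinstance(entry, dict) or not isinstance(entry.get("text"), str):
            raise ValueError(f"{path}: entry {i} needs a string 'text' field")
    return len(data)


if __name__ == "__main__":
    written = write_json_shards(["First document.", "Second document."], "json_files")
    for p in written:
        print(p, validate_json_file(p))
```

Validating the files before tokenization is optional, but it catches a missing "text" field early instead of partway through dataset creation.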
Which Models Are Valid?
You can train ANY LLM that expects a tensor X with shape (batch_size, context_window) as input and returns logits during the forward pass.
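To make the input/output contract concrete, here is a toy PyTorch model that takes token ids of shape (batch_size, context_window) and returns logits. `TinyLM`, `check_contract`, and the sizes used are hypothetical and exist only for illustration. The output shape (batch_size, context_window, vocab_size) is the usual convention for next-token prediction and is assumed here rather than confirmed by the library.

```python
import torch
from torch import nn


class TinyLM(nn.Module):
    """Toy model (hypothetical): token embedding followed by a projection to the vocabulary."""

    def __init__(self, vocab_size: int, d_model: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch_size, context_window) of integer token ids
        hidden = self.embed(x)        # (batch_size, context_window, d_model)
        return self.lm_head(hidden)   # (batch_size, context_window, vocab_size)


def check_contract(model: nn.Module, vocab_size: int, batch_size: int = 2, context_window: int = 8):
    """Feed random token ids through the model and return the shape of the logits."""
    x = torch.randint(0, vocab_size, (batch_size, context_window))
    with torch.no_grad():
        logits = model(x)
    return tuple(logits.shape)


if __name__ == "__main__":
    print(check_contract(TinyLM(vocab_size=100), vocab_size=100))
```

A model like this returns the logits tensor directly rather than an object with a `logits` attribute, so it would presumably be paired with the `model_returns_logits` flag described under the LLMTrainer() parameters.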
How To Start Training?
You need to create an LLMTrainer object and call .train() on it. Read about its parameters below:
LLMTrainer() parameters
model: torch.nn.Module = None, # The neural network model to train
optimizer: torch.optim.Optimizer = None, # Optimizer responsible for updating model weights
scheduler: torch.optim.lr_scheduler.LRScheduler = None, # Learning rate scheduler for dynamic adjustment
tokenizer: PreTrainedTokenizer | AutoTokenizer = None, # Tokenizer for generating text samples (used if verbose > 0 during training)
model_returns_logits: bool = False # True if model(X) returns the logits tensor directly; False if it returns an object with a `logits` attribute
Only the model is required. The other arguments are optional and fall back to default values when omitted.
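If you prefer to supply your own optimizer and scheduler instead of the defaults, one possible setup looks like the sketch below. The choice of AdamW with cosine annealing, the hyperparameter values, the placeholder model, and the helper name `build_trainer_kwargs` are all assumptions made for illustration. Setting `T_max` to the number of training steps further assumes the trainer advances the scheduler once per step, which the description above does not state.

```python
import torch
from torch import nn

MAX_STEPS = 5_000  # assumed number of training steps for the schedule


def build_trainer_kwargs(model: nn.Module, max_steps: int = MAX_STEPS) -> dict:
    """Hypothetical helper: bundle a model with a custom optimizer and scheduler."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
    # Cosine decay over max_steps; assumes one scheduler step per training step
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=max_steps)
    return {"model": model, "optimizer": optimizer, "scheduler": scheduler}


# Placeholder model so the sketch runs on its own; substitute your real LLM
model = nn.Sequential(nn.Embedding(100, 16), nn.Linear(16, 100))
kwargs = build_trainer_kwargs(model)

# With llm_trainer installed, the arguments would be passed through like this:
# from llm_trainer import LLMTrainer
# trainer = LLMTrainer(**kwargs)
```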
LLMTrainer.train() Parameters
| Parameter | Type | Description | Default value |
|---|---|---|---|
| max_steps | int | The maximum number of training steps | 5,000 |
| save_each_n_steps | int | The interval of steps at which to save model checkpoints | 1,000 |
| print_logs_each_n_steps | int | The interval of steps at which to print training logs | 1 |
| BATCH_SIZE | int | The total batch size for training | 256 |
| MINI_BATCH_SIZE | int | The mini-batch size for gradient accumulation | 16 |
| context_window | int | The context window size for the data loader | 128 |
| data_dir | str | The directory containing the training data | "data" |
| logging_file | Union[str, None] | The file path for logging training metrics | "logs_training.csv" |
| generate_each_n_steps | int | The interval of steps at which to generate and print text samples | 200 |
| prompt | str | Beginning of the sentence that the model will continue | "Once upon a time" |
| save_dir | str | The directory to save model checkpoints | "checkpoints" |
Every parameter has a default value, so you can start training by calling .train() on your LLMTrainer object with no arguments.
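The batch-related defaults imply a certain amount of work per step, which a few lines of arithmetic make explicit. This sketch assumes that BATCH_SIZE and MINI_BATCH_SIZE count sequences (not tokens) and that the number of gradient accumulation passes is BATCH_SIZE divided by MINI_BATCH_SIZE. Both are common conventions but are not confirmed by the parameter descriptions. `training_budget` is a hypothetical helper, and the dataset size uses the `create_dataset` defaults shown earlier (1,500 chunks of 1e6 tokens).

```python
def training_budget(max_steps=5_000, batch_size=256, mini_batch_size=16,
                    context_window=128, dataset_tokens=1_500 * 1_000_000):
    """Estimate per-step and total token counts under the stated assumptions."""
    if batch_size % mini_batch_size != 0:
        raise ValueError("batch_size should be a multiple of mini_batch_size")
    accumulation_steps = batch_size // mini_batch_size  # forward/backward passes per step
    tokens_per_step = batch_size * context_window       # assumes batch size is in sequences
    total_tokens = tokens_per_step * max_steps
    return {
        "accumulation_steps": accumulation_steps,
        "tokens_per_step": tokens_per_step,
        "total_tokens": total_tokens,
        "dataset_fraction": total_tokens / dataset_tokens,
    }


if __name__ == "__main__":
    for name, value in training_budget().items():
        print(f"{name}: {value}")
```

Under these assumptions the defaults give 16 accumulation passes and 32,768 tokens per step, or 163,840,000 tokens over 5,000 steps, which is roughly 11% of the default 1.5-billion-token dataset.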
To contribute
- Fork the repository.
- Make changes.
- Run the linter:
$ pip install pylint==3.3.5
$ pylint $(git ls-files '*.py')
- Commit and push your changes.
- Create a pull request from your fork to the main repository.
Download files
File details
Details for the file llm_trainer-1.0.1.tar.gz.
File metadata
- Download URL: llm_trainer-1.0.1.tar.gz
- Upload date:
- Size: 13.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.2 CPython/3.12.3 Linux/5.15.167.4-microsoft-standard-WSL2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 14a02bd0916ea9aea3177744f99fde1951eb42150f41247083eb6004db44f943 |
| MD5 | efa0aa41cc4914672e0be9b412ba39ae |
| BLAKE2b-256 | 6962fe3c304a17ece12de3212d36c8a6346b288d731fbe1a1f928b018039a5ae |
File details
Details for the file llm_trainer-1.0.1-py3-none-any.whl.
File metadata
- Download URL: llm_trainer-1.0.1-py3-none-any.whl
- Upload date:
- Size: 15.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.2 CPython/3.12.3 Linux/5.15.167.4-microsoft-standard-WSL2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e9bc4711ca854fc92f2754cc1b74457fc70616e868105775276bfb560e947e87 |
| MD5 | 650e660613bcae8b187b9287657bbba0 |
| BLAKE2b-256 | 671311da20808697fdb4d7b42cc35b4f1b6ad7a71eba1ac4e439aa3ea1c490a1 |