🤖 Train your LLMs with ease and fun.

Quick Start
Installation
Data Preparation
- Option 1: Default FineWeb Dataset
- Option 2: Custom Data
Model Requirements
Training Guide
- LLMTrainer Parameters
- Training Parameters
Contributing

`llm_trainer` in 5 Lines of Code

from llm_trainer import create_dataset, LLMTrainer

create_dataset(save_dir="data")   # Generate the default FineWeb dataset
model = ...                       # Define or load your model (GPT, xLSTM, Mamba...)
trainer = LLMTrainer(model)       # Initialize trainer with default settings
trainer.train(data_dir="data")    # Start training on the dataset

🔴 YouTube Video: Train LLMs in code, spelled out

[!NOTE] Explore usage examples

Installation

$ pip install llm-trainer

How to Prepare Data

Option 1: Use the Default FineWeb Dataset

from llm_trainer import create_dataset

create_dataset(save_dir="data",         # Where to save created dataset
               chunks_limit=1_500,      # Maximum number of files (chunks) with tokens to create
               chunk_size=int(1e6))     # Number of tokens per chunk
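
As a rough sizing aid, the two limits above bound how large the dataset can get: at most `chunks_limit` files of `chunk_size` tokens each. The helper below is hypothetical (it is not part of `llm_trainer`) and assumes every chunk is filled completely, so treat the result as an upper bound.

```python
def max_dataset_tokens(chunks_limit: int, chunk_size: int) -> int:
    """Upper bound on tokens written, assuming every chunk is completely full."""
    return chunks_limit * chunk_size


# Same values as in the create_dataset() call above
total = max_dataset_tokens(chunks_limit=1_500, chunk_size=int(1e6))
print(f"{total:,} tokens")  # 1,500,000,000 tokens
```

Lowering `chunks_limit` is the simplest way to get a smaller dataset for a quick experiment.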

Option 2: Use Your Own Data

Structure your dataset as a JSON array in which every entry is an object with a "text" field. The data can live in a single JSON file or be split across several files.

Example JSON file:

[
   {"text": "Learn about LLMs: https://www.youtube.com/@_NickTech"},
   {"text": "Open-source python library to train LLMs: https://github.com/Skripkon/llm_trainer."},
   {"text": "My name is Nikolay Skripko. Hello from Russia (2025)."}
]
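
If your texts start out as plain Python strings, a small script can write them in this layout and check the result. This is a minimal sketch using only the standard library. The helper names `write_json_dataset` and `validate_json_dataset` and the file name `part_000.json` are made up for illustration and are not part of `llm_trainer`.

```python
import json
from pathlib import Path


def write_json_dataset(texts, path):
    """Write an iterable of strings as a JSON array of {"text": ...} objects."""
    records = [{"text": t} for t in texts]
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return path


def validate_json_dataset(path):
    """Return the number of entries; raise ValueError if the layout is wrong."""
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError("top-level JSON value must be an array")
    for i, entry in enumerate(data):
        if not isinstance(entry, dict) or not isinstance(entry.get("text"), str):
            raise ValueError(f"entry {i} has no string 'text' field")
    return len(data)


out = write_json_dataset(["First document.", "Second document."],
                         "json_files/part_000.json")
print(validate_json_dataset(out))  # 2
```

Validating before tokenization catches malformed entries early, which is cheaper than discovering them partway through dataset creation.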

Run the following code to convert your JSON files into a tokenized dataset:

from llm_trainer import create_dataset_from_json

create_dataset_from_json(save_dir="data",        # Where to save created dataset
                         json_dir="json_files",  # Path to your JSON files
                         chunks_limit=1_500,     # Maximum number of files (chunks) with tokens to create
                         chunk_size=int(1e6))    # Number of tokens per chunk

Which Models Are Valid?

You can train any LLM whose forward pass takes a tensor X of shape (batch_size, context_window) and returns logits, either directly or as an object with a `logits` attribute (see `model_returns_logits` below).
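
To make the requirement concrete, here is a deliberately tiny PyTorch model that satisfies it. It is only a sketch: the class name `TinyLM` and its sizes are invented for illustration, and the output shape (batch_size, context_window, vocab_size) is the usual convention for language-model logits rather than something this README states explicitly.

```python
import torch
from torch import nn


class TinyLM(nn.Module):
    """Toy language model: token IDs in, per-position logits out."""

    def __init__(self, vocab_size: int = 50_257, d_model: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch_size, context_window) integer token IDs
        hidden = self.embed(x)      # (batch_size, context_window, d_model)
        return self.head(hidden)    # (batch_size, context_window, vocab_size)


model = TinyLM(vocab_size=1_000)
x = torch.randint(0, 1_000, (4, 128))  # batch_size=4, context_window=128
logits = model(x)
print(logits.shape)  # torch.Size([4, 128, 1000])
```

Because this model returns the logits tensor itself rather than a wrapper object, it would presumably be paired with `model_returns_logits=True` when constructing the trainer.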

How To Start Training?

Create an LLMTrainer object and call .train() on it. The parameters of both are described below:

`LLMTrainer()` Parameters

model:        torch.nn.Module = None,                      # The neural network model to train (required)
optimizer:    torch.optim.Optimizer = None,                # Optimizer responsible for updating model weights
scheduler:    torch.optim.lr_scheduler.LRScheduler = None, # Learning rate scheduler for dynamic adjustment
tokenizer:    PreTrainedTokenizer | AutoTokenizer = None,  # Tokenizer for generating text (used if verbose > 0 during training)
model_returns_logits: bool = False                         # True if model(X) returns logits directly; False if it returns an object with a `logits` attribute

Only the model is required. The other arguments are optional and fall back to default values when omitted.
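
The `model_returns_logits` flag exists because models follow two conventions: some return the logits tensor itself, others return an output object that carries it in a `logits` attribute. The sketch below shows one way a training loop could handle both cases. `extract_logits` is a hypothetical helper written for this illustration, not the library's actual code, and the plain list stands in for a tensor.

```python
from types import SimpleNamespace


def extract_logits(output, model_returns_logits: bool):
    """Return the logits from a forward-pass result under either convention."""
    if model_returns_logits:
        return output          # the model returned the logits directly
    return output.logits       # the model returned a wrapper object


raw = [[0.1, 0.9]]                      # stand-in for a logits tensor
wrapped = SimpleNamespace(logits=raw)   # stand-in for an output object with `.logits`

assert extract_logits(raw, model_returns_logits=True) is raw
assert extract_logits(wrapped, model_returns_logits=False) is raw
```

With the default `model_returns_logits=False`, a model whose forward pass returns a bare tensor would fail at the `.logits` lookup in a loop written this way, so it is worth setting the flag explicitly for custom models.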

`LLMTrainer.train()` Parameters

| Parameter | Type | Description | Default value |
|---|---|---|---|
| `max_steps` | `int` | The maximum number of training steps | 5,000 |
| `save_each_n_steps` | `int` | The interval of steps at which to save model checkpoints | 1,000 |
| `print_logs_each_n_steps` | `int` | The interval of steps at which to print training logs | 1 |
| `BATCH_SIZE` | `int` | The total batch size for training | 256 |
| `MINI_BATCH_SIZE` | `int` | The mini-batch size for gradient accumulation | 16 |
| `context_window` | `int` | The context window size for the data loader | 128 |
| `data_dir` | `str` | The directory containing the training data | "data" |
| `logging_file` | `Union[str, None]` | The file path for logging training metrics | "logs_training.csv" |
| `generate_each_n_steps` | `int` | The interval of steps at which to generate and print text samples | 200 |
| `prompt` | `str` | Beginning of the sentence that the model will continue | "Once upon a time" |
| `save_dir` | `str` | The directory to save model checkpoints | "checkpoints" |

Every parameter has a default value, so you can start training by calling .train() on your LLMTrainer instance with no arguments.
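
To see how `BATCH_SIZE`, `MINI_BATCH_SIZE`, and `context_window` relate, the arithmetic below works through the default values. It assumes the common gradient-accumulation scheme in which one optimizer step sums gradients over `BATCH_SIZE / MINI_BATCH_SIZE` mini-batches and batch sizes are counted in sequences. The library's internals may differ, and `accumulation_plan` is a hypothetical helper.

```python
def accumulation_plan(batch_size: int, mini_batch_size: int, context_window: int):
    """Return (mini-batches per optimizer step, tokens per optimizer step)."""
    if batch_size % mini_batch_size != 0:
        raise ValueError("BATCH_SIZE should be a multiple of MINI_BATCH_SIZE")
    accum_steps = batch_size // mini_batch_size
    tokens_per_step = batch_size * context_window
    return accum_steps, tokens_per_step


# Defaults from the table: BATCH_SIZE=256, MINI_BATCH_SIZE=16, context_window=128
accum_steps, tokens_per_step = accumulation_plan(256, 16, 128)
print(accum_steps, tokens_per_step)  # 16 32768

# Tokens seen over a full default run of max_steps=5,000
print(f"{tokens_per_step * 5_000:,}")  # 163,840,000
```

Under these assumptions, `MINI_BATCH_SIZE` mainly controls memory use per forward pass, while `BATCH_SIZE` sets the effective batch seen by the optimizer.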

To contribute (instructions for Linux)

1. Fork the repository.
2. Set up the environment:

$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install poetry
$ poetry install

3. Make your changes.
4. Run the linter:

$ pip install pylint==3.3.5
$ pylint $(git ls-files '*.py')

5. Run the tests locally:

$ pip install pytest
$ poetry run pytest

6. Commit and push your changes.
7. Create a pull request from your fork to the main repository.

Project details

Release history

- 1.0.4 (this version): Apr 24, 2025
- 1.0.3: Apr 24, 2025
- 1.0.2: Apr 9, 2025
- 1.0.1: Apr 8, 2025
- 1.0.0: Apr 7, 2025
- 0.1.25: Apr 3, 2025
- 0.1.22: Mar 21, 2025
- 0.1.16: Mar 18, 2025

Download files

Source Distribution

llm_trainer-1.0.4.tar.gz (16.5 kB view details)

Uploaded Apr 24, 2025 Source

Built Distribution

llm_trainer-1.0.4-py3-none-any.whl (17.4 kB view details)

Uploaded Apr 24, 2025 Python 3

File details

Details for the file llm_trainer-1.0.4.tar.gz.

File metadata

Download URL: llm_trainer-1.0.4.tar.gz
Upload date: Apr 24, 2025
Size: 16.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.2 CPython/3.12.3 Linux/5.15.167.4-microsoft-standard-WSL2

File hashes

Hashes for llm_trainer-1.0.4.tar.gz
Algorithm	Hash digest
SHA256	`76ecd894eab5524e730a8a969b06a403e9ddc858995ca51748d7dcc392dba72d`
MD5	`53d99a05848aca132e29f69295c956cf`
BLAKE2b-256	`f0de5f5339db1b27a8ff64ec225c2c0b9ff58fa6fcb12f97bebb5b3b5169bb46`

File details

Details for the file llm_trainer-1.0.4-py3-none-any.whl.

File metadata

Download URL: llm_trainer-1.0.4-py3-none-any.whl
Upload date: Apr 24, 2025
Size: 17.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.2 CPython/3.12.3 Linux/5.15.167.4-microsoft-standard-WSL2

File hashes

Hashes for llm_trainer-1.0.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4136ece2c642be133f2963c0c001848538081acbb8e4ecfae859662a34b0280f`
MD5	`f4e9edd01f4e7d2b1d766c369b8784ce`
BLAKE2b-256	`0434437bac08695f1ad442cca9e84dd03f444a91d2ed8bb1d809e1cb9c8d231b`
