Easy Language Model Training Library
Project description
A Python package for training language models from scratch in a few lines of code.
EasyLM provides a simple interface for training large language models from scratch: the tokenizer, dataset, model, and trainer are each set up with a single call.
Installation
Stable Version
pip install langtrain
Development Version
pip install git+https://github.com/sayedshaun/langtrain.git
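After installing, one way to confirm that the package is visible to the current interpreter is to query its distribution metadata. The sketch below uses only the Python standard library; the helper name `installed_version` is hypothetical and is not part of langtrain.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str = "langtrain"):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(f"langtrain version: {found}" if found else "langtrain is not installed")
```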
Usage
Training
from langtrain.model import LlamaModel
from langtrain.data import IterableCausalDataset
from langtrain.tokenizer import Tokenizer
from langtrain.config import TrainingConfig, LlamaConfig
from langtrain.trainer import Trainer
from langtrain.utils import trainable_parameters

# Location of the training data
data_path = "data_directory"

# Tokenizer with a 5,000-entry vocabulary
tokenizer = Tokenizer(data_path, vocab_size=5000)

# Causal LM dataset with a context length of 50 tokens
dataset = IterableCausalDataset(data_path, tokenizer, n_ctx=50, batch=10000)

# Small Llama-style model; max_seq_len matches the dataset's n_ctx
model = LlamaModel(
    LlamaConfig(
        vocab_size=tokenizer.vocab_size,
        hidden_size=128,
        num_heads=4,
        num_layers=4,
        dropout=0.1,
        max_seq_len=50,
        norm_epsilon=1e-5,
    )
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    model_name="nano-llama",
    collate_fn=IterableCausalDataset.collate_fn,
    config=TrainingConfig(
        train_data=dataset,
        learning_rate=1e-4,
        epochs=5,
        batch_size=8,
        device="cuda",
        logging_steps=100,
        num_checkpoints=3,
        report_to_wandb=True,
    ),
)

# Report the model's trainable parameters
print(trainable_parameters(model))

# Resume from an existing checkpoint; omit this line when starting a fresh run
trainer.from_checkpoint("nano-llama/checkpoint-200")

# Start training
trainer.train()
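For readers new to causal language modelling, the sketch below illustrates what a context length such as `n_ctx=50` usually means: the token stream is cut into fixed-length windows, and each target sequence is the input shifted by one token. This is a generic illustration in plain Python, not langtrain's actual implementation; `causal_windows` is a hypothetical helper, and `IterableCausalDataset` may chunk its data differently.

```python
from typing import Iterator, List, Tuple


def causal_windows(token_ids: List[int], n_ctx: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (inputs, targets) pairs of length n_ctx from a flat token stream.

    Targets are the inputs shifted left by one position, so the model
    learns to predict token i+1 from tokens 0..i. Input windows do not
    overlap, and a trailing remainder too short for a full window is dropped.
    """
    # Each window needs n_ctx + 1 tokens: n_ctx inputs plus one extra target.
    for start in range(0, len(token_ids) - n_ctx, n_ctx):
        chunk = token_ids[start : start + n_ctx + 1]
        yield chunk[:-1], chunk[1:]


# Toy example: ten token ids and a context length of four.
pairs = list(causal_windows(list(range(10)), n_ctx=4))
```

With the toy input above, `pairs` holds two windows: inputs `[0, 1, 2, 3]` with targets `[1, 2, 3, 4]`, and inputs `[4, 5, 6, 7]` with targets `[5, 6, 7, 8]`.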
Pretrained Details:
Once the model is trained, the pretrained directory will look like this:
nano-llama/
├── /checkpoint-200
├── train_config.yaml
├── model_config.yaml
├── pytorch_model.pt
├── VOCAB.model
└── VOCAB.vocab
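Before loading a trained model, it can be handy to check that the output directory contains the files listed above. The following is a minimal sketch using only the standard library; `missing_artifacts` and `EXPECTED_FILES` are hypothetical names introduced here, and the file list is copied from the directory tree rather than from langtrain's code.

```python
from pathlib import Path
from typing import List

# File names taken from the directory listing above.
EXPECTED_FILES = (
    "train_config.yaml",
    "model_config.yaml",
    "pytorch_model.pt",
    "VOCAB.model",
    "VOCAB.vocab",
)


def missing_artifacts(model_dir: str) -> List[str]:
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling `missing_artifacts("nano-llama")` returns an empty list when every expected file is present.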
Inference
from langtrain.model import LlamaModel
from langtrain.tokenizer import Tokenizer
tokenizer = Tokenizer.from_pretrained("nano-llama")
model = LlamaModel.from_pretrained("nano-llama")
inputs = tokenizer.encode("Sherlock Holmes")
output = model.generate(inputs, eos_id=tokenizer.eos_token_id, max_new_tokens=50)
print(tokenizer.decode(output))
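The three inference steps above (encode, generate, decode) can be wrapped in a small helper for repeated prompting. This is a sketch rather than part of the library: `generate_text` is a hypothetical function, and it assumes only the methods and keyword arguments that appear in the inference snippet above.

```python
def generate_text(model, tokenizer, prompt: str, max_new_tokens: int = 50) -> str:
    """Encode a prompt, generate a continuation, and decode it back to text.

    Assumes `model` and `tokenizer` expose the members used in the inference
    snippet: encode, decode, eos_token_id, and generate.
    """
    inputs = tokenizer.encode(prompt)
    output = model.generate(
        inputs,
        eos_id=tokenizer.eos_token_id,
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.decode(output)
```

With the `model` and `tokenizer` loaded as shown, usage would be `print(generate_text(model, tokenizer, "Sherlock Holmes"))`.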
Download files
Source Distribution
langtrain-0.0.1.tar.gz (19.3 kB)
Built Distribution
langtrain-0.0.1-py3-none-any.whl (23.0 kB)
File details
Details for the file langtrain-0.0.1.tar.gz.
File metadata
- Download URL: langtrain-0.0.1.tar.gz
- Upload date:
- Size: 19.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b1112126f0a52d198eb00b0520aaf798d9b7c2f802eff221afc201a4989171e1 |
| MD5 | a738d1b86e16cdf654cb6238eefeca2c |
| BLAKE2b-256 | 54ecf463b8f59caadb59e63794f93c0b07ac8b7829058f32441f3eb5773f91e1 |
File details
Details for the file langtrain-0.0.1-py3-none-any.whl.
File metadata
- Download URL: langtrain-0.0.1-py3-none-any.whl
- Upload date:
- Size: 23.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7fed4b7dc6754c038d164559f264ffc31c9e3a7ace70b2ea0a6ebddae6f9fe42 |
| MD5 | 3bff1a610e6d73ade8f3aa37cd8548cf |
| BLAKE2b-256 | fb94e066ae95fa3301b86b9f0b520e8133318a0a14e4ebb8d487c40778cf2f58 |