Language Model Training Library
A Python package for training language models from scratch in a few lines of code.
LangTrain is a Python package for training language models from scratch. It provides a simple interface that covers tokenizer training, dataset preparation, model configuration, and the training loop in a few lines of code.
Installation
Stable Version
pip install langtrain
Development Version
pip install git+https://github.com/sayedshaun/langtrain.git
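To confirm which version ended up in the environment, a small check with the standard library's `importlib.metadata` can help. This is a minimal sketch; the helper name `installed_version` is hypothetical and not part of LangTrain.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("langtrain")
    print(found if found else "langtrain is not installed")
```

Returning `None` instead of raising keeps the check usable in setup scripts that need to decide whether to install the package.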
Usage
Quick Start
import langtrain as lt

# Path to the training data
data_path = "data_directory"

# Tokenizer with a 5,000-entry vocabulary and a causal dataset with a 512-token context
tokenizer = lt.tokenizer.SentencePieceTokenizer(data_path, vocab_size=5000)
dataset = lt.dataset.SimpleCausalDataset(data_path, tokenizer, n_ctx=512)

# LLaMA-style model sized to match the tokenizer and the dataset
model = lt.model.LlamaModel(
    lt.model.LlamaConfig(
        vocab_size=tokenizer.vocab_size,
        hidden_size=512,
        hidden_layers=8,
        num_heads=8,
        dropout=0.2,
        norm_epsilon=1e-6,
        max_seq_len=dataset.n_ctx,
    )
)

# Training hyperparameters
train_config = lt.config.TrainingConfig(
    epochs=5,
    batch_size=4,
    learning_rate=1e-4,
    device="cuda",
    precision="fp16",
)

# Train the model and save it under the name "nano-llama"
trainer = lt.trainer.Trainer(
    model=model,
    train_config=train_config,
    dataset=dataset,
    tokenizer=tokenizer,
    collate_fn=lt.utils.collate_fn,
    model_name="nano-llama",
)
trainer.train()
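The `n_ctx` argument sets the context length that the dataset feeds to the model. How `SimpleCausalDataset` builds its samples is not documented here, so the following is only a sketch of the usual causal language modeling setup, assuming non-overlapping windows whose targets are the inputs shifted by one token. The function name `causal_windows` is hypothetical and not part of LangTrain.

```python
from typing import List, Tuple


def causal_windows(token_ids: List[int], n_ctx: int) -> List[Tuple[List[int], List[int]]]:
    """Split a token stream into (input, target) pairs of length n_ctx.

    The target is the input shifted left by one token, so position i
    is trained to predict token i + 1.
    """
    if n_ctx < 1:
        raise ValueError("n_ctx must be positive")
    pairs = []
    # Step by n_ctx so windows do not overlap; each window needs n_ctx + 1 tokens.
    for start in range(0, len(token_ids) - n_ctx, n_ctx):
        chunk = token_ids[start : start + n_ctx + 1]
        pairs.append((chunk[:-1], chunk[1:]))
    return pairs


if __name__ == "__main__":
    for inputs, targets in causal_windows(list(range(10)), n_ctx=4):
        print(inputs, "->", targets)
```

Trailing tokens that cannot fill a complete window are dropped in this sketch; a real dataset might pad them instead.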
Pretrained Details
Once the model is trained, the pretrained directory will look like this:
nano-llama/
├── checkpoint-200/
├── train_config.yaml
├── model_config.yaml
├── pytorch_model.pt
├── VOCAB.model
└── VOCAB.vocab
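Before loading a trained model, it can be useful to verify that the output directory is complete. The sketch below checks only the file names shown in the listing above; the helper `missing_files` is hypothetical and not part of LangTrain.

```python
from pathlib import Path
from typing import List

# File names taken from the directory listing; checkpoint folders are not
# checked because their step numbers depend on the training run.
EXPECTED_FILES = [
    "train_config.yaml",
    "model_config.yaml",
    "pytorch_model.pt",
    "VOCAB.model",
    "VOCAB.vocab",
]


def missing_files(model_dir: str) -> List[str]:
    """Return the expected files that are absent from `model_dir`."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files("nano-llama")
    print("complete" if not missing else f"missing: {missing}")
```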
Inference
import langtrain as lt

# Load the tokenizer and model saved in the "nano-llama" directory
tokenizer = lt.tokenizer.Tokenizer.from_pretrained("nano-llama")
model = lt.model.LlamaModel.from_pretrained("nano-llama")

# Encode a prompt, generate up to 50 new tokens, and decode the result
inputs = tokenizer.encode("Sherlock Holmes")
output = model.generate(inputs, eos_id=tokenizer.eos_token_id, max_new_tokens=50)
print(tokenizer.decode(output))
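The `eos_id` and `max_new_tokens` arguments are the two stopping conditions of autoregressive generation. The decoding strategy used by `model.generate` is not described here, so the sketch below assumes plain greedy decoding and uses a toy scoring function in place of a real model. Both `greedy_generate` and `toy_scores` are hypothetical names, not part of LangTrain.

```python
from typing import Callable, List, Sequence


def greedy_generate(
    next_token_scores: Callable[[Sequence[int]], Sequence[float]],
    prompt_ids: Sequence[int],
    eos_id: int,
    max_new_tokens: int,
) -> List[int]:
    """Append the highest-scoring token until eos_id or the token budget is reached."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = next_token_scores(ids)
        # Greedy choice: the index of the largest score is the next token id.
        next_id = max(range(len(scores)), key=scores.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids


def toy_scores(ids: Sequence[int]) -> List[float]:
    """Hypothetical stand-in for a model: always prefers (last token + 1) mod 5."""
    scores = [0.0] * 5
    scores[(ids[-1] + 1) % 5] = 1.0
    return scores


if __name__ == "__main__":
    print(greedy_generate(toy_scores, [0], eos_id=3, max_new_tokens=50))
```

With the toy function, generation stops as soon as token 3 is produced, well before the 50-token budget; with an `eos_id` that never appears, the budget is what ends the loop.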
More tutorials can be found here.
Available Model Architectures to train
| Architecture | Source |
|---|---|
| GPT | OpenAI GPT |
| LLaMA | Meta LLaMA |
| BERT | Google BERT |
| ViT | Vision Transformer |