Language Model Training Library
A Python package for training language models from scratch in a few lines of code.
LangTrain is a Python package for training language models from scratch. It provides a simple interface for training large language models in a few lines of code.
Installation
Stable Version
pip install langtrain
Development Version
pip install git+https://github.com/sayedshaun/langtrain.git
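To confirm which release ended up in the environment, the installed version can be read from the package metadata. The following is a minimal sketch using only the standard library; the helper name `installed_version` is illustrative and not part of LangTrain.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("langtrain")
    print(version if version else "langtrain is not installed")
```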
Usage
Training
import langtrain as lt

data_path = "data_directory"

# Build the tokenizer and the causal language modeling dataset from the same data
tokenizer = lt.tokenizer.Tokenizer(data_path, vocab_size=5000)
dataset = lt.dataset.CausalDataset(data_path, tokenizer, n_ctx=512)

model = lt.model.LlamaModel(
    lt.model.LlamaConfig(
        vocab_size=tokenizer.vocab_size,
        hidden_size=128,
        num_heads=4,
        num_layers=4,
        dropout=0.1,
        max_seq_len=50,
        norm_epsilon=1e-5
    )
)

trainer = lt.trainer.Trainer(
    config=lt.config.TrainingConfig(
        train_data=dataset,
        learning_rate=1e-4,
        epochs=5,
        batch_size=8,
        device="cuda",
        logging_steps=100,
        num_checkpoints=3,
        report_to_wandb=True,
        distributed_backend="ddp"
    ),
    model=model,
    tokenizer=tokenizer,
    model_name="nano-llama",
    collate_fn=lt.utils.collate_fn,
)

print(lt.utils.trainable_parameters(model))

# Resume from a saved checkpoint; leave this line out when no checkpoint exists yet
trainer.from_checkpoint("nano-llama/checkpoint-200")
trainer.train()
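Before launching a run, it can help to sanity-check the hyperparameters. The sketch below is not part of LangTrain; `check_config` is a hypothetical helper. It checks two things: that `hidden_size` divides evenly by `num_heads`, as standard multi-head attention requires, and that the dataset context length `n_ctx` does not exceed `max_seq_len`. The second check assumes `max_seq_len` bounds the sequence length the model accepts; the source does not say how LangTrain treats a mismatch.

```python
from typing import List


def check_config(hidden_size: int, num_heads: int,
                 max_seq_len: int, n_ctx: int) -> List[str]:
    """Return human-readable warnings; an empty list means nothing was flagged."""
    warnings = []
    # Standard multi-head attention splits hidden_size evenly across the heads.
    if hidden_size % num_heads != 0:
        warnings.append(
            f"hidden_size={hidden_size} is not divisible by num_heads={num_heads}"
        )
    # ASSUMPTION: max_seq_len is the longest sequence the model can process,
    # so training windows of n_ctx tokens should fit within it.
    if n_ctx > max_seq_len:
        warnings.append(
            f"n_ctx={n_ctx} exceeds max_seq_len={max_seq_len}"
        )
    return warnings


if __name__ == "__main__":
    # Values copied from the training example above.
    for message in check_config(hidden_size=128, num_heads=4,
                                max_seq_len=50, n_ctx=512):
        print("warning:", message)
```

With the values from the training example, only the second check fires, since `n_ctx=512` is larger than `max_seq_len=50`. If the assumption holds, aligning the two values would be the fix.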
Pretrained Details
Once the model is trained, the output directory will look like this:
nano-llama/
├── /checkpoint-200
├── train_config.yaml
├── model_config.yaml
├── pytorch_model.pt
├── VOCAB.model
└── VOCAB.vocab
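A quick way to confirm that a run produced everything shown in the listing above is to check for each file by name. This is a minimal sketch under the assumption that the file names match the listing exactly; `missing_artifacts` is an illustrative helper, not a LangTrain function.

```python
from pathlib import Path
from typing import List

# File names taken from the directory listing above; adjust if a run differs.
EXPECTED_FILES = (
    "train_config.yaml",
    "model_config.yaml",
    "pytorch_model.pt",
    "VOCAB.model",
    "VOCAB.vocab",
)


def missing_artifacts(model_dir: str) -> List[str]:
    """Return the expected file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_artifacts("nano-llama")
    print("all artifacts present" if not missing else f"missing: {missing}")
```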
Inference
import langtrain as lt
tokenizer = lt.tokenizer.Tokenizer.from_pretrained("nano-llama")
model = lt.model.LlamaModel.from_pretrained("nano-llama")
inputs = tokenizer.encode("Sherlock Holmes")
output = model.generate(inputs, eos_id=tokenizer.eos_token_id, max_new_tokens=50)
print(tokenizer.decode(output))
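The `eos_id` and `max_new_tokens` arguments follow a common convention in autoregressive decoding: tokens are appended one at a time until the end-of-sequence token appears or the token budget runs out. The sketch below illustrates that convention with greedy decoding. It is not LangTrain's implementation, and `next_token` is a hypothetical stand-in for a model forward pass followed by picking the most likely token.

```python
from typing import Callable, List, Sequence


def generate(prompt_ids: Sequence[int],
             next_token: Callable[[List[int]], int],
             eos_id: int,
             max_new_tokens: int) -> List[int]:
    """Append tokens until eos_id is produced or max_new_tokens is reached."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        token = next_token(ids)
        ids.append(token)
        # This sketch keeps the EOS token in the output; libraries differ on that.
        if token == eos_id:
            break
    return ids


def toy_next_token(ids: List[int]) -> int:
    """Toy 'model': always predicts the previous token id plus one."""
    return ids[-1] + 1


if __name__ == "__main__":
    # Stops early because token 5 is treated as end-of-sequence.
    print(generate([1, 2], toy_next_token, eos_id=5, max_new_tokens=50))  # [1, 2, 3, 4, 5]
    # Stops because the budget of 2 new tokens is used up.
    print(generate([1, 2], toy_next_token, eos_id=99, max_new_tokens=2))  # [1, 2, 3, 4]
```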
Available Model Architectures to train
| Model Architecture | Source Repository |
|---|---|
| GPT | OpenAI GPT |
| LLaMA | Meta LLaMA |
| BERT | Google BERT |
| ViT | Vision Transformer |
Download files
Source Distribution
langtrain-0.0.3.tar.gz (412.2 kB)
Built Distribution
langtrain-0.0.3-py3-none-any.whl (24.5 kB)
File details
Details for the file langtrain-0.0.3.tar.gz.
File metadata
- Download URL: langtrain-0.0.3.tar.gz
- Upload date:
- Size: 412.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 60b863c9ebf64261255e125c43094ff9a365debad198513d0d065acab6b8b966 |
| MD5 | a7307c8e97e8468d9c3d0be6b61d5e25 |
| BLAKE2b-256 | 18bfcbb072f7e3ad36766c70ed41313f2afad3f888b67419e3097e9dced6f9d4 |
File details
Details for the file langtrain-0.0.3-py3-none-any.whl.
File metadata
- Download URL: langtrain-0.0.3-py3-none-any.whl
- Upload date:
- Size: 24.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 995905e42b2dd0a136cdb19f1cb1a6cb59bf2748651b7228fff56106c59dec8a |
| MD5 | 9180a1b8012deb06ed2b69bb12b5d36e |
| BLAKE2b-256 | 567ce3040666576069707fdef6469759f0b229549d94542eae8c0e8f95d443f6 |
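The SHA256 digests listed above can be used to verify a downloaded file before installing it. The following is a minimal sketch using the standard library `hashlib` module; the helper names `sha256_of` and `matches` are illustrative, and the local file path is assumed to be wherever the archive was saved.

```python
import hashlib
import os


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: str, expected_hex: str) -> bool:
    """Return True if the file's SHA256 digest equals the expected value."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # ASSUMPTION: the source distribution was downloaded to the working directory.
    archive = "langtrain-0.0.3.tar.gz"
    expected = "60b863c9ebf64261255e125c43094ff9a365debad198513d0d065acab6b8b966"
    if os.path.exists(archive):
        print("digest OK" if matches(archive, expected) else "digest MISMATCH")
    else:
        print(f"{archive} not found")
```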