Easy Language Model Training Library

A Python package for training language models from scratch in a few lines of code.

EasyLM, distributed as the langtrain package, provides a simple interface for training large language models from scratch.

Installation

Stable Version

pip install langtrain

Development Version

pip install git+https://github.com/sayedshaun/langtrain.git
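After installing, it can be useful to confirm which version ended up in the environment. The sketch below uses only the standard library's importlib.metadata; the helper name installed_version is an illustrative choice and not part of langtrain.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="langtrain"):
    """Return the installed version string of `package`, or None if it is absent."""
    # installed_version is a hypothetical helper, not a langtrain function.
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(f"langtrain {found}" if found else "langtrain is not installed")
```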

Usage

Training

from langtrain.model import LlamaModel
from langtrain.data import IterableCausalDataset
from langtrain.tokenizer import Tokenizer
from langtrain.config import TrainingConfig, LlamaConfig
from langtrain.trainer import Trainer
from langtrain.utils import trainable_parameters


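# Path to the training data
data_path = "data_directory"
tokenizer = Tokenizer(data_path, vocab_size=5000)
# n_ctx is the context length; it matches max_seq_len in LlamaConfig below
dataset = IterableCausalDataset(data_path, tokenizer, n_ctx=50, batch=10000)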
model = LlamaModel(
    LlamaConfig(
        vocab_size=tokenizer.vocab_size,
        hidden_size=128,
        num_heads=4,
        num_layers=4,
        dropout=0.1,
        max_seq_len=50,
        norm_epsilon=1e-5
    )
)
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    model_name="nano-llama",
    collate_fn=IterableCausalDataset.collate_fn,
    config=TrainingConfig(
        train_data=dataset,
        learning_rate=1e-4,
        epochs=5,
        batch_size=8,
        device="cuda",
        logging_steps=100,
        num_checkpoints=3,
        report_to_wandb=True,
    )
)
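print(trainable_parameters(model))
# Resume from a previously saved checkpoint; this path must already exist,
# so leave the call out when starting a fresh run.
trainer.from_checkpoint("nano-llama/checkpoint-200")
trainer.train()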
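
The training example resumes from a hard-coded checkpoint path. As a rough sketch, the most recent checkpoint could instead be picked automatically, assuming checkpoints are saved as checkpoint-<step> subdirectories of the model_name directory (as the "nano-llama/checkpoint-200" path suggests). The helper latest_checkpoint is hypothetical and not part of langtrain.

```python
import re
from pathlib import Path


def latest_checkpoint(run_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    # Assumption: checkpoints are directories named "checkpoint-<step>".
    pattern = re.compile(r"checkpoint-(\d+)")
    root = Path(run_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = pattern.fullmatch(child.name)
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


# Possible use with the trainer from the example above:
#     checkpoint = latest_checkpoint("nano-llama")
#     if checkpoint is not None:
#         trainer.from_checkpoint(str(checkpoint))
```

Comparing the step numbers as integers avoids the lexicographic trap where "checkpoint-50" would sort after "checkpoint-200".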

Pretrained Details

Once the model is trained, the pretrained directory will look like this:

nano-llama/
    ├── checkpoint-200/
    ├── train_config.yaml
    ├── model_config.yaml
    ├── pytorch_model.pt
    ├── VOCAB.model
    └── VOCAB.vocab
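
Before loading a trained model, a quick check that the output directory is complete can save a confusing error later. The sketch below tests for the files listed in the layout above; the helper name missing_artifacts is hypothetical, and the file list is taken from this README rather than from the library's code.

```python
from pathlib import Path

# File names as shown in the directory layout above.
EXPECTED_FILES = (
    "train_config.yaml",
    "model_config.yaml",
    "pytorch_model.pt",
    "VOCAB.model",
    "VOCAB.vocab",
)


def missing_artifacts(run_dir):
    """Return the expected file names that are not present in `run_dir`."""
    root = Path(run_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_artifacts("nano-llama")
    print("all files present" if not missing else f"missing: {missing}")
```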

Inference

from langtrain.model import LlamaModel
from langtrain.tokenizer import Tokenizer

tokenizer = Tokenizer.from_pretrained("nano-llama")
model = LlamaModel.from_pretrained("nano-llama")
inputs = tokenizer.encode("Sherlock Holmes")
output = model.generate(inputs, eos_id=tokenizer.eos_token_id, max_new_tokens=50)
# Decode the generated token ids back to text and display the result
print(tokenizer.decode(output))
