Language Model Training Library

A Python package for training language models from scratch with a few lines of code

LangTrain is a Python package that provides a simple interface for training large language models from scratch in a few lines of code.

Installation

Stable Version

pip install langtrain

Development Version

pip install git+https://github.com/sayedshaun/langtrain.git
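After either install command, it can be useful to confirm which version ended up in the environment. The following is a minimal sketch that uses only the Python standard library; the helper name `installed_version` is hypothetical and is not part of LangTrain.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("langtrain")
    print(found if found else "langtrain is not installed")
```

Returning `None` rather than raising keeps the check usable in setup scripts that want to decide whether to install the package.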

Usage

Quick Start

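import langtrain as lt

# Directory containing the training data
data_path = "data_directory"

# SentencePiece tokenizer over the data, with a vocabulary size of 5,000
tokenizer = lt.tokenizer.SentencePieceTokenizer(data_path, vocab_size=5000)

# Causal language-modeling dataset with a context length of 512 tokens
dataset = lt.dataset.SimpleCausalDataset(data_path, tokenizer, n_ctx=512)

# LLaMA-style model; vocabulary size and maximum sequence length
# are taken from the tokenizer and the dataset so they stay consistent
model = lt.model.LlamaModel(
    lt.model.LlamaConfig(
        vocab_size=tokenizer.vocab_size,
        hidden_size=512,
        hidden_layers=8,
        num_heads=8,
        dropout=0.2,
        norm_epsilon=1e-6,
        max_seq_len=dataset.n_ctx,
    )
)

# Training hyperparameters: 5 epochs, batch size 4, fp16 on a CUDA device
train_config = lt.config.TrainingConfig(
    epochs=5,
    batch_size=4,
    learning_rate=1e-4,
    device="cuda",
    precision="fp16",
)

trainer = lt.trainer.Trainer(
    model=model,
    train_config=train_config,
    dataset=dataset,
    tokenizer=tokenizer,
    collate_fn=lt.utils.collate_fn,
    model_name="nano-llama",
)
trainer.train()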
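For readers unfamiliar with causal language-model datasets, the sketch below shows the usual idea behind a context length such as `n_ctx`: the token stream is cut into fixed-size windows, and each target sequence is the input shifted by one token. This is an illustration in plain Python under that assumption, not the implementation of `SimpleCausalDataset`; the function name `causal_windows` is hypothetical.

```python
def causal_windows(token_ids, n_ctx):
    """Split a token stream into (inputs, targets) pairs of length n_ctx.

    targets are inputs shifted left by one position, so the model at
    position i is trained to predict token i + 1.
    """
    pairs = []
    # Each window needs n_ctx + 1 tokens: n_ctx inputs plus one extra target.
    for start in range(0, len(token_ids) - n_ctx, n_ctx):
        chunk = token_ids[start:start + n_ctx + 1]
        pairs.append((chunk[:-1], chunk[1:]))
    return pairs


tokens = list(range(10))  # stand-in for tokenizer output
for inputs, targets in causal_windows(tokens, n_ctx=4):
    print(inputs, "->", targets)
# [0, 1, 2, 3] -> [1, 2, 3, 4]
# [4, 5, 6, 7] -> [5, 6, 7, 8]
```

Trailing tokens that do not fill a complete window are dropped in this sketch; padding them instead is an equally common choice.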

Pretrained Details

Once the model is trained, the pretrained directory will look like this:

nano-llama/
    ├── checkpoint-200/
    ├── train_config.yaml
    ├── model_config.yaml
    ├── pytorch_model.pt
    ├── VOCAB.model
    └── VOCAB.vocab
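Before running inference, a quick check that the output directory holds the files listed above can save a confusing load error. The sketch below uses only `pathlib`; the file names come from the listing, while the helper `missing_artifacts` is hypothetical and not part of LangTrain. The checkpoint directory is left out because its name presumably depends on the training step.

```python
from pathlib import Path

# File names taken from the directory listing above.
EXPECTED_FILES = (
    "train_config.yaml",
    "model_config.yaml",
    "pytorch_model.pt",
    "VOCAB.model",
    "VOCAB.vocab",
)


def missing_artifacts(model_dir):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_artifacts("nano-llama")
    print("complete" if not missing else f"missing: {missing}")
```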

Inference

import langtrain as lt

tokenizer = lt.tokenizer.Tokenizer.from_pretrained("nano-llama")
model = lt.model.LlamaModel.from_pretrained("nano-llama")
inputs = tokenizer.encode("Sherlock Holmes")
output = model.generate(inputs, eos_id=tokenizer.eos_token_id, max_new_tokens=50)
print(tokenizer.decode(output))
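The `eos_id` and `max_new_tokens` arguments follow the usual autoregressive generation pattern: tokens are appended one at a time until the end-of-sequence id appears or the token budget is used up. The sketch below illustrates that stopping logic with a toy next-token function. It is an assumption about typical behavior, not LangTrain's `generate` implementation, and whether the library includes the EOS token in its output is not stated in the source.

```python
def generate(next_token, prompt_ids, eos_id, max_new_tokens):
    """Append tokens one at a time until eos_id appears or the budget runs out.

    next_token is any callable mapping the current id list to the next id;
    in a real model this would be an argmax or a sample over the logits.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        token = next_token(ids)
        ids.append(token)
        if token == eos_id:
            break
    return ids


# Toy "model": counts upward from the last id; id 5 plays the role of EOS.
def count_up(ids):
    return ids[-1] + 1


print(generate(count_up, [1, 2], eos_id=5, max_new_tokens=50))  # [1, 2, 3, 4, 5]
print(generate(count_up, [1, 2], eos_id=99, max_new_tokens=2))  # [1, 2, 3, 4]
```

The first call stops early on the EOS id; the second never sees it and stops after two new tokens.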

More tutorials can be found here

Available Model Architectures to train

Architecture    Source
GPT             OpenAI GPT
LLaMA           Meta LLaMA
BERT            Google BERT
ViT             Vision Transformer

Download files

Source Distribution

langtrain-0.0.4.tar.gz (421.4 kB)

Built Distribution

langtrain-0.0.4-py3-none-any.whl (31.2 kB)
