
Language Model Training Library




A Python package for training language models from scratch in a few lines of code

LangTrain is a Python package that provides a simple interface for training large language models from scratch in a few lines of code.

Installation

Stable Version

pip install langtrain

Development Version

pip install git+https://github.com/sayedshaun/langtrain.git
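After installing, it can be useful to confirm which version ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of LangTrain.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("langtrain")
    if found:
        print(f"langtrain version: {found}")
    else:
        print("langtrain is not installed in this environment")
```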

Usage

Training

import langtrain as lt

data_path = "data_directory"
tokenizer = lt.tokenizer.Tokenizer(data_path, vocab_size=5000)
dataset = lt.dataset.CausalDataset(data_path, tokenizer, n_ctx=512)
model = lt.model.LlamaModel(
    lt.model.LlamaConfig(
        vocab_size=tokenizer.vocab_size,
        hidden_size=128,
        num_heads=4,
        num_layers=4,
        dropout=0.1,
        max_seq_len=512,  # assumption: must be at least the dataset's n_ctx
        norm_epsilon=1e-5
    )
)
trainer = lt.trainer.Trainer(
    config=lt.config.TrainingConfig(
        train_data=dataset,
        learning_rate=1e-4,
        epochs=5,
        batch_size=8,
        device="cuda",
        logging_steps=100,
        num_checkpoints=3,
        report_to_wandb=True,
        distributed_backend="ddp"
    ),
    model=model,
    tokenizer=tokenizer,
    model_name="nano-llama",
    collate_fn=lt.utils.collate_fn,
)
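# Report the number of trainable parameters
print(lt.utils.trainable_parameters(model))
# Optional: resume from a saved checkpoint (skip this line on a first run)
trainer.from_checkpoint("nano-llama/checkpoint-200")
trainer.train()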
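The `n_ctx` argument above sets the context length of each training example. How `CausalDataset` builds its examples internally is not documented here, so the following is only a generic sketch of the usual next-token setup: the token stream is cut into windows of `n_ctx` inputs, and the targets are the same window shifted by one position. The function name `causal_windows` is hypothetical.

```python
def causal_windows(token_ids, n_ctx):
    """Split a token stream into (inputs, targets) pairs for next-token prediction.

    Each pair holds n_ctx input tokens and n_ctx target tokens, where
    targets[i] is the token that follows inputs[i] in the stream.
    A trailing remainder too short to fill a whole window is dropped.
    """
    if n_ctx < 1:
        raise ValueError("n_ctx must be at least 1")
    pairs = []
    # Every window needs n_ctx + 1 tokens: n_ctx inputs plus the final target.
    for start in range(0, len(token_ids) - n_ctx, n_ctx):
        chunk = token_ids[start : start + n_ctx + 1]
        pairs.append((chunk[:-1], chunk[1:]))
    return pairs


if __name__ == "__main__":
    for inputs, targets in causal_windows(list(range(10)), n_ctx=4):
        print(inputs, "->", targets)
```

With ten tokens and `n_ctx=4`, this yields two windows; the last token has no full window of its own and is left out.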

Pretrained Details

Once the model is trained, the pretrained directory will look like this:

nano-llama/
    ├── checkpoint-200/
    ├── train_config.yaml
    ├── model_config.yaml
    ├── pytorch_model.pt
    ├── VOCAB.model
    └── VOCAB.vocab
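Before loading or resuming, it can help to check that a directory matches the layout shown above. This sketch uses only the standard library; the file names come from the tree above, while the helper names and the assumption that checkpoint folders are always named `checkpoint-<step>` are illustrative rather than part of LangTrain.

```python
import re
from pathlib import Path

# File names taken from the directory layout shown above.
EXPECTED_FILES = (
    "train_config.yaml",
    "model_config.yaml",
    "pytorch_model.pt",
    "VOCAB.model",
    "VOCAB.vocab",
)
# Assumed naming scheme for checkpoint folders: checkpoint-<step>.
CHECKPOINT_PATTERN = re.compile(r"checkpoint-(\d+)$")


def missing_files(model_dir):
    """Return the expected file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def latest_checkpoint(model_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    root = Path(model_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = CHECKPOINT_PATTERN.match(child.name)
        # Compare steps numerically so checkpoint-1000 ranks above checkpoint-200.
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


if __name__ == "__main__":
    print("missing:", missing_files("nano-llama"))
    print("latest checkpoint:", latest_checkpoint("nano-llama"))
```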

Inference

import langtrain as lt

tokenizer = lt.tokenizer.Tokenizer.from_pretrained("nano-llama")
model = lt.model.LlamaModel.from_pretrained("nano-llama")
inputs = tokenizer.encode("Sherlock Holmes")
output = model.generate(inputs, eos_id=tokenizer.eos_token_id, max_new_tokens=50)
print(tokenizer.decode(output))
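The `eos_id` and `max_new_tokens` arguments suggest the usual stopping rule for autoregressive generation: stop when the end-of-sequence token is produced or when the token budget runs out. Whether `model.generate` decodes greedily or by sampling, and whether its output includes the prompt, is not stated here. The sketch below assumes greedy decoding and uses a toy scoring function in place of a model; all names in it are hypothetical.

```python
def greedy_generate(next_token_scores, prompt_ids, eos_id, max_new_tokens):
    """Append the highest-scoring token until eos_id appears or the budget is spent.

    next_token_scores: callable taking the ids so far and returning one score
    per vocabulary id (a stand-in for a model's forward pass).
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = next_token_scores(ids)
        # Greedy choice: index of the largest score.
        next_id = max(range(len(scores)), key=scores.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids


def toy_scores(ids, vocab_size=5):
    """Toy stand-in for a model: always prefers (last id + 1) mod vocab_size."""
    target = (ids[-1] + 1) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


if __name__ == "__main__":
    print(greedy_generate(toy_scores, [0], eos_id=3, max_new_tokens=50))
```

In this toy run, generation stops as soon as id 3 is produced, well before the 50-token budget is reached.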

Available Model Architectures to train

Model Architecture    Source Repository
GPT                   OpenAI GPT
LLaMA                 Meta LLaMA
BERT                  Google BERT
ViT                   Vision Transformer

