# LLMBuilder - Professional Language Model Toolkit
A comprehensive toolkit for building, training, fine-tuning, and deploying GPT-style language models with advanced data processing capabilities and CPU-friendly defaults.
## About LLMBuilder
LLMBuilder is a production-ready framework for training and fine-tuning Large Language Models (LLMs). Designed for developers, researchers, and AI engineers, LLMBuilder provides a full pipeline to go from raw text data to deployable, optimized LLMs, all running locally on CPUs or GPUs.
## Quick Start

### Installation

```bash
pip install llmbuilder
```
### Initialize a New Project

```bash
# Create a new project with default structure
llmbuilder init my_llm_project

# Navigate to your project directory
cd my_llm_project
```
This creates a structured project with the following directories:
- `data/raw` - Place your raw input files here (.txt, .pdf, .docx)
- `data/processed` - Processed text files
- `tokenizer` - Tokenizer files
- `models/checkpoints` - Training checkpoints
- `models/final` - Final trained models
- `configs` - Configuration files
- `outputs` - Output files
And generates a README.md with quick start instructions:
```markdown
# my_llm_project

This is an LLM project created with LLMBuilder.

## Project Structure

- `data/` - Data files
- `tokenizer/` - Tokenizer files
- `models/` - Model checkpoints and final models
- `configs/` - Configuration files
- `outputs/` - Output files

## Quick Start

1. Prepare your data in `data/raw/`
2. Process data: `llmbuilder data load -i data/raw -o data/processed/input.txt`
3. Train tokenizer: `llmbuilder tokenizer train -i data/processed/input.txt -o tokenizer/`
4. Train model: `llmbuilder train model -d data/processed/input.txt -t tokenizer/ -o models/checkpoints/`
5. Generate text: `llmbuilder generate text -m models/checkpoints/latest.pt -t tokenizer/ -p "Your prompt here"`
```
## Documentation
Complete documentation is available at: https://qubasehq.github.io/llmbuilder-package/
The documentation includes:
- Getting Started Guide - From installation to your first model
- User Guides - Comprehensive guides for all features
- CLI Reference - Complete command-line interface documentation
- Python API - Full API reference with examples
- Examples - Working code examples for common tasks
- FAQ - Answers to frequently asked questions
## CLI Usage

### Getting Started

```bash
# Show help and available commands
llmbuilder --help

# Initialize a new project
llmbuilder init my_project

# Interactive welcome guide for new users
llmbuilder welcome

# Show package information and credits
llmbuilder info
```
### Configuration Management

```bash
# List available configuration templates
llmbuilder config templates

# Create a configuration from a template
llmbuilder config create --preset cpu_small -o configs/my_config.json

# Validate configuration with detailed reporting
llmbuilder config validate configs/my_config.json
```
### Data Processing Pipeline

```bash
# Process raw data files
llmbuilder data load -i data/raw -o data/processed/input.txt --clean

# Remove duplicates from your data
llmbuilder data deduplicate -i data/processed/input.txt -o data/processed/clean.txt --method both

# Train custom tokenizer
llmbuilder tokenizer train -i data/processed/clean.txt -o tokenizer/ --vocab-size 16000
```
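The `data deduplicate` command above supports exact and semantic matching (`--method both`). As a rough illustration of the exact half, and not LLMBuilder's actual implementation, hash-based deduplication can be sketched as:

```python
import hashlib

def dedupe_exact(lines):
    """Keep the first copy of each line, treating lines that differ only in
    whitespace or case as duplicates."""
    seen = set()
    kept = []
    for line in lines:
        # Normalize whitespace and case so trivial variants collapse together
        key = hashlib.sha256(" ".join(line.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept

sample = ["Hello world", "hello   WORLD", "Something else"]
print(dedupe_exact(sample))  # the second line is dropped as a normalized duplicate
```

Semantic deduplication goes further and also catches paraphrases, which is why the combined method tends to shrink noisy corpora the most.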
### Model Training & Operations

```bash
# Train model
llmbuilder train model -d data/processed/clean.txt -t tokenizer/ -o models/checkpoints

# Interactive text generation setup
llmbuilder generate text --setup

# Generate text with custom parameters
llmbuilder generate text -m models/checkpoints/latest.pt -t tokenizer/ -p "Hello world" --temperature 0.8 --max-tokens 100
```
### Model Export

```bash
# Convert to GGUF format for deployment
llmbuilder export gguf models/checkpoints/latest.pt -o models/final/model.gguf -q Q8_0
```
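Quantization levels such as `Q8_0` trade file size for precision by storing weights in fewer bits. As a toy illustration of the idea behind 8-bit quantization (not the actual GGUF encoding), each block of weights can be mapped to signed bytes plus one float scale:

```python
def quantize_q8(weights):
    """Map floats to int8-range values using a single per-block scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]  # each value now fits in [-127, 127]
    return q, scale

def dequantize_q8(q, scale):
    """Recover approximate floats from the quantized values."""
    return [v * scale for v in q]

q, s = quantize_q8([0.5, -1.0, 0.25])
restored = dequantize_q8(q, s)  # close to the originals, within quantization error
```

This is why higher-bit levels like `F16` preserve more fidelity while `Q4_0` produces the smallest files.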
## Python API

```python
import llmbuilder as lb

# Load a preset config and build a model
cfg = lb.load_config(preset="cpu_small")
model = lb.build_model(cfg.model)

# Train (example; see examples/train_tiny.py for a runnable script)
from llmbuilder.data import TextDataset
dataset = TextDataset("./data/clean.txt", block_size=cfg.model.max_seq_length)
results = lb.train_model(model, dataset, cfg.training)

# Generate text
text = lb.generate_text(
    model_path="./checkpoints/model.pt",
    tokenizer_path="./tokenizers",
    prompt="Hello world",
    max_new_tokens=50,
)
print(text)
```
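For orientation, a block-style dataset like `TextDataset` conceptually slices one long token stream into fixed-length windows, pairing each window with its one-token-shifted target. A minimal sketch of that idea (hypothetical, not the library's code):

```python
class ToyBlockDataset:
    """Yield fixed-size (input, target) windows from a token-id stream."""

    def __init__(self, token_ids, block_size):
        self.ids = token_ids
        self.block_size = block_size

    def __len__(self):
        # Each sample needs block_size inputs plus one extra token for targets
        return max(0, len(self.ids) - self.block_size)

    def __getitem__(self, i):
        x = self.ids[i : i + self.block_size]
        y = self.ids[i + 1 : i + 1 + self.block_size]  # next-token targets
        return x, y

ds = ToyBlockDataset(list(range(10)), block_size=4)
x, y = ds[0]  # x == [0, 1, 2, 3], y == [1, 2, 3, 4]
```

This is why `block_size` is tied to `cfg.model.max_seq_length`: windows longer than the model's maximum sequence length could not be fed through in one pass.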
## Configuration Management

LLMBuilder provides flexible configuration management:

```bash
# List available templates
llmbuilder config templates

# Create a configuration from a template
llmbuilder config create --preset cpu_small -o configs/my_config.json

# Validate your configuration
llmbuilder config validate configs/my_config.json
```
## System Requirements
- Python 3.8 or higher
- For PDF OCR Processing: Tesseract OCR
- For GGUF Model Conversion: llama.cpp or compatible tools
## Troubleshooting

### Installation Issues
#### Missing Optional Dependencies

```bash
# Check what's installed
python -c "import llmbuilder; print('LLMBuilder installed')"

# Install missing dependencies
pip install pymupdf pytesseract ebooklib beautifulsoup4 lxml sentence-transformers

# Verify specific features
python -c "import pytesseract; print('OCR available')"
python -c "import sentence_transformers; print('Semantic deduplication available')"
```
#### System Dependencies

```bash
# Tesseract OCR (for PDF processing)
# Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki
# macOS: brew install tesseract
# Ubuntu: sudo apt-get install tesseract-ocr tesseract-ocr-eng

# Verify Tesseract installation
tesseract --version
python -c "import pytesseract; pytesseract.get_tesseract_version()"
```
### Processing Issues

#### PDF Processing Problems

```bash
# Enable debug logging
export LLMBUILDER_LOG_LEVEL=DEBUG

# Common fixes:
# 1. Install language packs: sudo apt-get install tesseract-ocr-eng
# 2. Set Tesseract path: export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata
# 3. Lower OCR threshold: --ocr-threshold 0.3
```
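An OCR threshold of this kind typically controls when a page falls back to OCR: if the directly extractable text layer is too sparse relative to what a page of text would normally contain, the page is treated as scanned. A hedged sketch of such a decision rule (the flag's real heuristic may differ):

```python
def needs_ocr(extracted_chars, expected_chars, threshold=0.3):
    """Fall back to OCR when the extracted text layer covers less than
    `threshold` of the text expected on a typical page."""
    if expected_chars <= 0:
        return True  # nothing extractable at all; OCR is the only option
    return extracted_chars / expected_chars < threshold

needs_ocr(extracted_chars=50, expected_chars=2000)    # sparse layer -> True
needs_ocr(extracted_chars=1800, expected_chars=2000)  # mostly real text -> False
```

Under this reading, lowering the threshold (e.g. to 0.3) makes the pipeline more willing to trust a thin text layer instead of invoking OCR.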
#### Memory Issues with Large Datasets

```bash
# Use configuration to optimize memory usage
llmbuilder config from-template cpu_optimized_config -o memory_config.json \
  --override data.ingestion.batch_size=50 \
  --override data.deduplication.batch_size=500 \
  --override data.deduplication.use_gpu_for_embeddings=false

# Process in smaller chunks
llmbuilder data load -i large_dataset/ -o processed.txt --batch-size 25 --workers 2
```
#### Semantic Deduplication Performance

```bash
# GPU issues - disable GPU acceleration
llmbuilder data deduplicate -i dataset.txt -o clean.txt --method semantic --no-gpu

# Slow processing - increase batch size
llmbuilder data deduplicate -i dataset.txt -o clean.txt --method semantic --batch-size 2000

# Memory issues - reduce embedding cache
llmbuilder config from-template basic_config -o config.json \
  --override data.deduplication.embedding_cache_size=5000
```
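Semantic deduplication compares embedding vectors rather than raw strings: texts whose embeddings point in nearly the same direction are treated as duplicates, which is why embedding batch sizes and caches dominate its memory and speed. A self-contained sketch of the core comparison, using toy 2-D vectors in place of real sentence-transformers embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def dedupe_semantic(embeddings, threshold=0.95):
    """Return indices of embeddings that are not near-duplicates of an
    already-kept embedding (greedy first-wins strategy)."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept

vecs = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
dedupe_semantic(vecs)  # drops the second vector as a near-duplicate of the first
```

The pairwise comparisons explain the cost: larger corpora mean more embeddings to hold and compare, which the batch-size and cache overrides above are tuning.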
### GGUF Conversion Issues

#### Missing llama.cpp

```bash
# Install llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make

# Add to PATH or specify location
export PATH=$PATH:/path/to/llama.cpp

# Alternative: Use Python package
pip install llama-cpp-python

# Test conversion
llmbuilder convert gguf --help
```
#### Conversion Failures

```bash
# Check available conversion scripts
llmbuilder convert gguf model.pt -o test.gguf --verbose

# Try different quantization levels
llmbuilder convert gguf model.pt -o test.gguf -q F16   # Less compression
llmbuilder convert gguf model.pt -o test.gguf -q Q8_0  # Balanced

# Increase timeout for large models
llmbuilder config from-template basic_config -o config.json \
  --override gguf_conversion.conversion_timeout=7200
```
### Configuration Issues

#### Validation Errors

```bash
# Validate configuration with detailed output
llmbuilder config validate my_config.json --detailed

# Common fixes:
# 1. Vocab size mismatch - ensure model.vocab_size matches tokenizer_training.vocab_size
# 2. Sequence length issues - ensure data.max_length <= model.max_seq_length
# 3. Invalid quantization level - use: F32, F16, Q8_0, Q5_1, Q5_0, Q4_1, Q4_0
```
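The three consistency rules above can also be checked programmatically before a run. A hypothetical sketch (field names are assumed from the messages above, not taken from the actual schema):

```python
VALID_QUANT = {"F32", "F16", "Q8_0", "Q5_1", "Q5_0", "Q4_1", "Q4_0"}

def check_config(cfg):
    """Return a list of human-readable problems; an empty list means all
    three consistency checks pass."""
    problems = []
    if cfg["model"]["vocab_size"] != cfg["tokenizer_training"]["vocab_size"]:
        problems.append("model.vocab_size must match tokenizer_training.vocab_size")
    if cfg["data"]["max_length"] > cfg["model"]["max_seq_length"]:
        problems.append("data.max_length must not exceed model.max_seq_length")
    if cfg["gguf_conversion"]["quantization_level"] not in VALID_QUANT:
        problems.append("quantization level must be one of " + ", ".join(sorted(VALID_QUANT)))
    return problems

cfg = {
    "model": {"vocab_size": 16000, "max_seq_length": 512},
    "tokenizer_training": {"vocab_size": 16000},
    "data": {"max_length": 512},
    "gguf_conversion": {"quantization_level": "Q8_0"},
}
check_config(cfg)  # -> [] (all three checks pass)
```

Running such a pre-flight check alongside `llmbuilder config validate` catches the most common mismatches before any compute is spent.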
#### Template Issues

```bash
# List available templates
llmbuilder config templates

# Create from working template
llmbuilder config from-template basic_config -o working_config.json

# Validate before use
llmbuilder config validate working_config.json
```
## License
Apache-2.0