LLMBuilder
Complete LLM Training and Deployment Pipeline with CLI
Created by Qubase
A comprehensive, production-ready implementation for training and fine-tuning Large Language Models from scratch. This project provides an advanced pipeline with enhanced document ingestion, intelligent deduplication, model training, automated GGUF conversion, and comprehensive testing - all optimized for both CPU and GPU environments.
Table of Contents
- Key Features
- System Requirements
- Installation
- Quick Start
- Project Structure
- Fine-tuning
- Text Generation
- Configuration
- Advanced Usage
- Monitoring
- Performance Optimization
- Troubleshooting
- Model Architecture
- Pre-trained Models
- License
- Contributing
Key Features
Enhanced Data Processing
- Multi-Format Document Ingestion: HTML, EPUB, PDF (with OCR), Markdown, DOCX, TXT
- Intelligent Deduplication: Hash-based exact + embedding-based semantic duplicate removal
- OCR Support: Automatic fallback for scanned PDFs using Tesseract
- Advanced Text Cleaning: BeautifulSoup HTML processing, metadata extraction
Advanced Training Pipeline
- End-to-End Workflow: From raw documents to production-ready models
- Multiple Tokenizer Options: HuggingFace Tokenizers + SentencePiece CLI integration
- CPU/GPU Optimization: Efficient multi-threaded training with mixed precision
- Modern GPT Architecture: Transformer implementation with latest optimizations
Production-Ready Export
- Automated GGUF Conversion: Multiple quantization levels (f16, q8_0, q4_0)
- Quality Validation: Comprehensive model validation and quality scoring
- Batch Processing: Parallel conversion with error recovery
- llama.cpp Compatibility: Direct integration with inference engines
Developer Experience
- Comprehensive Testing: Automated test suite with pytest integration
- Robust Error Handling: Detailed logging and recovery mechanisms
- Modular Architecture: Clean, maintainable, extensible codebase
- Cross-Platform: Windows PowerShell + Linux/macOS Bash scripts
System Requirements
Minimum Requirements
- Python: 3.8 or higher
- RAM: 4GB minimum (8GB+ recommended for large datasets)
- Storage: 5GB+ free disk space
- OS: Windows 10+, Linux, or macOS
Additional Dependencies
- Tesseract OCR: For PDF OCR processing (see INSTALL_TESSERACT.md)
- Git: For repository management
- Optional: CUDA-compatible GPU for accelerated training
Installation
1. Clone the repository:
   git clone <repository-url>
   cd LLMBuilder

2. Create and activate a virtual environment:
   # Linux/macOS
   python -m venv venv
   source venv/bin/activate
   # Windows
   python -m venv venv
   .\venv\Scripts\activate

3. Install dependencies:
   pip install -r requirements.txt

4. Install Tesseract OCR (for PDF processing):
   # Ubuntu/Debian
   sudo apt-get install tesseract-ocr
   # macOS
   brew install tesseract
   # Windows - see INSTALL_TESSERACT.md for detailed instructions

5. Verify the installation:
   python -c "import torch; print('PyTorch:', torch.__version__)"
   tesseract --version
System Preparation
System Requirements Check
Before starting, ensure your system meets the requirements:
# Linux/macOS
free -h # Check available memory
df -h # Check disk space
nproc # Check CPU cores
# Windows
# Use Task Manager → Performance → Memory/Disk
# Check CPU cores in System Information
Recommended Workflow
- Start with a small dataset (100MB) to test the pipeline
- Monitor system resources during initial runs
- Use checkpoints - training progress is saved automatically
- Check logs in logs/training.log for any issues
Real-time Monitoring:
# Linux/Mac: Monitor system resources
htop
# Windows: Use Task Manager or Resource Monitor
Getting Started
For detailed instructions, see the Complete Usage Guide (USAGE.md), which includes:
- Step-by-step walkthroughs with example outputs
- Advanced configuration options for all components
- Troubleshooting guide with common solutions
- Performance optimization tips
- Platform-specific commands for Windows/Linux/macOS
- Integration examples with other tools
Project Structure
LLMBuilder/
├── data/                          # Data directories
│   ├── raw/                       # Raw input files (all formats)
│   ├── cleaned/                   # Processed text files
│   ├── deduped/                   # Deduplicated content
│   ├── tokens/                    # Tokenized datasets
│   ├── finetune/                  # Fine-tuning datasets
│   ├── ingest.py                  # Enhanced document ingester
│   ├── dedup.py                   # Deduplication system
│   ├── download_data.py           # Script to download datasets
│   ├── SOURCES.md                 # Data sources documentation
│   └── README_INGESTION.md        # Ingestion documentation
│
├── debug_scripts/                 # Debugging utilities
│   ├── debug_loader.py            # Data loading debugger
│   ├── debug_training.py          # Training process debugger
│   └── debug_timestamps.py        # Timing analysis
│
├── eval/                          # Model evaluation
│   └── eval.py                    # Evaluation scripts
│
├── exports/                       # Output directories
│   ├── checkpoints/               # Training checkpoints
│   ├── gguf/                      # GGUF model exports
│   ├── onnx/                      # ONNX model exports
│   └── tokenizer/                 # Saved tokenizer files
│
├── finetune/                      # Fine-tuning scripts
│   ├── finetune.py                # Fine-tuning implementation
│   └── __init__.py                # Package initialization
│
├── logs/                          # Training and evaluation logs
│
├── model/                         # Model architecture
│   └── gpt_model.py               # GPT model implementation
│
├── scripts/                       # Enhanced processing scripts
│   ├── run_ingestion.py           # Document ingestion CLI
│   ├── enhanced_preprocess.py     # Advanced preprocessing
│   ├── train_sentencepiece.py     # SentencePiece training
│   └── test_deduplication.py      # Deduplication testing
│
├── tests/                         # Comprehensive test suite
│   ├── test_ingestion.py          # Document ingestion tests
│   ├── test_deduplication.py      # Deduplication tests
│   ├── test_conversion_pipeline.py # GGUF conversion tests
│   ├── test_tokenizer_trainer.py  # Tokenizer tests
│   └── ... (many more test files)
│
├── tools/                         # Utility scripts
│   ├── analyze_data.ps1           # PowerShell data analysis
│   ├── analyze_data.sh            # Bash data analysis
│   ├── download_hf_model.py       # HuggingFace model downloader
│   ├── export_gguf.py             # Enhanced GGUF export utility
│   ├── conversion_pipeline.py     # Automated GGUF conversion
│   └── quantization_manager.py    # Advanced quantization
│
├── training/                      # Training pipeline
│   ├── dataset.py                 # Dataset handling
│   ├── preprocess.py              # Data preprocessing
│   ├── quantization.py            # Model quantization
│   ├── train.py                   # Main training script
│   ├── train_tokenizer.py         # Enhanced tokenizer training
│   └── utils.py                   # Training utilities
│
├── .gitignore                     # Git ignore rules
├── config.json                    # Main configuration
├── config_cpu_small.json          # Small CPU config
├── config_gpu.json                # GPU configuration
├── inference.py                   # Inference script
├── quantize_model.py              # Model quantization
├── README.md                      # This file
├── PIPELINE_UPDATES.md            # Recent updates summary
├── INSTALL_TESSERACT.md           # OCR installation guide
├── requirements.txt               # Python dependencies
├── run.ps1                        # Enhanced PowerShell runner
└── run.sh                         # Enhanced Bash runner
Quick Start
1. Prepare Your Data
Enhanced Document Support
Place your documents in data/raw/. The system now supports:
- Plain text files (.txt)
- PDF files (.pdf) - with automatic OCR for scanned documents
- Word documents (.docx)
- Web content (.html)
- E-books (.epub)
- Markdown (.md)
Option 1: Download Sample Data
# Download sample corpus
python data/download_data.py --corpus
# Or download specific topics
python data/download_data.py --topic literature --count 5
python data/download_data.py --topic technology --count 3
Available topics: literature, science, technology, business, health, education
Option 2: Use Your Own Data
Simply place your documents in data/raw/ - the enhanced ingestion pipeline will automatically:
- Detect file formats
- Extract text with appropriate methods
- Handle OCR for scanned PDFs
- Clean and normalize content
2. Run the Pipeline
Linux/macOS:
chmod +x run.sh
./run.sh
Windows (PowerShell):
.\run.ps1
3. Run Specific Stages
The enhanced pipeline includes new stages for better data processing:
# NEW: Enhanced document ingestion
./run.sh ingest
# NEW: Intelligent deduplication
./run.sh dedup
# Traditional preprocessing (optional)
./run.sh preprocess
# Train tokenizer
./run.sh tokenizer
# Train model
./run.sh train
# Evaluate model
./run.sh eval
# Fine-tune existing model
./run.sh finetune
# Interactive text generation
./run.sh inference
# NEW: Convert to GGUF format
./run.sh gguf
# NEW: Run comprehensive tests
./run.sh test
Windows PowerShell Examples:
# Enhanced document processing
.\run.ps1 -Stage ingest
# Run deduplication
.\run.ps1 -Stage dedup
# Complete pipeline
.\run.ps1 -Stage all
# Convert to GGUF
.\run.ps1 -Stage gguf
Enhanced Pipeline Stages
Document Ingestion (ingest)
Advanced document processing with multi-format support:
# Process all supported formats with OCR
./run.sh ingest
# With custom options
python scripts/run_ingestion.py \
--input data/raw \
--output data/cleaned \
--ocr-lang eng fra deu \
--max-size 50 \
--recursive
Features:
- HTML Processing: BeautifulSoup-based cleaning, removes scripts/styles
- EPUB Support: Full e-book text extraction with metadata
- PDF with OCR: Automatic fallback to Tesseract for scanned documents
- Markdown Processing: Advanced parsing with table/code block support
- Progress Tracking: Real-time processing statistics
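At its core, ingestion is a per-format dispatch. The following is a simplified sketch of that idea; the real ingester in data/ingest.py also covers PDF/EPUB/DOCX extraction, Tesseract OCR fallback, and error recovery:

from pathlib import Path
from bs4 import BeautifulSoup

# Simplified per-format text extraction (illustrative only).
def extract_text(path: Path) -> str:
    suffix = path.suffix.lower()
    if suffix in {".txt", ".md"}:
        return path.read_text(encoding="utf-8", errors="ignore")
    if suffix in {".html", ".htm"}:
        soup = BeautifulSoup(path.read_text(encoding="utf-8", errors="ignore"), "html.parser")
        for tag in soup(["script", "style"]):  # strip non-content markup
            tag.decompose()
        return soup.get_text(separator="\n")
    raise ValueError(f"Format {suffix} is handled by a dedicated extractor")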
Intelligent Deduplication (dedup)
Remove exact and near-duplicate content to improve training quality:
# Run both hash and embedding deduplication
./run.sh dedup
# Custom similarity threshold
python data/dedup.py \
--input-dir data/cleaned \
--output-dir data/deduped \
--similarity-threshold 0.85
Methods:
- Hash-based: Exact duplicate detection with text normalization
- Embedding-based: Semantic similarity using sentence-transformers
- Quality Preservation: Keeps highest quality version of duplicates
- Statistics: Detailed reporting of removed content
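As a rough illustration, the hash-based pass boils down to normalize, hash, keep the first occurrence; the embedding pass then compares the survivors with sentence-transformers cosine similarity against the configured threshold. A minimal sketch of the exact-duplicate pass:

import hashlib

# Normalize case/whitespace, hash, and keep only first occurrences.
def dedup_exact(texts):
    seen, kept = set(), []
    for text in texts:
        normalized = " ".join(text.lower().split())  # collapse case and whitespace
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept

print(dedup_exact(["Hello   World", "hello world", "something else"]))
# -> ['Hello   World', 'something else']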
GGUF Conversion (gguf)
Automated conversion to GGUF format for production deployment:
# Convert with multiple quantization levels
./run.sh gguf
# Custom quantization options
python tools/conversion_pipeline.py \
exports/checkpoints/best_model.pt \
exports/gguf \
--quantization f16 q8_0 q4_0 q4_1 \
--tokenizer exports/tokenizer
Features:
- Multiple Quantization: f16, q8_0, q4_0, q4_1, q5_0, q5_1
- Quality Validation: Automatic validation and quality scoring
- Batch Processing: Parallel conversion with error recovery
- Metadata Preservation: Complete model metadata in GGUF format
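Once exported, the GGUF files can be loaded by any llama.cpp-based runtime. For example, with llama-cpp-python (pip install llama-cpp-python; the file name below is illustrative):

from llama_cpp import Llama

# Load a quantized export and generate a short completion.
llm = Llama(model_path="exports/gguf/best_model-q4_0.gguf", n_ctx=256)
result = llm("Once upon a time", max_tokens=64, temperature=0.8)
print(result["choices"][0]["text"])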
Comprehensive Testing (test)
Automated test suite for quality assurance:
# Run all tests
./run.sh test
# Run specific test categories
python -m pytest tests/test_ingestion.py -v
python -m pytest tests/test_deduplication.py -v
python -m pytest tests/test_conversion_pipeline.py -v
Fine-tuning
To fine-tune the model on your own data:
1. Place your training files in data/finetune/ - the system will automatically use the latest checkpoint
2. Run the fine-tuning script:
   python finetune/finetune.py \
       --config config.json \
       --pretrained-model exports/checkpoints/latest.pt \
       --train-data data/finetune/ \
       --tokenizer-dir exports/tokenizer/
3. Fine-tuned models are saved to exports/checkpoints/finetuned/
Fine-tuning Configuration
You can customize fine-tuning by modifying these parameters:
finetune:
learning_rate: 0.0001 # Lower than training LR
batch_size: 4 # Adjust based on GPU memory
num_epochs: 3 # Number of fine-tuning epochs
warmup_steps: 100 # Learning rate warmup steps
Monitoring Fine-tuning
Monitor the fine-tuning process with:
tensorboard --logdir=exports/logs/finetune/
Text Generation
Run interactive text generation:
python inference.py --interactive
Options:
- --temperature: Controls randomness (0.0-1.0)
- --top_k: Limit sampling to the top-k predictions
- --top_p: Nucleus sampling threshold
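Conceptually, the three flags shape the next-token distribution as in this generic decoding sketch (standard sampling logic, not necessarily the exact code inside inference.py):

import torch

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.9):
    logits = logits / max(temperature, 1e-5)           # sharpen or flatten the distribution
    if top_k > 0:                                      # drop everything below the k-th best logit
        kth_best = torch.topk(logits, top_k).values[-1]
        logits[logits < kth_best] = float("-inf")
    probs = torch.softmax(logits, dim=-1)
    if top_p < 1.0:                                    # nucleus: smallest set with mass >= top_p
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        sorted_probs[cumulative - sorted_probs > top_p] = 0.0
        probs = torch.zeros_like(probs).scatter_(-1, sorted_idx, sorted_probs)
        probs = probs / probs.sum()                    # renormalize the kept mass
    return torch.multinomial(probs, num_samples=1)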
Configuration
This project includes multiple configuration files optimized for different hardware setups. Choose the one that best matches your environment:
Available Configurations
1. config.json - Balanced configuration for standard CPUs
   - Moderate model size
   - Good balance between speed and quality
   - Works well on most modern laptops/desktops

2. config_gpu.json - Optimized for GPU training
   - Larger model capacity
   - Mixed precision training
   - Gradient accumulation
   - Best for NVIDIA GPUs with 8GB+ VRAM

3. config_cpu_small.json - For very limited CPUs
   - Minimal memory footprint
   - Smaller model size
   - Reduced sequence length
   - Ideal for testing or low-resource environments
Configuration Options
Model Architecture
model:
vocab_size: 16000 # Vocabulary size
embedding_dim: 384 # Size of token embeddings
num_layers: 6 # Number of transformer layers
num_heads: 6 # Number of attention heads
hidden_dim: 1536 # Size of feedforward layers
max_seq_length: 256 # Maximum sequence length
dropout: 0.1 # Dropout rate
use_bias: true # Use bias in linear layers
tie_weights: true # Tie input/output embeddings
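For a back-of-the-envelope size check, the weight matrices above imply roughly 17M parameters. A sketch of the arithmetic (ignoring biases and layer norms, and assuming the tied output head):

cfg = {"vocab_size": 16000, "embedding_dim": 384, "num_layers": 6,
       "hidden_dim": 1536, "max_seq_length": 256}

emb = cfg["vocab_size"] * cfg["embedding_dim"]        # token embeddings (~6.1M)
pos = cfg["max_seq_length"] * cfg["embedding_dim"]    # positional embeddings (~0.1M)
attn = 4 * cfg["embedding_dim"] ** 2                  # Q, K, V, and output projections
ffn = 2 * cfg["embedding_dim"] * cfg["hidden_dim"]    # up- and down-projection matrices
total = emb + pos + cfg["num_layers"] * (attn + ffn)  # output head is tied to emb

print(f"~{total / 1e6:.1f}M parameters")              # ~16.9M for this config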
Training Settings
training:
batch_size: 8 # Training batch size
learning_rate: 0.0002 # Learning rate
weight_decay: 0.01 # Weight decay for regularization
num_epochs: 10 # Number of training epochs
warmup_steps: 1000 # Warmup steps for learning rate
gradient_clip_norm: 1.0 # Gradient clipping
save_every: 1000 # Save checkpoint every N steps
eval_every: 500 # Evaluate every N steps
log_every: 10 # Log metrics every N steps
num_workers: 4 # Data loading workers
pin_memory: true # Pin memory for faster transfer
prefetch_factor: 2 # Batches to prefetch
use_mixed_precision: false # Enable mixed precision
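The warmup_steps setting corresponds to a schedule along these lines (a generic linear-warmup sketch; the exact schedule in training/train.py may differ):

import torch

model = torch.nn.Linear(10, 10)   # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.01)

warmup_steps = 1000

def lr_scale(step: int) -> float:
    # Ramp the LR linearly from ~0 to the configured value, then hold.
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_scale)
# In the training loop: optimizer.step(), then scheduler.step(), once per batch.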
Device Configuration
device:
use_cuda: false # Use CUDA if available
cuda_device: 0 # CUDA device index
use_mps: false # Use MPS on Apple Silicon
cpu_threads: 0 # Number of CPU threads (0 = all)
enable_mkldnn: true # Enable MKLDNN acceleration
mixed_precision: false # Global mixed precision flag
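Resolving this block to a torch.device typically looks like the following (assumed logic, not a verbatim excerpt from the training code):

import torch

def resolve_device(cfg: dict) -> torch.device:
    if cfg.get("use_cuda") and torch.cuda.is_available():
        return torch.device(f"cuda:{cfg.get('cuda_device', 0)}")
    if cfg.get("use_mps") and torch.backends.mps.is_available():
        return torch.device("mps")
    if cfg.get("cpu_threads", 0) > 0:       # 0 means "let PyTorch use all cores"
        torch.set_num_threads(cfg["cpu_threads"])
    return torch.device("cpu")

device = resolve_device({"use_cuda": False, "use_mps": False, "cpu_threads": 0})
print(device)  # cpu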
Choosing the Right Configuration
1. For GPU training, use config_gpu.json:
   python training/train.py --config config_gpu.json

2. For standard CPU training, use config.json:
   python training/train.py --config config.json

3. For low-end CPUs, use config_cpu_small.json:
   python training/train.py --config config_cpu_small.json
Custom Configuration
1. Copy an existing config file:
   cp config.json my_config.json
2. Edit the parameters as needed
3. Use your custom config:
   python training/train.py --config my_config.json
Important Notes
- Larger batch_size and max_seq_length require more memory
- num_workers should be ≤ the number of CPU cores
- Enable mixed_precision for GPUs with Tensor Cores (Volta, Turing, Ampere, etc.)
- For small GPUs, reduce batch_size and enable gradient_accumulation_steps
- For very small CPUs, reduce num_layers, embedding_dim, and hidden_dim
Debugging
The project includes several debugging scripts in the debug_scripts/ directory to help diagnose issues:
Available Debug Scripts
1. debug_loader.py
   - Tests and profiles the data loading pipeline
   - Helps identify bottlenecks in data loading
   - Usage: python debug_scripts/debug_loader.py --config config.json

2. debug_training.py
   - Runs a minimal training loop with extensive logging
   - Verifies the model can complete a forward/backward pass
   - Usage: python debug_scripts/debug_training.py --config config.json --max-steps 10

3. debug_timestamps.py
   - Profiles different components of the training loop
   - Helps identify slow operations
   - Usage: python debug_scripts/debug_timestamps.py --config config.json
Debugging Tips
1. Reduced Test Case
   - Use a tiny dataset with --max-samples 10
   - Set num_workers=0 to simplify data loading
   - Reduce batch_size and max_seq_length

2. Common Issues
   - CUDA Out of Memory: Reduce batch_size or model dimensions
   - Slow Training: Check data loading with debug_loader.py
   - NaN/Inf Losses: Try gradient clipping and a lower learning rate

3. Verbose Logging
   python training/train.py --config config.json --log-level DEBUG

4. Memory Profiling
   python -m memory_profiler training/train.py --config config.json
Advanced Usage
CPU Optimization
Optimize for CPU training with:
- Multi-threading
- Memory efficiency
- Gradient accumulation
- MKLDNN acceleration
Data Processing
Example custom preprocessing:
from training.preprocess import DataPreprocessor
processor = DataPreprocessor(
min_length=100, # Min text length
max_length=500000, # Max text length
remove_urls=True, # Clean URLs
remove_emails=True, # Clean emails
normalize_whitespace=True
)
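A possible call pattern follows; the method name process is a guess, so check training/preprocess.py for the real entry point:

# Hypothetical usage -- `process` is an assumed method name.
raw = "Contact me at someone@example.com or visit https://example.com.   Done."
clean = processor.process(raw)  # assumed to strip URLs/emails and normalize spacing
print(clean)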
Training API
from training.train import Trainer
# Initialize trainer with JSON config
trainer = Trainer(config_path="config.json")
# Start training
trainer.train()
# Example with custom settings
custom_trainer = Trainer(
config_path="config.json",
train_data_dir="data/processed/train",
val_data_dir="data/processed/val",
output_dir="exports/models/custom_run"
)
custom_trainer.train()
Configuration Options:
- config_path: Path to JSON config file (e.g., config.json)
- train_data_dir: Directory containing training data (overrides config)
- val_data_dir: Directory containing validation data (overrides config)
- output_dir: Directory to save checkpoints and logs (overrides config)
Training Monitoring
Logs
- Console: Real-time progress
- File: logs/training.log
- Metrics: logs/training_history.json
Checkpoints
- checkpoint_epoch_N.pt: Regular saves
- best_model.pt: Best validation score
- latest.pt: Most recent checkpoint
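Resuming from latest.pt follows the usual PyTorch pattern. The checkpoint key names below are assumptions; inspect a real file with torch.load(path).keys() first:

import torch

model = torch.nn.Linear(8, 8)                    # stand-in for the GPT model
optimizer = torch.optim.AdamW(model.parameters())

ckpt = torch.load("exports/checkpoints/latest.pt", map_location="cpu")
model.load_state_dict(ckpt["model_state_dict"])           # assumed key
optimizer.load_state_dict(ckpt["optimizer_state_dict"])   # assumed key
start_epoch = ckpt.get("epoch", 0) + 1                    # resume the epoch counter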
Performance Optimization
CPU Training
- Batch size: 8-32 (adjust for RAM)
- Use all CPU cores
- Enable gradient accumulation
- Try mixed precision if available
Memory Management
- Reduce block_size (128-256)
- Decrease batch_size
- Use smaller model dimensions
Speed Improvements
- Increase batch_size (if RAM allows)
- Use a larger block_size for more context
- Multiple data files improve shuffling
Troubleshooting
Common Issues
1. Out of Memory
   - Reduce batch_size in config.json
   - Decrease block_size or model size
   - Close other applications

2. No Training Data
   - Check the data/raw/ directory
   - Supported formats: .txt, .pdf, .docx
   - Verify file permissions

3. Slow Training
   - Optimize CPU thread count
   - Reduce model size
   - Monitor system resources

4. Import Errors
   pip install -r requirements.txt
   python --version  # Requires 3.8+
Check logs/ for detailed error messages.
Model Architecture
GPT-style transformer with:
- Multi-head self-attention
- GELU activation
- Pre-norm layer normalization
- Learned positional embeddings
- Weight-tied embeddings
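A minimal pre-norm block matching that description (illustrative; see model/gpt_model.py for the actual implementation):

import torch
import torch.nn as nn

class Block(nn.Module):
    """One pre-norm transformer block: LayerNorm before each sub-layer."""

    def __init__(self, dim=384, heads=6, hidden=1536, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(),
            nn.Linear(hidden, dim), nn.Dropout(dropout),
        )

    def forward(self, x):
        t = x.size(1)
        # Causal mask: True above the diagonal = cannot attend to future tokens.
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out                  # residual around attention
        return x + self.mlp(self.ln2(x))  # residual around feedforward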
Default Specs
- Parameters: ~50M
- Layers: 12
- Heads: 12
- Embedding: 768D
- Context: 512 tokens
- Vocabulary: 16K BPE
Recent Updates
Latest Features (see PIPELINE_UPDATES.md)
- Enhanced Document Ingestion: Multi-format support with OCR
- Intelligent Deduplication: Hash + embedding-based duplicate removal
- Automated GGUF Conversion: Production-ready model export
- Comprehensive Testing: Full test suite with pytest
- Cross-platform Scripts: Enhanced PowerShell and Bash runners
Future Enhancements
- Distributed Training: Multi-GPU and multi-node support
- Web Interface: Real-time monitoring dashboard
- More Architectures: LLaMA, BERT, and custom models
- Cloud Integration: AWS/GCP/Azure deployment
- Advanced Optimizations: Dynamic quantization, pruning
Pre-trained Models
Download models from HuggingFace:
python tools/download_hf_model.py \
--model Qwen/Qwen2.5-Coder-0.5B \
--output-dir ./models/Qwen2.5-Coder-0.5B
License
MIT Licensed. See LICENSE for details.
Contributing
Contributions welcome! Please submit PRs or open issues.
Quick Reference
One-Command Setup
# Complete pipeline with enhanced features
./run.sh all # Linux/macOS
.\run.ps1 -Stage all # Windows PowerShell
Essential Commands
# Enhanced document processing
./run.sh ingest # Process HTML, PDF, EPUB, etc.
./run.sh dedup # Remove duplicates intelligently
./run.sh train # Train your model
./run.sh gguf # Convert to GGUF format
./run.sh test # Run comprehensive tests
Documentation
- USAGE.md - Complete usage guide with examples
- PIPELINE_UPDATES.md - Recent feature updates
- INSTALL_TESSERACT.md - OCR setup guide
- data/README_INGESTION.md - Document ingestion details
Need Help?
- Check the Usage Guide for detailed examples
- Review logs in the logs/ directory
- Run tests: ./run.sh test
Get started by adding your documents to data/raw/ and running:
./run.sh all # Complete enhanced pipeline