Project description

createllm

A Python package that enables users to create and train their own large language models (LLMs) from scratch using custom datasets. This package provides a simplified approach to building, training, and deploying custom language models tailored to specific domains or use cases.

🎯 Core Purpose

createllm allows you to:

  • Train custom language models on your specific text data
  • Create domain-specific LLMs for specialized applications
  • Build and experiment with different model architectures
  • Deploy trained models for text generation tasks

✨ Key Features

  • 🔨 Build LLMs from scratch using your own text data
  • 🚀 Multi-threaded training for faster model development
  • 📊 Real-time training progress tracking
  • 🎛️ Configurable model architecture
  • 💾 Easy model saving and loading
  • 🎯 Custom text generation capabilities
  • 📈 Built-in performance monitoring

📋 Requirements

pip install torch torchvision tqdm dill
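
These are the runtime dependencies. Before training, it is worth confirming that PyTorch can see a GPU, since CPU-only training will be much slower; this quick check is independent of createllm:

import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))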

🚀 Quick Start Guide

1. Prepare Your Training Data

Place your training text in a file. The model learns from this text to generate similar content.

my_training_data.txt
├── Your custom text
├── Can be articles
├── Documentation
└── Any text content you want the model to learn from
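
If your source material is spread across several files, one option is to concatenate it into a single training file first. This is a minimal sketch; the raw_documents folder and file names are placeholders, and createllm only needs the resulting text file:

from pathlib import Path

# Gather all .txt files from a folder of source documents (placeholder path)
sources = sorted(Path("raw_documents").glob("*.txt"))

with open("my_training_data.txt", "w", encoding="utf-8") as out:
    for src in sources:
        # Blank line between documents keeps them visually separated
        out.write(src.read_text(encoding="utf-8").strip() + "\n\n")

print(f"Combined {len(sources)} documents into my_training_data.txt")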

2. Train Your Custom LLM

from createllm import ModelConfig, GPTTrainer, TextFileProcessor

# Initialize model configuration
config = ModelConfig(
    vocab_size=None,  # Will be automatically set based on your data
    n_embd=384,      # Embedding dimension
    block_size=256,  # Context window size
    n_layer=4,       # Number of transformer layers
    n_head=4        # Number of attention heads
)

# Create trainer instance
trainer = GPTTrainer(
    text_file="path/to/my_training_data.txt",
    learning_rate=3e-4,
    batch_size=64,
    max_iters=5000,
    eval_interval=500,
    saved_path="path/to/save/model"
)

# Start training
trainer.trainer()  # This will automatically process text and train the model

3. Use Your Trained Model

from createllm import LLMModel

# Load your trained model
model = LLMModel("path/to/saved/model")

# Generate text
generated_text = model.generate("Your prompt text")
print(generated_text)
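
Since generation only relies on the LLMModel.generate call shown above, you can reuse one loaded model across several prompts; the prompts below are placeholders:

from createllm import LLMModel

model = LLMModel("path/to/saved/model")

# Placeholder prompts; replace with prompts from your own domain
prompts = [
    "Once upon a time",
    "The key advantage of this approach is",
]

for prompt in prompts:
    print(f"--- {prompt} ---")
    print(model.generate(prompt))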

📝 Example Use Cases

  1. Domain-Specific Documentation Generator

# Train on technical documentation
trainer = GPTTrainer(
    text_file="technical_docs.txt",
    saved_path="tech_docs_model"
)
trainer.trainer()

  2. Custom Writing Style Model

# Train on a specific author's works
trainer = GPTTrainer(
    text_file="author_works.txt",
    saved_path="author_style_model"
)
trainer.trainer()

  3. Specialized Content Generator

# Train on a specific content type
trainer = GPTTrainer(
    text_file="specialized_content.txt",
    saved_path="content_model"
)
trainer.trainer()

⚙️ Model Configuration Options

Customize your model architecture based on your needs:

config = ModelConfig(
    n_embd=384,     # Larger for more complex patterns
    block_size=256, # Larger for longer context
    n_layer=8,      # More layers for deeper understanding
    n_head=8,       # More heads for better pattern recognition
    dropout=0.2     # Adjust for overfitting prevention
)
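
As a rough sizing aid, you can estimate how many parameters a given configuration implies. The helper below is a standard GPT-style back-of-the-envelope estimate, not part of createllm, and the vocabulary size is an assumption (createllm normally derives vocab_size from your data):

def approx_param_count(vocab_size, n_embd, block_size, n_layer):
    """Rough GPT-style parameter estimate (ignores biases and layer norms)."""
    embeddings = vocab_size * n_embd + block_size * n_embd  # token + position tables
    attention = 4 * n_embd * n_embd   # Q, K, V and output projections per layer
    mlp = 8 * n_embd * n_embd         # two linear layers with a 4x hidden expansion
    return embeddings + n_layer * (attention + mlp)

# Example: assuming a character-level vocabulary of ~100 symbols
print(f"~{approx_param_count(100, 384, 256, 8) / 1e6:.1f}M parameters")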

💡 Training Tips

  1. Data Quality

    • Clean your training data
    • Remove irrelevant content
    • Ensure consistent formatting
  2. Resource Management (an out-of-memory retry sketch follows these tips)

    trainer = GPTTrainer(
        batch_size=32,     # Reduce if running out of memory
        max_iters=5000,    # Increase for better learning
        eval_interval=500  # Monitor training progress
    )
    
  3. Model Size vs Performance

    • Smaller models (n_layer=4, n_head=4): Faster training, less complex patterns
    • Larger models (n_layer=8+, n_head=8+): Better understanding, more resource intensive
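
A practical way to apply tip 2 is to retry training with progressively smaller batch sizes when the GPU runs out of memory. This sketch assumes GPTTrainer accepts the same arguments as in the Quick Start and that PyTorch's out-of-memory error propagates out of trainer.trainer():

import torch
from createllm import GPTTrainer

for batch_size in (64, 32, 16):
    try:
        trainer = GPTTrainer(
            text_file="my_training_data.txt",
            learning_rate=3e-4,
            batch_size=batch_size,
            max_iters=5000,
            eval_interval=500,
            saved_path="my_model"
        )
        trainer.trainer()
        break  # training finished without exhausting GPU memory
    except torch.cuda.OutOfMemoryError:
        print(f"Out of memory at batch_size={batch_size}; retrying with a smaller batch")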

🔍 Monitoring Training

The training process provides real-time feedback:

step 0: train loss 4.1675, val loss 4.1681
step 500: train loss 2.4721, val loss 2.4759
step 1000: train loss 1.9842, val loss 1.9873
step 1500: train loss 1.1422, val loss 1.1422
...
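
If you redirect this output to a file (for example, python train.py > training.log), the loss curve can be recovered later with a small parser. The sketch below assumes only the log format shown above; the file name is a placeholder:

import re

pattern = re.compile(r"step (\d+): train loss ([\d.]+), val loss ([\d.]+)")

steps, train_losses, val_losses = [], [], []
with open("training.log", encoding="utf-8") as f:
    for line in f:
        match = pattern.search(line)
        if match:
            steps.append(int(match.group(1)))
            train_losses.append(float(match.group(2)))
            val_losses.append(float(match.group(3)))

# A widening gap between validation and training loss suggests overfitting
for step, train, val in zip(steps, train_losses, val_losses):
    print(f"step {step}: val - train = {val - train:.4f}")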

📁 Saved Model Structure

saved_model/
├── model.pt          # Model weights
├── encoder.pickle    # Text encoder
├── decoder.pickle    # Text decoder
└── config.json       # Model configuration
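
Assuming config.json is plain JSON and the encoder/decoder files are dill pickles (dill is among the listed dependencies), a saved model directory can be inspected directly; treat the exact layout as an implementation detail of createllm that may change between versions:

import json
from pathlib import Path

import dill
import torch

model_dir = Path("saved_model")

# Model configuration (assumed to be plain JSON)
config = json.loads((model_dir / "config.json").read_text())
print("Configuration:", config)

# Text encoder (assumed to be a dill pickle)
with open(model_dir / "encoder.pickle", "rb") as f:
    encoder = dill.load(f)
print("Encoder type:", type(encoder).__name__)

# Model weights checkpoint; on PyTorch 2.6+ you may need
# torch.load(..., weights_only=False) if a full model object was pickled
checkpoint = torch.load(model_dir / "model.pt", map_location="cpu")
print("Checkpoint type:", type(checkpoint).__name__)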

⚠️ Limitations

  • Training requires significant computational resources
  • Model quality depends on training data quality
  • Larger models require more training time and resources

🤝 Contributing

Contributions are welcome! Please feel free to submit pull requests.

📫 Support

For issues and questions, please open an issue in the GitHub repository.

📄 License

This project is licensed under the MIT License.

🙏 Acknowledgments

Based on the GPT architecture with modifications for custom training and ease of use.

Download files

Download the file for your platform.

Source Distribution

createllm-0.1.9.tar.gz (15.3 kB)

Built Distribution

createllm-0.1.9-py3-none-any.whl (7.2 kB)

File details

Details for the file createllm-0.1.9.tar.gz.

File metadata

  • Download URL: createllm-0.1.9.tar.gz
  • Size: 15.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.10.16

File hashes

Hashes for createllm-0.1.9.tar.gz:

  • SHA256: 6e085c221009389888d2a9bb09469e211230527513a58e37c0db67338d49ae72
  • MD5: 2e26a0cf1a44f20046d379b284196332
  • BLAKE2b-256: 871c317f1663999c1c0f07b625534c8fb1bff1e7142d385c211fec412f3a6108

File details

Details for the file createllm-0.1.9-py3-none-any.whl.

File metadata

  • Download URL: createllm-0.1.9-py3-none-any.whl
  • Size: 7.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.10.16

File hashes

Hashes for createllm-0.1.9-py3-none-any.whl:

  • SHA256: 48fce5a5315d36e4f408afeec37442e3a4560ec178dbd8b026850a97167937e5
  • MD5: 0abe100a5a5f760f2ee3c1890ef74779
  • BLAKE2b-256: cfdffc72da8a57ed2467ad43790ddde3d5a6787ddbc2fe279b46c108cec97275
