A Library for Training Language Models with LSTM in TensorFlow
WordCraft - Language Model Training Library
WordCraft is a Python library designed to train language models using LSTM (Long Short-Term Memory) networks. It provides utilities to load text data, preprocess it for training, build and train a model, and generate text based on trained models. The library also supports saving and loading models and tokenizers in a zip format.
Features
- Data Loading: Load text data from a file.
- Preprocessing (Optional): Preprocess the data by tokenizing and preparing it for training.
- Model Building (Optional): Build a customizable LSTM-based language model.
- Training: Train the model using the processed data.
- Text Generation: Generate text from a trained model based on a seed text.
- Model Saving & Loading: Save and load the trained model and tokenizer in a zip file for easy distribution.
Installation
To use WordCraft, install the package via pip:
pip install wordcraft
Usage
1. Initialize the Library
from wordcraft import WordCraft
# Create an instance of the WordCraft class
wc = WordCraft()
# Load your text data
wc.load_data("your_text_file.txt")
2. Optional: Preprocess Data and Build the Model
You can optionally preprocess the data and customize the model architecture before training.
Preprocess Data (Optional)
Preprocessing prepares the text data for training by tokenizing it into input-output pairs.
# Optional: Preprocess data and prepare it for training
wc.preprocess_data()
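To make "input-output pairs" concrete, here is a minimal sketch of one common way to build them for next-word prediction: each prefix of a tokenized line becomes an input, and the word that follows becomes the target. This is an illustration only. The helper name make_pairs and the whitespace tokenizer are assumptions, and WordCraft's own preprocess_data may work differently.

```python
def make_pairs(line):
    """Turn one line of text into (input_words, next_word) training pairs."""
    # Assumption: lowercase whitespace splitting stands in for a real tokenizer.
    words = line.lower().split()
    pairs = []
    for i in range(1, len(words)):
        # The input is every word seen so far; the target is the word that follows.
        pairs.append((words[:i], words[i]))
    return pairs


pairs = make_pairs("Once upon a time")
for context, target in pairs:
    print(context, "->", target)
```

A line with n words yields n - 1 pairs, so short lines contribute very little training signal.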
Build the Model (Optional)
The model can be customized with the following parameters:
- embedding_dim: The dimension of the embedding layer.
- lstm_units: The number of LSTM units in the model.
You can either use the default model or customize the architecture.
# Optional: Build the model (default or customizable)
wc.build_model(embedding_dim=128, lstm_units=256)
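To get a feel for how these two parameters affect model size, the sketch below estimates trainable parameter counts. It assumes a typical layout of one embedding layer, one LSTM layer, and a dense softmax layer over the vocabulary; the README does not state WordCraft's actual architecture, and the function name count_parameters and the vocabulary size are hypothetical.

```python
def count_parameters(vocab_size, embedding_dim, lstm_units):
    """Estimate trainable parameters for Embedding -> LSTM -> Dense(softmax)."""
    # Embedding: one vector of size embedding_dim per vocabulary entry.
    embedding = vocab_size * embedding_dim
    # LSTM: four gates, each with input weights, recurrent weights, and a bias.
    lstm = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
    # Dense output layer: one weight per (unit, word) plus one bias per word.
    dense = lstm_units * vocab_size + vocab_size
    return {
        "embedding": embedding,
        "lstm": lstm,
        "dense": dense,
        "total": embedding + lstm + dense,
    }


# Hypothetical vocabulary of 5,000 words with embedding_dim=128, lstm_units=256.
print(count_parameters(vocab_size=5000, embedding_dim=128, lstm_units=256))
```

Under these assumptions the embedding and output layers grow with the vocabulary, while the LSTM layer grows roughly with the square of lstm_units.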
3. Train the Model
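Train the model on the loaded data. If you skipped the optional steps above, train can be called directly after load_data, as the first example under Example Usage shows. Set epochs (the number of passes over the training data) and batch_size (the number of samples per weight update) as needed.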
# Train the model
wc.train(epochs=10, batch_size=32)
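As a rough guide to how epochs and batch_size interact, the sketch below computes the number of weight updates a training run performs. It assumes train follows the usual Keras convention of one full pass over the data per epoch; the sample count and the helper name steps_per_epoch are hypothetical.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of weight updates in one pass over the training pairs."""
    # The last batch may be smaller than batch_size, hence the ceiling.
    return math.ceil(num_samples / batch_size)


num_samples = 10_000  # hypothetical number of input-output pairs
per_epoch = steps_per_epoch(num_samples, batch_size=32)
print(per_epoch, "steps per epoch,", per_epoch * 10, "steps over 10 epochs")
```

A larger batch_size means fewer, larger updates per epoch and higher memory use per step.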
4. Generate Text
After training the model, you can generate text using a seed phrase.
# Generate text based on a seed text
generated_text = wc.generate_text("Once upon a time", max_length=50)
print(generated_text)
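The general shape of seed-based generation can be sketched as a loop that repeatedly predicts the next word and appends it to the text. The code below is not WordCraft's implementation: a small lookup table stands in for the trained model, the names generate and toy_predict are hypothetical, and it assumes max_length counts generated words, which the README does not specify.

```python
def generate(seed_text, predict_next, max_length):
    """Append up to max_length predicted words to seed_text."""
    words = seed_text.split()
    for _ in range(max_length):
        next_word = predict_next(words)
        if next_word is None:  # the stand-in model has no continuation
            break
        words.append(next_word)
    return " ".join(words)


# Toy stand-in for a trained model: a fixed next-word lookup on the last word.
toy_table = {"time": "there", "there": "was", "was": "a", "a": "knight"}


def toy_predict(words):
    return toy_table.get(words[-1]) if words else None


print(generate("Once upon a time", toy_predict, max_length=3))
```

A real model would replace toy_predict with a call that encodes the current words, runs the network, and picks a word from the predicted distribution.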
5. Save and Load the Model
Save the Model
You can save the trained model and tokenizer to a zip file.
# Save the model and tokenizer to a zip file
wc.save_model("my_language_model")
Load the Model
You can load the saved model and tokenizer for further use or text generation.
# Load the saved model and tokenizer from a zip file
wc.load_model("my_language_model")
# Use the model to generate text
generated_text = wc.generate_text("Once upon a time", max_length=50)
print(generated_text)
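For readers curious how a model and tokenizer can travel together in one zip file, here is a minimal round-trip sketch using only the Python standard library. The entry names model.bin and tokenizer.json, the placeholder bytes, and the helper names are all hypothetical; the layout WordCraft actually writes is not documented here.

```python
import json
import os
import tempfile
import zipfile


def save_bundle(zip_path, model_bytes, word_index):
    """Write a model file and a tokenizer vocabulary into one zip archive."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("model.bin", model_bytes)  # hypothetical entry name
        zf.writestr("tokenizer.json", json.dumps(word_index))  # hypothetical entry name


def load_bundle(zip_path):
    """Read back the two entries written by save_bundle."""
    with zipfile.ZipFile(zip_path) as zf:
        model_bytes = zf.read("model.bin")
        word_index = json.loads(zf.read("tokenizer.json"))
    return model_bytes, word_index


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my_language_model.zip")
    save_bundle(path, b"placeholder weights", {"once": 1, "upon": 2})
    model_bytes, word_index = load_bundle(path)
    print(word_index)
```

Keeping the tokenizer next to the weights matters because generated word indices are only meaningful with the vocabulary the model was trained on.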
Example Usage
1. Example: Without Preprocessing or Custom Model Building
In this example, we load the data, train the model, and generate text without preprocessing or customizing the model architecture.
from wordcraft import WordCraft
# Create an instance of the WordCraft class
wc = WordCraft()
# Load the text data
wc.load_data("your_text_file.txt")
# Train the model without preprocessing or custom model building
wc.train(epochs=10, batch_size=32)
# Generate text based on a seed
generated_text = wc.generate_text("Once upon a time", max_length=50)
print(generated_text)
2. Example: With Preprocessing and Custom Model Building
In this example, we preprocess the data, build a custom model, train it, and generate text.
from wordcraft import WordCraft
# Create an instance of the WordCraft class
wc = WordCraft()
# Load the text data
wc.load_data("your_text_file.txt")
# Optional: Preprocess data
wc.preprocess_data()
# Optional: Build the custom model (using different embedding dimensions and LSTM units)
wc.build_model(embedding_dim=256, lstm_units=512)
# Train the model
wc.train(epochs=10, batch_size=32)
# Generate text based on a seed
generated_text = wc.generate_text("In a faraway land", max_length=100)
print(generated_text)
Dataset File Structure
The text file used for training (e.g., your_text_file.txt) should contain raw text data. Each line in the file represents a portion of text that the model will learn to predict.
Example Dataset File Structure:
Once upon a time, in a land far away,
There was a brave knight who ventured into the forest.
The sun was setting, and the sky was painted in hues of orange and pink.
...
The library will read the entire text file, split it into tokens (words), and prepare them for training. Ensure that your dataset is large enough to train a meaningful model.
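A quick way to judge whether a dataset is large enough is to count its tokens and distinct words before training. The sketch below does this with lowercase whitespace splitting, which only approximates whatever tokenizer the library uses; the function name dataset_stats is hypothetical.

```python
from collections import Counter


def dataset_stats(text):
    """Count total and distinct word tokens in raw training text."""
    # Assumption: lowercase whitespace splitting approximates the tokenizer.
    tokens = text.lower().split()
    counts = Counter(tokens)
    return {
        "tokens": len(tokens),
        "vocabulary": len(counts),
        "most_common": counts.most_common(3),
    }


sample = (
    "Once upon a time, in a land far away,\n"
    "There was a brave knight who ventured into the forest.\n"
)
print(dataset_stats(sample))
```

For a real file, read its contents first, for example with open("your_text_file.txt", encoding="utf-8").read(), and pass the resulting string to dataset_stats.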
Requirements
- Python 3.x
- TensorFlow 2.x
- NumPy
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing
Feel free to fork the repository and submit pull requests. If you find any bugs or have feature requests, please open an issue.
Contact
For questions or suggestions, contact me at bandinvisible8@gmail.com.
File details
Details for the file wordcraft-0.1-py3-none-any.whl.
File metadata
- Download URL: wordcraft-0.1-py3-none-any.whl
- Upload date:
- Size: 5.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0c0349201ac12cf57cc618e64b8bcf6847a2f9ccc6b7b3aaecf05f016db44c22 |
| MD5 | c592b9db5bb580e2010e8a53f3de30f9 |
| BLAKE2b-256 | e68b34018111ade629db53fff99d5d56d3c8e47aa7252f352ffb2a068bc58194 |