Standardized protocol to train models on Databiomes
View the full package documentation at: https://docs.databiomes.com/mtp/intro
Model Train Protocol (MTP)
MTP is an open-source protocol for training custom Language Models on Databiomes. MTP contains all the data that a model is trained on.
Getting Started
Note: Python 3.11 or higher is required.
Install the package:
For Linux and macOS
python3 -m pip install model-train-protocol
For Windows
py -3 -m pip install model-train-protocol
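Since the package needs Python 3.11 or newer, it can help to confirm the interpreter version before installing. The sketch below uses only the standard library; the helper name meets_minimum is illustrative and not part of MTP.

```python
import sys

MIN_VERSION = (3, 11)  # minimum stated in the note above


def meets_minimum(version_info=None, minimum=MIN_VERSION):
    """Return True if a (major, minor, ...) tuple satisfies the minimum version."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


print("Python version OK" if meets_minimum() else "Python 3.11+ required")
```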
See examples/example.py to follow along with these steps.
Creating a Model Train Protocol
The first step in creating a model training protocol is to initialize the Protocol:
import model_train_protocol as mtp
# Initialize the protocol
protocol = mtp.Protocol(name="my_model", inputs=2, encrypt=False)
The inputs parameter sets the number of lines in each Instruction's input. It must be at least 2.
encrypt is an optional flag depending on how you plan to export and use the protocol.
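The constraint on inputs can be pictured as a simple check. This is a hypothetical sketch, not the library's own validation: check_inputs and its error messages are assumptions made for illustration.

```python
def check_inputs(inputs: int, input_lines: list) -> None:
    """Raise ValueError if the declared input count or the supplied lines are invalid."""
    if inputs < 2:
        raise ValueError("inputs must be at least 2")
    if len(input_lines) != inputs:
        raise ValueError(f"expected {inputs} input lines, got {len(input_lines)}")


# Two lines for a protocol declared with inputs=2 pass silently
check_inputs(2, ["Can you tell me a way?", "Then it doesn't matter which way you go."])
```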
System Architecture
The MTP system is built on a hierarchical structure of four main components:
- Tokens - The fundamental building blocks
- TokenSets - Combinations of tokens that define input patterns
- Instructions - Training patterns that inform the model what to do
- Guardrails - Safety mechanisms for bad user prompts
Tokens: The Foundation
Tokens are the base building blocks of the MTP system. They represent words, symbols, concepts, or actions that the model will understand and use.
Token Types
Basic Token
The standard token for representing concepts, actions, or entities:
# Create basic tokens
cat = mtp.Token("Cat", desc="The Cheshire Cat")
alice = mtp.Token("Alice")  # referenced by the TokenSet examples later on
tree = mtp.Token("Tree", desc="Perched in a tree, surrounded by a dense fog where nothing can be seen past a few feet, the Cheshire Cat sits smiling on a branch.")
talk = mtp.Token("Talk")
ponder = mtp.Token("Ponder")
grin = mtp.Token("Grin")
add = mtp.Token("Add")
# A token can carry an optional key (symbol or emoji)
disappear = mtp.Token("Disappear", key="🫥")
FinalToken
A token that represents a model response choice:
# Create final tokens
token_continue = mtp.FinalToken("Continue")
token_appear = mtp.FinalToken("Appear")
NumToken
A token that can be associated with numerical values:
# Create a number token for sentence length
sentence_length = mtp.NumToken(value="SentenceLength", min_value=5, max_value=20)
NumListToken
A token that represents a list of numbers:
# Create a list-of-numbers token
coordinates = mtp.NumListToken(value="Coordinates", min_value=-1000, max_value=1000, length=3)
FinalNumToken
A final token that requires a numerical value in the output:
final_emotion = mtp.FinalNumToken(value="Madness", min_value=0, max_value=10)
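The min_value and max_value arguments describe a range for the number attached to a token. The sketch below shows the kind of bounds check this implies. It is a plain-Python illustration: in_range and list_in_range are hypothetical helpers, and treating both bounds as inclusive is an assumption.

```python
def in_range(number: float, min_value: float, max_value: float) -> bool:
    """Return True if number lies within [min_value, max_value] (bounds assumed inclusive)."""
    return min_value <= number <= max_value


def list_in_range(numbers: list, min_value: float, max_value: float, length: int) -> bool:
    """Return True if the list has exactly `length` items and each one is in range."""
    return len(numbers) == length and all(
        in_range(n, min_value, max_value) for n in numbers
    )


# Values mirroring the SentenceLength and Coordinates examples above
print(in_range(11, 5, 20))
print(list_in_range([100, 200, -50], -1000, 1000, 3))
```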
Token Properties
- value: The string identifier
- key: Optional unique symbol or emoji associated with the token
- desc: Optional description for complex tokens. Extends the value to contextualize its use.
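Because a key is meant to be unique, it can be useful to check a set of planned tokens for clashes before building the protocol. The following sketch works on plain (value, key) pairs rather than MTP objects. find_duplicate_keys and the "Vanish" token are hypothetical, and how the library itself handles duplicate keys is not covered here.

```python
from collections import defaultdict
from typing import Optional


def find_duplicate_keys(tokens: list) -> dict:
    """Map each key used more than once to the token values that share it."""
    by_key = defaultdict(list)
    for value, key in tokens:
        if key is not None:  # tokens without a key cannot clash
            by_key[key].append(value)
    return {key: values for key, values in by_key.items() if len(values) > 1}


# "Vanish" is an invented token that reuses the key of "Disappear"
planned = [("Disappear", "🫥"), ("Vanish", "🫥"), ("Talk", None)]
print(find_duplicate_keys(planned))
```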
TokenSets: Combining Tokens
TokenSets group multiple Tokens together to define specific input patterns. They represent the structure of data that will be fed to the model.
TokenSets are the basic building blocks of Instructions.
Creating TokenSets
# Create a TokenSet combining multiple tokens
tree_alice_talk = mtp.TokenSet(tokens=(tree, alice, talk))
# Create a TokenSet with sentence length
# (character and context are assumed to be Tokens created like the basic tokens above)
character_context_sentence = mtp.TokenSet(tokens=(character, context, sentence_length))
TokenSet Properties
- tokens: The tokens in the set (unordered)
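The note that tokens are unordered suggests that a TokenSet is identified by which tokens it contains, not by their position. A frozenset models that idea in plain Python. This is only an analogy under that assumption; it does not show how mtp.TokenSet compares or stores its tokens.

```python
# Token values stand in for Token objects in this illustration
tree_cat_talk = frozenset({"Tree", "Cat", "Talk"})
talk_tree_cat = frozenset(["Talk", "Tree", "Cat"])

# Same members listed in a different order compare equal
same_set = tree_cat_talk == talk_tree_cat

# A set with a different member is a different input pattern
tree_alice_talk = frozenset({"Tree", "Alice", "Talk"})
different_set = tree_cat_talk != tree_alice_talk

print(same_set, different_set)
```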
Creating Snippets
Snippets are created from TokenSets and serve as training samples.
A Snippet is a concrete example of a TokenSet. Snippets show the model the context in which an input pattern is used.
# Create a snippet with just text
snippet = tree_alice_talk.create_snippet(string="Where am I?")
# Create a snippet with text and numbers
snippet_with_length = character_context_sentence.create_snippet(
    string="The enemy must be here somewhere.",
    numbers=11
)
# Create a snippet with a list of numbers
coordinates_token_set = mtp.TokenSet(tokens=(tree, cat, coordinates))
snippet_with_list = coordinates_token_set.create_snippet(
    string="The location is locked.",
    number_lists=[100, 200, -50]
)
Instructions: Training Patterns
Instructions define how the model should respond to different input patterns. There are two main types of instructions; this section documents the standard Instruction.
Instruction
Parameters
- input: An InstructionInput that lists the input TokenSets (order matters)
- output: An InstructionOutput that specifies the output TokenSet and final token(s)
- context: Background text that sets the scene for the instruction
- name: A unique name for the instruction (required)
Create the Instruction
# Create TokenSets
tree_cat_talk = mtp.TokenSet(tokens=(tree, cat, talk))
tree_alice_talk = mtp.TokenSet(tokens=(tree, alice, talk))
# Construct the input format
alice_cat_input = mtp.InstructionInput(tokensets=[tree_cat_talk, tree_alice_talk])
# Construct the output format
cat_continue_output = mtp.InstructionOutput(tokenset=tree_cat_talk, final=token_continue)
# Create the instruction
alice_cat_instruction_continue = mtp.Instruction(
    input=alice_cat_input,
    output=cat_continue_output,
    context=[
        "Alice was beginning to get very tired of sitting by her sister on the bank.",
        "There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself."
    ],
    name="alice_cat_continue"
)
Adding Samples
add_sample() parameters:
- input_snippets: List of input snippets or strings (must match the input TokenSets)
- output_snippet: The model's output snippet or string
- final: Required if the output allows multiple final tokens
- output_value: Required if the final token is a FinalNumToken
# Text-only samples can be passed as strings
alice_cat_instruction_continue.add_sample(
    input_snippets=["Then it doesn't matter which way you go.", "Can you tell me a way?"],
    output_snippet="Oh sure, if you only walk long enough that is a way."
)
# When numbers are involved, create snippets using the TokenSet
tree_cat_talk_coordinates = mtp.TokenSet(tokens=(tree, cat, talk, coordinates))
sample_input = tree_cat_talk_coordinates.create_snippet(
    string="Then it doesn't matter which way you go.",
    number_lists=[100, 200, -50]
)
sample_output = tree_cat_talk.create_snippet(
    string="Oh sure, if you only walk long enough that is a way."
)
alice_cat_instruction_continue.add_sample(
    input_snippets=[sample_input, "Can you tell me a way?"],
    output_snippet=sample_output
)
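The add_sample() parameter rules can be summarized as a small pre-flight check. This sketch uses plain values instead of MTP objects; check_sample_args, its arguments, and its messages are hypothetical, and the library may enforce these rules differently.

```python
from typing import Optional


def check_sample_args(
    n_input_tokensets: int,
    input_snippets: list,
    allowed_finals: list,
    final: Optional[str] = None,
    final_is_numeric: bool = False,
    output_value: Optional[float] = None,
) -> list:
    """Return a list of problems with a planned add_sample() call (empty if none)."""
    problems = []
    if len(input_snippets) != n_input_tokensets:
        problems.append("input_snippets must match the input TokenSets")
    if len(allowed_finals) > 1 and final is None:
        problems.append("final is required when several final tokens are allowed")
    if final_is_numeric and output_value is None:
        problems.append("output_value is required for a FinalNumToken")
    return problems


# Mirrors a text-only sample: two inputs and a single allowed final token
print(check_sample_args(2, ["first input line", "second input line"], ["Continue"]))
```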
Guardrails: Safety Mechanisms
Guardrails provide safety mechanisms for user interactions by defining what constitutes good vs. bad user prompts and how the model should respond to inappropriate inputs.
Creating Guardrails
# Create a guardrail
guardrail = mtp.Guardrail(
    good_prompt="Quote being spoken with 1-20 words",
    bad_prompt="Quote being spoken that is irrelevant and off topic with 1-20 words",
    bad_output="Are you as mad as me?"
)
# Add examples of bad prompts
guardrail.add_sample("explain quantum mechanics.")
guardrail.add_sample("who will win the next american election?")
guardrail.add_sample("what is the capital of Spain?")
Applying Guardrails
Guardrails are applied to a specific input TokenSet within an Instruction.
# Apply the guardrail to the 2nd TokenSet in the instruction input (tokenset_index is zero-based)
alice_cat_instruction_continue.add_guardrail(guardrail=guardrail, tokenset_index=1)
Guardrail Requirements
- good_prompt: Description of what makes a good prompt
- bad_prompt: Description of what makes a bad prompt
- bad_output: The response the model should give to bad prompts
- samples: Minimum 3 examples of bad prompts (no digits are allowed in the bad prompt examples)
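The sample requirements (at least three examples, none containing digits) are easy to check before calling add_sample(). The helper below is a hypothetical plain-Python sketch, not part of MTP, and the library may report these problems in its own way.

```python
def check_guardrail_samples(samples: list, minimum: int = 3) -> list:
    """Return a list of problems with bad-prompt examples (empty if none)."""
    problems = []
    if len(samples) < minimum:
        problems.append(f"need at least {minimum} samples, got {len(samples)}")
    for sample in samples:
        if any(ch.isdigit() for ch in sample):
            problems.append(f"sample contains a digit: {sample!r}")
    return problems


samples = [
    "explain quantum mechanics.",
    "who will win the next american election?",
    "what is the capital of Spain?",
]
print(check_guardrail_samples(samples))
```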
Saving Your Model
Once you've created your tokens, instructions, and guardrails, you can save your model training protocol:
# Save the protocol
protocol.save()
# Generate the template file
protocol.template()
Generated Files
When you save your model, two files are created:
1. {name}_model.json
This is the main model training protocol file that contains:
- Context: All background information you added with protocol.add_context()
- Tokens: All your custom tokens with their keys and properties
- Special Tokens: System tokens like <BOS>, <EOS>, <RUN>, <PAD>
- Instructions: All your training patterns and samples
- Guardrails: Safety mechanisms for user interactions
- Numbers: Number ranges for NumTokens
This file is what you submit to Databiomes for model training.
2. {name}_template.json
This is a reference file that shows:
- Example Usage: Valid input/output format for your model
- All Combinations: Complete list of all possible token combinations
- Model Input/Output: Structure showing how data flows through your model
Use this file to understand how your model expects to receive and format data.
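Given the {name}_model.json and {name}_template.json naming pattern, a short script can confirm that both files exist after saving. This sketch assumes the files are written to the current working directory, which is not stated here; expected_files is a hypothetical helper.

```python
from pathlib import Path


def expected_files(name: str, directory: str = "."):
    """Return the model and template paths implied by the {name}_*.json pattern."""
    base = Path(directory)
    return base / f"{name}_model.json", base / f"{name}_template.json"


# "my_model" matches the name passed to mtp.Protocol in the earlier example
model_path, template_path = expected_files("my_model")
for path in (model_path, template_path):
    status = "found" if path.exists() else "missing"
    print(f"{path}: {status}")
```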
Schema Files
JSON Schema files are available in schemas/{version}/ directories for protocol and template validation.
The template file helps you understand the expected format when using your trained model, while the model file contains all the training data needed to create your specialized language model.