Standardized protocol to train models on Databiomes

These details have been verified by PyPI

Project links

Owner

Databiomes

GitHub Statistics

Maintainers

klempka

These details have not been verified by PyPI

Project links

Documentation

License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

View the full package documentation at: https://docs.databiomes.com/mtp/intro

Model Train Protocol (MTP)

MTP is an open-source protocol for training custom Language Models on Databiomes. MTP contains all the data that a model is trained on.

Getting Started

Install the package:

For Linux and macOS

python3 -m pip install model-train-protocol

For Windows

py -3 -m pip install model-train-protocol

See examples/example.py to follow along with these steps.
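To confirm that the install succeeded, one option is to ask Python's packaging metadata for the installed version. This is a minimal sketch using only the standard library; the helper name `installed_version` is made up for this example and is not part of MTP.

```python
from importlib import metadata


def installed_version(dist_name="model-train-protocol"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string such as "0.4.5" when installed, otherwise None.
print(installed_version())
```

A result of None usually means pip installed the package into a different interpreter than the one running the script.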

Creating a Model Train Protocol

The first step in creating a model training protocol is to initialize the Protocol:

import model_train_protocol as mtp

# Initialize the protocol
protocol = mtp.Protocol(name="my_model", inputs=2)

The inputs parameter sets the number of lines in each Instruction's input. It must be at least 2.

System Architecture

The MTP system is built on a hierarchical structure of four main components:

- Tokens: The fundamental building blocks
- TokenSets: Combinations of tokens that define input patterns
- Instructions: Training patterns that inform the model what to do
- Guardrails: Safety mechanisms for bad user prompts
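The way these four components nest can be pictured with plain dataclasses. This is only an illustrative sketch of the relationships described in this guide, not the library's implementation; the `Sketch*` class names are invented here so they are not mistaken for the real `mtp` classes.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SketchToken:
    value: str
    key: Optional[str] = None
    desc: Optional[str] = None


@dataclass
class SketchGuardrail:
    good_prompt: str
    bad_prompt: str
    bad_output: str
    samples: List[str] = field(default_factory=list)


@dataclass
class SketchTokenSet:
    tokens: Tuple[SketchToken, ...]
    guardrail: Optional[SketchGuardrail] = None  # at most one per TokenSet


@dataclass
class SketchInstruction:
    name: str
    context: List[SketchTokenSet]
    response: SketchTokenSet
    final: SketchToken


tree = SketchToken("Tree")
cat = SketchToken("Cat", desc="The Cheshire Cat")
ponder = SketchToken("Ponder")
grin = SketchToken("Grin")
disappear = SketchToken("Disappear", key="🫥")

instruction = SketchInstruction(
    name="cat_pondering_instruction_disappear",
    context=[SketchTokenSet((tree, cat, ponder))],
    response=SketchTokenSet((tree, cat, grin)),
    final=disappear,
)

# Walk the hierarchy from the instruction down to individual token values.
context_values = [t.value for ts in instruction.context for t in ts.tokens]
print(context_values)
```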

Tokens: The Foundation

Tokens are the base building blocks of the MTP system. They represent words, symbols, concepts, or actions that the model will understand and use.

Token Types

Basic Token

The standard token for representing concepts, actions, or entities:

# Create a basic token
cat = mtp.Token("Cat", desc="The Cheshire Cat")
tree = mtp.Token("Tree", desc="Perched in a tree, surrounded by a dense fog where nothing can be seen past a few feet, the Cheshire Cat sits smiling on a branch.")
talk = mtp.Token("Talk")
ponder = mtp.Token("Ponder")
grin = mtp.Token("Grin")
add = mtp.Token("Add")
disappear = mtp.Token("Disappear", key="🫥")

UserToken

A specialized token that represents user input. These tokens are used when the model needs to respond to user prompts:

# Create a user token
alice = mtp.UserToken("Alice")

NumToken

A token that can be associated with numerical values:

# Create a number token for sentence length
sentence_length = mtp.NumToken(value="SentenceLength", min_value=5, max_value=20)

Token Properties

- value: The string identifier
- key: Optional unique symbol or emoji associated with the token
- desc: Optional description for complex tokens. Extends the value to contextualize its use.
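Because key is described as unique, it can help to check a batch of planned tokens for clashing keys before building a protocol. The sketch below works on plain (value, key) pairs rather than on the library's Token objects, and the function name `duplicate_keys` is hypothetical. Whether MTP itself rejects duplicate keys is not stated in this guide.

```python
from collections import Counter


def duplicate_keys(tokens):
    """Return the keys used by more than one token.

    tokens is an iterable of (value, key) pairs; key may be None for
    tokens that do not set one.
    """
    counts = Counter(key for _, key in tokens if key is not None)
    return sorted(key for key, count in counts.items() if count > 1)


planned = [("Disappear", "🫥"), ("Talk", None), ("Vanish", "🫥")]
print(duplicate_keys(planned))
```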

TokenSets: Combining Tokens

TokenSets group multiple Tokens together to define specific input patterns. They represent the structure of data that will be fed to the model.

TokenSets are the basic building blocks of Instructions.

Creating TokenSets

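# Create a TokenSet combining multiple tokens
tree_alice_talk = mtp.TokenSet(tokens=(tree, alice, talk))

# character and context are not defined earlier in this guide; they are assumed
# here to be ordinary Tokens, created only so that the example runs
character = mtp.Token("Character")
context = mtp.Token("Context")

# Create a TokenSet with sentence length
character_context_sentence = mtp.TokenSet(tokens=(character, context, sentence_length))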

TokenSet Properties

tokens: The tokens in the set (unordered)
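One way to read "unordered" is that two TokenSets built from the same tokens in a different order describe the same pattern. The sketch below illustrates that idea with plain strings and a frozenset. It is an interpretation for illustration only, and it does not show how the library compares TokenSets internally.

```python
def same_token_set(first, second):
    """Compare two token collections while ignoring order."""
    return frozenset(first) == frozenset(second)


print(same_token_set(("Tree", "Alice", "Talk"), ("Talk", "Tree", "Alice")))
```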

Creating Snippets

Snippets are created on TokenSets to create training samples.

A Snippet is an example of a TokenSet. Snippets tell the model the context of the input patterns.

# Create a snippet with just text
snippet = tree_alice_talk.create_snippet(string="Where am I?")

# Create a snippet with text and sentence length
snippet_with_length = character_context_sentence.create_snippet(string="The enemy must be here somewhere.", numbers=[11])
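When a TokenSet contains a NumToken, each snippet carries numbers alongside its text. The sketch below shows one plausible pre-check, assuming one number per NumToken and inclusive min_value and max_value bounds. Both assumptions, and the names `NumRange` and `check_snippet_numbers`, are illustrative and not taken from the library.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class NumRange:
    """Stand-in for the range carried by a NumToken."""
    name: str
    min_value: int
    max_value: int


def check_snippet_numbers(ranges: Sequence[NumRange], numbers: Sequence[int]) -> List[str]:
    """Return a list of problems; an empty list means the numbers look valid."""
    if len(numbers) != len(ranges):
        return [f"expected {len(ranges)} number(s), got {len(numbers)}"]
    problems = []
    for rng, number in zip(ranges, numbers):
        # Assumption: both bounds are inclusive.
        if not rng.min_value <= number <= rng.max_value:
            problems.append(
                f"{rng.name}={number} is outside [{rng.min_value}, {rng.max_value}]"
            )
    return problems


sentence_length = NumRange("SentenceLength", min_value=5, max_value=20)
print(check_snippet_numbers([sentence_length], [11]))
```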

Instructions: Training Patterns

Instructions define how the model should respond to different input patterns. There are two main types of instructions: Instruction and ExtendedInstruction.

Instruction

Parameters

- context: Sequence of TokenSets that provide background information
- response: The TokenSet that defines the model's response pattern (cannot contain UserTokens)
- final: A Token that represents the final action or result
- name: A unique name for the instruction (required)
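The rule that response cannot contain UserTokens can be checked ahead of time. A minimal sketch, using a stand-in token class with an `is_user` flag in place of the library's UserToken type (both the class and the flag are assumptions made for this example):

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SketchToken:
    value: str
    is_user: bool = False  # True stands in for a UserToken


def response_has_user_token(response_tokens: Iterable[SketchToken]) -> bool:
    """Return True if any token in the response set is a user token."""
    return any(token.is_user for token in response_tokens)


tree = SketchToken("Tree")
cat = SketchToken("Cat")
grin = SketchToken("Grin")
alice = SketchToken("Alice", is_user=True)

valid_response = (tree, cat, grin)
invalid_response = (tree, alice, grin)
print(response_has_user_token(valid_response), response_has_user_token(invalid_response))
```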

Create the Instruction

For scenarios where the model responds without user input:

# Create TokenSets
cat_pondering = mtp.TokenSet(tokens=(tree, cat, ponder))
cat_grinning = mtp.TokenSet(tokens=(tree, cat, grin))

# Create a simple instruction for the Cat's internal thoughts
cat_pondering_instruction_disappear = mtp.Instruction(
    context=[cat_pondering],
    response=cat_grinning,
    final=disappear,
    name="cat_pondering_instruction_disappear"
)

Adding Samples

add_sample() parameters:
- input_snippets: List of context snippets that will be added to the Instruction
- output_snippet: The model's output snippet
- value: Optional numerical value (required if final Token is a NumToken)

# Samples must be made on their associated TokenSets
sample_context = cat_pondering.create_snippet(
    string="Why do I keep vanishing and reappearing so suddenly?"
)
sample_output = cat_grinning.create_snippet(
    string="Because it amuses me, and it keeps everyone wondering whether I'm truly here at all."
)

cat_pondering_instruction_disappear.add_sample(
    input_snippets=[sample_context],
    output_snippet=sample_output
)

ExtendedInstruction

Parameters

- context: Sequence of TokenSets that provide background information (the last TokenSet must include at least one UserToken)
- final: A Token that represents the final action or result
- name: A unique name for the instruction (required)
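The constraint on the last context TokenSet can be expressed as a small check. This sketch represents each TokenSet as a tuple of token values and takes the set of user token values separately. That representation and the function name are assumptions made for illustration, not the library's data model.

```python
from typing import Sequence, Set


def last_context_has_user_token(
    context: Sequence[Sequence[str]], user_tokens: Set[str]
) -> bool:
    """Return True if the final TokenSet in context contains a user token."""
    if not context:
        return False
    return any(value in user_tokens for value in context[-1])


alice_talk = ("Tree", "Alice", "Talk")
cat_talk = ("Tree", "Cat", "Talk")
user_tokens = {"Alice"}

print(last_context_has_user_token([alice_talk, cat_talk, alice_talk], user_tokens))
```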

Create the ExtendedInstruction

For scenarios where the model responds to user prompts:

# Create TokenSets for Alice and Cat interaction
alice_talk = mtp.TokenSet(tokens=(tree, alice, talk))
cat_talk = mtp.TokenSet(tokens=(tree, cat, talk))

# Create a user instruction for Alice asking the Cat questions
alice_cat_instruction_leave = mtp.ExtendedInstruction(
    context=[alice_talk, cat_talk, alice_talk],  # Last TokenSet must contain at least one UserToken
    final=disappear,
    name="alice_cat_instruction_leave"
)

Adding Samples

add_sample() parameters:
- input_snippets: List of context snippets that will be added to the Instruction (must match the context TokenSets)
- response_string: The response provided by the model as a string
- value: Optional numerical value (required if final Token is a NumToken)

# Samples must be made on their associated TokenSets
sample_context_1 = alice_talk.create_snippet(
    string="I don't much care where—"
)
sample_context_2 = cat_talk.create_snippet(
    string="Then it doesn't matter which way you go."
)
sample_context_3 = alice_talk.create_snippet(
    string="Can you tell me which way I ought to go?"
)

alice_cat_instruction_leave.add_sample(
    input_snippets=[sample_context_1, sample_context_2, sample_context_3],
    response_string="Then I'll do it twice as much, since nervousness is such a curious flavor."
)
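Since the input snippets must match the context TokenSets, a quick positional comparison can catch snippets passed in the wrong order. The sketch below assumes that "match" means the same count and the same order, and it identifies each TokenSet by a plain label. Both are simplifications for illustration.

```python
from typing import List, Sequence


def mismatched_positions(
    context_labels: Sequence[str], snippet_sources: Sequence[str]
) -> List[int]:
    """Return the positions where a snippet's TokenSet differs from the context."""
    if len(context_labels) != len(snippet_sources):
        raise ValueError(
            f"expected {len(context_labels)} snippets, got {len(snippet_sources)}"
        )
    return [
        index
        for index, (expected, actual) in enumerate(zip(context_labels, snippet_sources))
        if expected != actual
    ]


context_labels = ["alice_talk", "cat_talk", "alice_talk"]
print(mismatched_positions(context_labels, ["alice_talk", "cat_talk", "alice_talk"]))
```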

Guardrails: Safety Mechanisms

Guardrails provide safety mechanisms for user interactions by defining what constitutes good vs. bad user prompts and how the model should respond to inappropriate inputs.

Creating Guardrails

# Create a guardrail
guardrail = mtp.Guardrail(
    good_prompt="Quote being spoken with 1-20 words",
    bad_prompt="Quote being spoken that is irrelevant and off topic with 1-20 words",
    bad_output="Are you as mad as me?"
)

# Add examples of bad prompts
guardrail.add_sample("explain quantum mechanics.")
guardrail.add_sample("who will win the next american election?")
guardrail.add_sample("what is the capital of Spain?")

Applying Guardrails

Guardrails are applied to TokenSets that contain user tokens.

A TokenSet can have at most one guardrail, but guardrails can be reused.

# Apply guardrails to a user TokenSet
tree_alice_talk.set_guardrail(guardrail)

Guardrail Requirements

- good_prompt: Description of what makes a good prompt
- bad_prompt: Description of what makes a bad prompt
- bad_output: The response the model should give to bad prompts
- samples: Minimum 3 examples of bad prompts (no digits are allowed in the bad prompt examples)
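The two sample rules (at least 3 examples, no digits) are easy to check before the samples are added. A minimal sketch in plain Python; the function name is hypothetical, and the library may report these problems differently.

```python
from typing import List, Sequence


def guardrail_sample_problems(samples: Sequence[str], minimum: int = 3) -> List[str]:
    """Return a list of rule violations; an empty list means the samples pass."""
    problems = []
    if len(samples) < minimum:
        problems.append(f"need at least {minimum} samples, got {len(samples)}")
    for sample in samples:
        if any(character.isdigit() for character in sample):
            problems.append(f"sample contains a digit: {sample!r}")
    return problems


bad_prompts = [
    "explain quantum mechanics.",
    "who will win the next american election?",
    "what is the capital of Spain?",
]
print(guardrail_sample_problems(bad_prompts))
```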

Saving Your Model

Once you've created your tokens, instructions, and guardrails, you can save your model training protocol:

# Save the protocol
protocol.save()
protocol.template()

Generated Files

When you save your model, two files are created:

1. `{name}_model.json`

This is the main model training protocol file that contains:

- Context: All background information you added with protocol.add_context()
- Tokens: All your custom tokens with their keys and properties
- Special Tokens: System tokens like <BOS>, <EOS>, <RUN>, <PAD>
- Instructions: All your training patterns and samples
- Guardrails: Safety mechanisms for user interactions
- Numbers: Number ranges for NumTokens

This file is what you submit to Databiomes for model training.

2. `{name}_template.json`

This is a reference file that shows:

- Example Usage: Valid input/output format for your model
- All Combinations: Complete list of all possible token combinations
- Model Input/Output: Structure showing how data flows through your model

Use this file to understand how your model expects to receive and format data.
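After saving, the two files can be located and inspected with the standard library. The sketch below derives the file names from the `{name}_model.json` and `{name}_template.json` patterns above and lists whatever top-level keys a file contains, without assuming what those keys are called. The output directory is an assumption, since this guide does not say where the files are written.

```python
import json
from pathlib import Path


def generated_paths(name, directory="."):
    """Return the expected (model, template) file paths for a protocol name."""
    base = Path(directory)
    return base / f"{name}_model.json", base / f"{name}_template.json"


def top_level_keys(path):
    """Return the sorted top-level keys of a JSON object stored at path."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    return sorted(data) if isinstance(data, dict) else []


model_path, template_path = generated_paths("my_model")
print(model_path.name, template_path.name)
```

With the protocol name my_model used earlier, the main file would be named my_model_model.json.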

Schema Files

JSON Schema files are available in schemas/{version}/ directories for protocol and template validation.

The template file helps you understand the expected format when using your trained model, while the model file contains all the training data needed to create your specialized language model.


Release history

- 0.8.8: Apr 30, 2026
- 0.8.7: Apr 30, 2026
- 0.8.6: Apr 30, 2026
- 0.8.5: Apr 29, 2026
- 0.8.4: Apr 28, 2026
- 0.8.3: Apr 28, 2026
- 0.8.2: Apr 16, 2026
- 0.8.1: Apr 15, 2026
- 0.8.0: Apr 15, 2026
- 0.7.0: Apr 13, 2026
- 0.6.1: Apr 2, 2026
- 0.6.0: Mar 27, 2026
- 0.5.6: Mar 18, 2026
- 0.5.5: Mar 17, 2026
- 0.5.4: Mar 11, 2026
- 0.5.3: Mar 10, 2026
- 0.5.2: Mar 9, 2026
- 0.5.1: Mar 8, 2026
- 0.5.0: Mar 7, 2026
- 0.4.7: Mar 5, 2026
- 0.4.6: Mar 3, 2026
- 0.4.5: Mar 2, 2026 (this version)
- 0.4.4: Feb 27, 2026
- 0.4.3: Feb 27, 2026
- 0.4.2: Feb 27, 2026
- 0.4.1: Feb 25, 2026
- 0.4.0: Feb 24, 2026
- 0.3.3: Feb 23, 2026
- 0.3.2: Feb 23, 2026
- 0.3.1: Feb 10, 2026
- 0.3.0: Feb 10, 2026
- 0.2.0: Nov 13, 2025
- 0.1.7: Oct 2, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

model_train_protocol-0.4.5.tar.gz (73.8 kB view details)

Uploaded Mar 2, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

model_train_protocol-0.4.5-py3-none-any.whl (59.2 kB view details)

Uploaded Mar 2, 2026 Python 3

File details

Details for the file model_train_protocol-0.4.5.tar.gz.

File metadata

Download URL: model_train_protocol-0.4.5.tar.gz
Upload date: Mar 2, 2026
Size: 73.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for model_train_protocol-0.4.5.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `911875fc7a495d7b26aabef63e8364a4a7798236b33f49daa27349724c58a736` |
| MD5 | `d2d0a4c5cce885da85bac87693efc5d1` |
| BLAKE2b-256 | `7f99cdb8779590581de2cb7460425d9e2ebd4d2df559323c9b48613799c37c03` |
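A downloaded archive can be checked against the published SHA256 digest with Python's hashlib. This is a general-purpose sketch, not a tool shipped with the package; the helper names and the local file path in the usage note are assumptions.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Return True if the file's SHA256 digest equals the published value."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash("model_train_protocol-0.4.5.tar.gz", "911875fc7a495d7b26aabef63e8364a4a7798236b33f49daa27349724c58a736")` should return True for an intact copy of the source distribution saved in the current directory.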

Provenance

The following attestation bundles were made for model_train_protocol-0.4.5.tar.gz:

Publisher: python-publish.yml on databiomes/modeltrainprotocol

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: model_train_protocol-0.4.5.tar.gz
- Subject digest: 911875fc7a495d7b26aabef63e8364a4a7798236b33f49daa27349724c58a736
- Sigstore transparency entry: 1011836288
- Sigstore integration time: Mar 2, 2026
Source repository:
- Permalink: databiomes/modeltrainprotocol@8fae5404ecbdefa478b1469ba77adc270eb38cb8
- Branch / Tag: refs/heads/main
- Owner: https://github.com/databiomes
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@8fae5404ecbdefa478b1469ba77adc270eb38cb8
- Trigger Event: workflow_dispatch

File details

Details for the file model_train_protocol-0.4.5-py3-none-any.whl.

File metadata

Download URL: model_train_protocol-0.4.5-py3-none-any.whl
Upload date: Mar 2, 2026
Size: 59.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for model_train_protocol-0.4.5-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `ea1b7342fe8f656e83898bb6d45f3c4f543880cddc8862b65848fa5ac3c88c10` |
| MD5 | `0011d9294e8d00caef52bf163fb2b1bd` |
| BLAKE2b-256 | `eaa62d7d22a524d17c708f79a8ecd2079de3c1d7ac0939917f84c4f9c31dcf28` |

Provenance

The following attestation bundles were made for model_train_protocol-0.4.5-py3-none-any.whl:

Publisher: python-publish.yml on databiomes/modeltrainprotocol

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: model_train_protocol-0.4.5-py3-none-any.whl
- Subject digest: ea1b7342fe8f656e83898bb6d45f3c4f543880cddc8862b65848fa5ac3c88c10
- Sigstore transparency entry: 1011836337
- Sigstore integration time: Mar 2, 2026
Source repository:
- Permalink: databiomes/modeltrainprotocol@8fae5404ecbdefa478b1469ba77adc270eb38cb8
- Branch / Tag: refs/heads/main
- Owner: https://github.com/databiomes
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@8fae5404ecbdefa478b1469ba77adc270eb38cb8
- Trigger Event: workflow_dispatch

model-train-protocol 0.4.5
