Machine learning framework for building specialist models.
Project description
Osma is a framework that streamlines fine-tuning language models on data curated by larger, more capable teacher models, with the goal of producing specialist student models that outperform their teachers on a given task. It provides a structured approach to defining the curation process with signatures, generating high-quality training datasets, and fine-tuning local models. Osma is inspired by research from Stanford University's Natural Language Processing (NLP) Group and is in alpha development.
Features
- Dataset Management: Easy loading, manipulation, and saving of datasets.
- Structured Signatures: Define strict input/output schemas ensuring consistency in generated data.
- Teacher-Student Workflow: Use a managed model to curate training examples from raw data.
- Trainset Curation: Automatically generate reasoning and labels for your dataset.
- Filtering: Mechanisms to validate and filter generated data against ground truth or custom logic.
- Local Fine-Tuning: Seamlessly fine-tune local models using curated datasets.
- Evaluation: Tools to evaluate model performance against test sets.
Simple Example

```python
import osma
from typing import Literal

# Load and shuffle data
ds = osma.Dataset("data.jsonl").shuffle()

# Define the task signature with inputs and outputs
classes = Literal["positive", "negative"]
sg = osma.Signature(
    osma.InputFields("text"),
    osma.OutputField("sentiment", classes),
    reasoning=True,
)

# Initialize the teacher model
teacher = osma.LanguageModel("gemini/gemini-1.5-flash")

# Curate a training set
trainset = osma.Trainset(ds.range(0, 500), sg, teacher)
trainset.save("train.jsonl")

# Fine-tune a local student model
student = osma.LanguageModel("google/gemma-2-2b-it", provider=osma.ModelProvider.LOCAL)
student.train(trainset)

# Run inference
print(student(sg, text="I love this framework!"))
```
Installation

Using uv:

```shell
uv add osma
```

Using pip:

```shell
pip install osma
```
Environment Variables
To use Osma, you must export the necessary keys for the models you intend to use.

- HF_TOKEN: Required for accessing open-source models (student models).

When using Osma to curate a trainset, you will also need to specify the appropriate API key for the managed model's provider:

- GEMINI_API_KEY: Required if using Google Gemini as a teacher.
- OPENAI_API_KEY: Required if using OpenAI models as a teacher.

Note: this list is not exhaustive; consult your model provider's documentation for the appropriate API key.
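For example, in a POSIX shell (the values below are placeholders, not real keys):

```shell
# Placeholder values -- substitute your actual credentials
export HF_TOKEN="hf_your_token_here"
export GEMINI_API_KEY="your_gemini_key_here"
```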
Key Methods
Dataset

```python
ds = osma.Dataset("path/to/data.jsonl")  # initialize a dataset from a JSONL file
ds = ds.shuffle()                        # randomly shuffle the dataset rows
ds = ds.range(0, 100)                    # select a subset of rows by index range
ds = ds.head(5)                          # return the first n rows
ds.save("output.jsonl")                  # save the dataset to a file
```
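Dataset operates on JSON Lines files: one JSON object per line. Here is a minimal stdlib sketch of that round trip, purely as an illustration of the file format rather than Osma's implementation:

```python
import io
import json
import random

# Two rows in JSONL form, as they would appear in data.jsonl
raw = '{"text": "great product", "label": "positive"}\n' \
      '{"text": "total letdown", "label": "negative"}\n'

rows = [json.loads(line) for line in io.StringIO(raw)]  # load
random.shuffle(rows)                                    # shuffle in place
subset = rows[:1]                                       # analogous to head(1)
out = "".join(json.dumps(r) + "\n" for r in subset)     # save back to JSONL
```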
Signature

Define a task signature with input fields, output fields, and optional reasoning:

```python
sg = osma.Signature(
    osma.InputFields("input_col"),
    osma.OutputField("output_name", str),
    reasoning=True,
)
```
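The second argument to OutputField is a type. As in the Simple Example above, passing a typing.Literal constrains the label set; the allowed values can be recovered with standard typing machinery:

```python
from typing import Literal, get_args

# The label set for a classification-style output field
Sentiment = Literal["positive", "negative"]

allowed = get_args(Sentiment)
print(allowed)  # ('positive', 'negative')
```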
Trainset

```python
ts = osma.Trainset(ds, sg, teacher_model)  # curate a new trainset with a teacher model
ts = osma.Trainset("train.jsonl")          # load an existing trainset from a file
ts = ts.filter(ds, lambda x, y: x['field'] == y['field'])  # keep rows where generated and source data agree
ts.save("curated.jsonl")                   # save the trainset to a file
ts = ts.shuffle()                          # randomly shuffle the rows
ts = ts.range(0, 100)                      # select a subset by index range
ts = ts.head(5)                            # return the first n rows
```
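To see what a filter predicate does, here is the same comparison applied to plain dicts standing in for generated and source rows (a sketch; the actual row schema depends on your signature):

```python
generated = [{"sentiment": "positive"}, {"sentiment": "negative"}]
source    = [{"sentiment": "positive"}, {"sentiment": "positive"}]

match = lambda x, y: x["sentiment"] == y["sentiment"]

# Keep only rows where the teacher's label agrees with ground truth
kept = [g for g, s in zip(generated, source) if match(g, s)]
print(kept)  # [{'sentiment': 'positive'}]
```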
Model

```python
# Initialize a managed teacher model from a provider/model string
teacher = osma.LanguageModel("gemini/gemini-1.5-flash")

# Initialize a local student model
student = osma.LanguageModel("google/gemma-2b", provider=osma.ModelProvider.LOCAL)

# Generate output for a signature and specific input arguments
result = model(sg, text="example input")

# Fine-tune the local model on a curated trainset
student.train(trainset)

# Evaluate the model on a test dataset using a scoring function
results = student.evaluate(sg, eval_ds, lambda res, row: res['val'] == row['val'])
```
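The scoring function receives a model result and the reference row and returns a boolean; aggregating those booleans yields an accuracy. A plain-Python sketch with hypothetical row shapes:

```python
preds = [{"val": "a"}, {"val": "b"}, {"val": "c"}]
refs  = [{"val": "a"}, {"val": "x"}, {"val": "c"}]

score = lambda res, row: res["val"] == row["val"]

# Fraction of predictions the scoring function accepts
accuracy = sum(score(p, r) for p, r in zip(preds, refs)) / len(refs)
print(accuracy)  # 2 of 3 rows match
```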
File details

Details for the file osma_ai-0.1.0.tar.gz.

File metadata

- Download URL: osma_ai-0.1.0.tar.gz
- Upload date:
- Size: 16.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.9 (macOS)

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 00dc496e2ea26d4807b3860db11772a64c289e8e08e74ade0e0ddd265da563f4 |
| MD5 | dd2f82954cb66544905d107e62438c15 |
| BLAKE2b-256 | 94922bdb7afce71405445aa63f06711870483f6b0910384ecd0d6f6cf403a979 |
File details

Details for the file osma_ai-0.1.0-py3-none-any.whl.

File metadata

- Download URL: osma_ai-0.1.0-py3-none-any.whl
- Upload date:
- Size: 19.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.9 (macOS)

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 4f3a5d0fc2ad3f3d08fd44ec1ed7fc047269acf2900e3080b5958c412d191fdc |
| MD5 | 5b88a9732cff217508a20d565eb34249 |
| BLAKE2b-256 | 7e1da528c084e8a0d320dfe5b07befda139dc23cdb73402f04b0a17dab9175c0 |