text-albumentations
Structured synthetic text data generation for SFT and distillation.
text-albumentations is a synthetic data generation engine for text.
The goal is to help generate instruction-tuning and distillation datasets from existing text corpora by applying structured augmentations over passages.
It is built for the practical case where good supervised fine-tuning requires more examples than you already have, and where synthetic generation is one of the fastest ways to create task-shaped training data from raw documents.
Why This Exists
Modern LLM workflows often need:
- synthetic SFT data
- task-specific distillation data
- multiple renderings of the same semantic content
- structured supervision generated from long-form text
If you already have large amounts of text, you can usually derive many useful supervision targets from it:
- bullet-point summaries
- QA pairs
- rephrasings
- continuation tasks
- retrieval examples
- comparisons
- knowledge graph triplets
Instead of treating synthetic data generation as one giant prompt, this project breaks it into explicit, composable pieces.
Ideology
The core idea is:
structured generation + simple priors -> dataset
Structured generation gives you typed intermediate outputs using Pydantic schemas.
Simple priors give you the task shape:
- "extract bullets"
- "produce QA pairs"
- "find the answering passage"
- "serialize the response as markdown/json/etc"
That combination is easier to reason about than unstructured free-form prompting. It also makes the pipeline more extensible: you can swap prompts, schemas, response formats, runtimes, and adapters without rewriting the whole system.
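As a dependency-free sketch of that idea (the class and field names here are illustrative, not the project's actual API): a typed output schema plus a small task prior compose into one generation request.

```python
import json
from dataclasses import dataclass, field

# Illustrative stand-in for a typed output schema (the project uses Pydantic).
@dataclass
class BulletOutput:
    bullets: list[str] = field(default_factory=list)

# A "simple prior" is just the task shape expressed as an instruction.
BULLET_PRIOR = "Extract the key points of the passage as short bullets."

def build_request(passage: str, prior: str, schema: type) -> dict:
    """Compose the task prior, the passage, and the schema into one request."""
    return {
        "system": prior,
        "user": passage,
        # Constrained decoding would enforce this structure at generation time.
        "schema_fields": list(schema.__dataclass_fields__),
    }

request = build_request(
    "The Transformer replaces recurrence with attention.",
    BULLET_PRIOR,
    BulletOutput,
)
print(json.dumps(request, indent=2))
```

Swapping the prior or the schema changes the task without touching the rest of the pipeline, which is the extensibility point made above.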
Current Capabilities
The project currently supports:
- single-chunk augmentations
- multi-chunk augmentations
- batched augmentation execution for many passages with one shared schema
- typed structured outputs with Pydantic
- Alpaca-format dataset generation
- response-format control for the Alpaca output field
- sync and async generation runtimes
- Outlines-backed local models
- Outlines-backed OpenAI models
- long-text ingestion with fixed-size character chunking
- JSONL dataset writing
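For orientation, Alpaca-format rows follow the standard instruction/input/output convention, and JSONL is one JSON object per line. A stdlib sketch (the project's actual row model may carry extra fields):

```python
import json

# Rows in the standard Alpaca convention: instruction / input / output.
rows = [
    {
        "instruction": "Summarize the passage as bullets.",
        "input": "The Transformer replaces recurrence with attention.",
        "output": "- Replaces recurrence with attention\n- Improves parallelization",
    },
]

# JSONL: one JSON object per line, so datasets can be streamed row by row.
with open("out.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

with open("out.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```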
Built-in augmentations:
| Augmentation | Type | What it generates |
|---|---|---|
| bullets | Single chunk | Extracts key points from a passage and renders them as bullet-style outputs. |
| qa_pairs | Single chunk | Produces question-answer pairs grounded in one passage. |
| rephrase | Single chunk | Rewrites a passage into a clearer or more elaborated version without changing meaning. |
| continuation | Single chunk | Produces continuation-style completions derived from the passage. |
| triplets | Single chunk | Extracts subject-relation-object knowledge graph triplets. |
| comparison | Multi chunk | Compares two passages and generates a structured comparison. |
| retrieval | Multi chunk | Builds retrieval-style supervision by pairing questions with the passage that answers them, or with no-answer cases. |
Architecture
The main abstractions are:
- BaseSingleChunkAugmentation and BaseMultiChunkAugmentation define the task contract: schema, prompt, response formats, generation knobs, and dataset construction.
- BaseResponseFormat controls how the Alpaca output field should be represented and can also modify the system prompt with format-specific instructions.
- BaseAlpacaAdapter converts typed structured outputs into Alpaca rows.
- ModelRuntime is the model execution interface. Current implementations support local Outlines models and OpenAI-through-Outlines models.
- AugmentationRunner binds together the input data, a runtime, and an augmentation.
Usage
Minimal Local Example
import mlx_lm
import outlines
from text_albumentations import OutlinesModel, run_augmentation
from text_albumentations.tasks.bullets import bullet_augmentation
model = outlines.from_mlxlm(*mlx_lm.load("mlx-community/Qwen3.5-4B-OptiQ-4bit"))
runtime = OutlinesModel(model=model)
rows = run_augmentation(
    "The Transformer replaces recurrence with attention and improves parallelization.",
    bullet_augmentation,
    runtime,
)
for row in rows:
    print(row.model_dump_json())
See examples/minimal.py.
OpenAI Sync
import openai
import outlines
from text_albumentations import OutlinesModel, run_augmentation
from text_albumentations.tasks.bullets import bullet_augmentation
model = outlines.from_openai(openai.OpenAI(), "gpt-5.4-nano")
runtime = OutlinesModel(model, max_tokens_parameter="max_completion_tokens")
rows = run_augmentation("some passage", bullet_augmentation, runtime)
OpenAI Async
import asyncio
import openai
import outlines
from text_albumentations import OutlinesModel, arun_augmentation
from text_albumentations.tasks.bullets import bullet_augmentation
async def main():
    model = outlines.from_openai(openai.AsyncOpenAI(), "gpt-5.4-nano")
    runtime = OutlinesModel(
        model,
        async_mode=True,
        total_concurrent_calls=4,
        max_tokens_parameter="max_completion_tokens",
    )
    rows = await arun_augmentation("some passage", bullet_augmentation, runtime)
    print(len(rows))

asyncio.run(main())
Transformers Local Model
import outlines
from transformers import AutoModelForCausalLM, AutoTokenizer
from text_albumentations import OutlinesModel, run_augmentation
from text_albumentations.tasks.bullets import bullet_augmentation
hf_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-1b-it",
    torch_dtype="auto",
    device_map="auto",
)
hf_tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")
model = outlines.from_transformers(hf_model, hf_tokenizer)
runtime = OutlinesModel(model, max_tokens_parameter="max_new_tokens")
rows = run_augmentation("some passage", bullet_augmentation, runtime)
See the examples/ directory for the current Transformers examples.
Batch Augmentation Over Multiple Passages
import outlines
from transformers import AutoModelForCausalLM, AutoTokenizer
from text_albumentations import OutlinesModel, run_batch_augmentation
from text_albumentations.tasks.bullets import BulletAugmentation
hf_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-1b-it",
    torch_dtype="auto",
    device_map="auto",
)
hf_tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")
model = outlines.from_transformers(hf_model, hf_tokenizer)
runtime = OutlinesModel(model, max_tokens_parameter="max_new_tokens")
augmentation = BulletAugmentation(max_tokens=128, variations=0)
rows = run_batch_augmentation(
    [
        "The Transformer replaces recurrence with attention and improves parallelization.",
        "Outlines constrains generation so outputs match the expected structure.",
        "Synthetic supervision can be derived from raw documents with task-shaped prompts.",
        "Batch decoding is useful when many passages share the same schema and augmentation.",
    ],
    augmentation,
    runtime,
)
See examples/batch_augmentation.py.
Long Text To JSONL
import openai
import outlines
from text_albumentations import OutlinesModel, save_long_text_dataset
from text_albumentations.tasks.bullets import bullet_augmentation
model = outlines.from_openai(openai.OpenAI(), "gpt-5.4-nano")
runtime = OutlinesModel(model, max_tokens_parameter="max_completion_tokens")
long_text = "..."  # your source document; any long string works
save_long_text_dataset(
    text=long_text,
    output_jsonl="out.jsonl",
    augmentation=bullet_augmentation,
    runtime=runtime,
    chunk_size_chars=300,
)
See examples/long_text_to_jsonl.py.
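Fixed-size character chunking itself is simple; a stdlib sketch of the behavior implied by chunk_size_chars (the library's actual splitter may differ, e.g. in how it treats the final short chunk):

```python
def chunk_text(text: str, chunk_size_chars: int) -> list[str]:
    """Split text into consecutive fixed-size character windows.

    The final chunk may be shorter than chunk_size_chars.
    """
    if chunk_size_chars <= 0:
        raise ValueError("chunk_size_chars must be positive")
    return [
        text[i:i + chunk_size_chars]
        for i in range(0, len(text), chunk_size_chars)
    ]

chunks = chunk_text("a" * 650, 300)
print([len(c) for c in chunks])  # [300, 300, 50]
```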
Multiple Augmentations Over The Same Passage
import openai
import outlines
from text_albumentations import OutlinesModel, run_augmentation
from text_albumentations.tasks.bullets import bullet_augmentation
from text_albumentations.tasks.rephrase import rephrase_augmentation
model = outlines.from_openai(openai.OpenAI(), "gpt-5.4-nano")
runtime = OutlinesModel(model, max_tokens_parameter="max_completion_tokens")
rows = []
rows.extend(run_augmentation("some passage", bullet_augmentation, runtime))
rows.extend(run_augmentation("some passage", rephrase_augmentation, runtime))
See examples/multiple_augmentations.py.
Custom Preprocessing Model
You can also make the augmentation input itself be a custom Pydantic model instead of a raw string.
See examples/custom_preprocessing.py.
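A sketch of what a structured input buys you, using a dataclass as a dependency-free stand-in for the Pydantic model (the field names and the exact build_user_message signature are assumptions, not the project's API):

```python
from dataclasses import dataclass

# Stand-in for a custom Pydantic input model; a dataclass keeps the
# sketch dependency-free, but the project itself uses Pydantic.
@dataclass
class PassageWithMetadata:
    text: str
    source: str
    section: str

def build_user_message(inp: PassageWithMetadata) -> str:
    # A structured input lets the prompt template use typed fields
    # instead of a single raw string.
    return f"[{inp.source} / {inp.section}]\n{inp.text}"

msg = build_user_message(PassageWithMetadata(
    text="Attention scales as O(n^2) in sequence length.",
    source="paper.pdf",
    section="3.1",
))
```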
Extensibility
The project is designed so users can extend it in layers.
1. Add A New Augmentation
Subclass one of:
- BaseSingleChunkAugmentation
- BaseMultiChunkAugmentation
Define:
- a Pydantic schema
- a system prompt
- build_user_message(...)
- one or more response formats
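A dependency-free sketch of that contract; the real base class comes from the library, so the attribute names here (schema, system_prompt) are assumptions:

```python
from dataclasses import dataclass, field

# Stand-in for the typed Pydantic schema the augmentation would emit.
@dataclass
class TitleOutput:
    titles: list[str] = field(default_factory=list)

# Stand-in for a BaseSingleChunkAugmentation subclass: the contract is
# a schema, a system prompt, and a user-message builder.
class TitleAugmentation:
    schema = TitleOutput
    system_prompt = "Propose short candidate titles for the passage."

    def build_user_message(self, passage: str) -> str:
        return f"Passage:\n{passage}\n\nReturn candidate titles."

aug = TitleAugmentation()
message = aug.build_user_message("Attention is all you need.")
```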
2. Add A New Response Format
Subclass BaseResponseFormat if you want to control:
- how the format modifies the system prompt
- how the final Alpaca output field is rendered
For common Alpaca row generation, AlpacaResponseFormat is usually enough.
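A sketch of those two responsibilities side by side; the class and method names here are illustrative, not the BaseResponseFormat interface itself:

```python
import json

# Responsibility 1: add format-specific instructions to the system prompt.
# Responsibility 2: render the typed output into the final Alpaca output string.
class MarkdownBulletFormat:
    instruction = "Answer as a markdown bullet list."

    def patch_system_prompt(self, prompt: str) -> str:
        return f"{prompt}\n{self.instruction}"

    def render(self, bullets: list[str]) -> str:
        return "\n".join(f"- {b}" for b in bullets)

class JsonFormat:
    def patch_system_prompt(self, prompt: str) -> str:
        return f"{prompt}\nAnswer as a JSON object."

    def render(self, bullets: list[str]) -> str:
        return json.dumps({"bullets": bullets})

fmt = MarkdownBulletFormat()
rendered = fmt.render(["uses attention", "parallelizes well"])
```

The same structured output can thus be serialized two different ways, which is how one schema yields multiple dataset renderings.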
3. Add A New Adapter
Subclass BaseAlpacaAdapter to convert a typed structured output into one or more Alpaca rows.
One structured output can expand into multiple rows.
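For example, a single QA-pairs output fanning out into one Alpaca row per pair; the dataclasses stand in for the library's Pydantic models and the adapter function name is hypothetical:

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str

@dataclass
class QAPairsOutput:
    pairs: list[QAPair]

def to_alpaca_rows(output: QAPairsOutput, passage: str) -> list[dict]:
    # One typed output -> several Alpaca rows, one per QA pair.
    return [
        {"instruction": pair.question, "input": passage, "output": pair.answer}
        for pair in output.pairs
    ]

structured = QAPairsOutput(pairs=[
    QAPair("What replaces recurrence?", "Attention."),
    QAPair("What improves?", "Parallelization."),
])
qa_rows = to_alpaca_rows(
    structured, "The Transformer replaces recurrence with attention."
)
```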
4. Add A New Runtime
Implement ModelRuntime if you want to support a new backend.
That keeps model execution separate from:
- augmentation semantics
- prompt construction
- dataset adapters
- response serialization
This separation is intentional. The project should let you swap the model layer without rewriting the dataset logic.
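That seam can be sketched as a Protocol: the pipeline only depends on "prompt in, text out", so a toy backend can stand in for a real model. The generate signature here is an assumption, not the actual ModelRuntime interface:

```python
from typing import Protocol

class Runtime(Protocol):
    # Hypothetical shape of the runtime seam; the real ModelRuntime
    # signature may differ.
    def generate(self, system: str, user: str, max_tokens: int) -> str: ...

class EchoRuntime:
    """Toy backend for exercising the pipeline without a real model."""
    def generate(self, system: str, user: str, max_tokens: int) -> str:
        return user[:max_tokens]

def run(runtime: Runtime, passage: str) -> str:
    # Everything above this call (prompts, adapters, serialization)
    # stays identical no matter which backend is plugged in.
    return runtime.generate("Summarize.", passage, max_tokens=32)

result = run(EchoRuntime(), "Attention replaces recurrence.")
```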
Philosophy On Synthetic Data
This project does not assume synthetic data is magic.
It assumes:
- synthetic data works best when the task shape is explicit
- typed intermediate representations are easier to control
- simple priors beat vague giant prompts
- extensibility matters because different teams want different schemas, formats, and runtimes
The aim is not "generate random data."
The aim is to turn raw text into useful supervision signals for SFT and distillation in a way that is structured, inspectable, and easy to extend.