Utilities for building code and document fine-tuning datasets.
Project description
fine_tuning_dataset_preparation
Comprehensive toolkit for turning codebases and Markdown documentation into fine-tuning datasets for LLMs. It ships a code pipeline (tree-sitter extraction, instruction generation, optional paraphrasing), a document pipeline (Markdown segmentation, exhaustive Q/A generation, multi-file support), shared exporters for Gemini/OpenAI/Q/A JSONL, and configurable prompts. Tests run on pytest to keep changes safe.
Installation
pip install -e .
Python 3.10+ is recommended. Before running pipelines, set your LLM credentials:
GOOGLE_API_KEYfor Gemini (default)- or
OPENAI_API_KEY/ANTHROPIC_API_KEYif you switch providers
Code pipeline
Create instruction datasets from a repository (single or multiple projects).
CLI (see examples/run_code_pipeline.py):
python examples/run_code_pipeline.py
Programmatic use:
from fine_tuning_dataset_preparation.code_dataset import PromptConfig
from fine_tuning_dataset_preparation.code_dataset.pipeline import code_pipeline
code_pipeline(
project_path="path/to/repo",
multi_project=True, # treat subfolders as projects
dataset_path="dataset.jsonl",
llm_provider="gemini",
model_name="gemini-2.5-flash",
instruction_concurrency=8,
instruction_temperature=0.7,
prompt_config=PromptConfig(
instruction_hint="Keep instructions concise and actionable.",
paraphrase_hint="Return one alternative phrasing.",
),
paraphrase_variations=1, # optional paraphrasing
paraphrase_temperature=0.9,
exports=[
{"target": "gemini", "output_path": "gemini_dataset.jsonl", "options": {"jsonl": True}},
{"target": "openai", "output_path": "openai_dataset.jsonl", "options": {"jsonl": True}},
],
)
Key arguments: project_path, multi_project, instruction_concurrency, instruction_temperature, prompt_config, optional paraphrase_*, and exports with targets gemini or openai.
Document pipeline
Generate Q/A datasets from Markdown. You can point to a single file, a directory, or a list of files.
CLI (see examples/run_document_pipeline.py):
python examples/run_document_pipeline.py
Programmatic use:
from fine_tuning_dataset_preparation.document_dataset import run_document_pipeline
from fine_tuning_dataset_preparation.document_dataset.dataset import DocumentPromptConfig
run_document_pipeline(
markdown_dir="docs", # or markdown_path="file.md" or markdown_paths=["a.md", "b.md"]
output_path="document_dataset.json",
min_total_pairs=1,
llm_provider="gemini",
model_name="gemini-2.5-pro",
prompt_config=DocumentPromptConfig(
system_message="Use only the provided docs; end with attribution.",
instructions=[
"Use only supplied documentation fragments.",
"If missing, say it is not specified.",
"End every answer with the attribution line.",
],
attribution="Information sourced from ACME Docs © 2025.",
),
exports=[
{"target": "qa_openai", "output_path": "doc_openai.jsonl", "options": {"jsonl": True}},
{"target": "qa_gemini", "output_path": "doc_gemini.jsonl", "options": {"jsonl": True}},
{"target": "qa_jsonl", "output_path": "doc_pairs.jsonl"},
],
)
Key arguments: markdown_path or markdown_dir or markdown_paths, min_total_pairs, prompt_config for tone/attribution, and exports with Q/A targets qa_openai, qa_gemini, qa_jsonl.
Exporters
All exporters live in fine_tuning_dataset_preparation/common/exporters. Use them directly or via the pipeline exports argument.
from fine_tuning_dataset_preparation.common.exporters import export_dataset
export_dataset(
target="qa_openai", # gemini | openai | qa_openai | qa_gemini | qa_jsonl
output_path="out.jsonl",
pairs=[{"question": "...", "answer": "..."}], # or instruction records when using instruction targets
options={"jsonl": True},
)
Instruction targets: gemini, openai. Q/A targets: qa_openai, qa_gemini, qa_jsonl.
Project structure
fine_tuning_dataset_preparation/code_dataset: tree-sitter extraction, instruction generation, paraphrasingfine_tuning_dataset_preparation/document_dataset: Markdown ingestion, Q/A generation, prompt helpersfine_tuning_dataset_preparation/common: LLM utilities, text helpers, exportersexamples/: runnable scripts for code and document pipelines, plus export helpertests/: pytest suite organized by domain
Testing
pytest
pytest --cov=. --cov-report=term-missing
Tips
- Match provider extras to the model you choose; the pinned requirements already bring tree-sitter grammars.
- Export your API key(s) before running examples to avoid partial or templated outputs.
- Pick models by task: fast models (Gemini 2.5 Flash, small OpenAI tiers) for bulk coverage/paraphrases; higher-fidelity models (Gemini 2.5 Pro, GPT-4.1) for final passes or sensitive Q/A. Raise temperature (0.7–0.9) for paraphrasing; keep it lower (0.2–0.5) for deterministic instruction/Q/A generation.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file fine_tuning_dataset_preparation-0.1.0.tar.gz.
File metadata
- Download URL: fine_tuning_dataset_preparation-0.1.0.tar.gz
- Upload date:
- Size: 26.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
6769ebb312887caa18aaf802a4dda83351974f287056c9805c1cb3d712a774fb
|
|
| MD5 |
0a43e01ebeab250d82966f6ea3efe301
|
|
| BLAKE2b-256 |
8309b9320b6a205d4ab44bd8160fdd3264d09703603b4dff5964d3b3a574876a
|
File details
Details for the file fine_tuning_dataset_preparation-0.1.0-py3-none-any.whl.
File metadata
- Download URL: fine_tuning_dataset_preparation-0.1.0-py3-none-any.whl
- Upload date:
- Size: 36.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3a07ff2355a2b943a41daadc2ea5b43e37ad4a3ba62978dc000bb162f17df8d2
|
|
| MD5 |
4b0d80f92b4dd9d192a980aa324c68e5
|
|
| BLAKE2b-256 |
9e470d146e4929332bfa72c4cecae81d01bb2563069245becb8218b8ec2e51d8
|