Skip to main content

Utilities for building code and document fine-tuning datasets.

Project description

fine_tuning_dataset_preparation

Comprehensive toolkit for turning codebases and Markdown documentation into fine-tuning datasets for LLMs. It ships a code pipeline (tree-sitter extraction, instruction generation, optional paraphrasing), a document pipeline (Markdown segmentation, exhaustive Q/A generation, multi-file support), shared exporters for Gemini/OpenAI/Q/A JSONL, and configurable prompts. Tests run on pytest to keep changes safe.

Installation

pip install -e .

Python 3.10+ is recommended. Before running pipelines, set your LLM credentials:

  • GOOGLE_API_KEY for Gemini (default)
  • or OPENAI_API_KEY / ANTHROPIC_API_KEY if you switch providers

Code pipeline

Create instruction datasets from a repository (single or multiple projects).

CLI (see examples/run_code_pipeline.py):

python examples/run_code_pipeline.py

Programmatic use:

from fine_tuning_dataset_preparation.code_dataset import PromptConfig
from fine_tuning_dataset_preparation.code_dataset.pipeline import code_pipeline

code_pipeline(
    project_path="path/to/repo",
    multi_project=True,                # treat subfolders as projects
    dataset_path="dataset.jsonl",
    llm_provider="gemini",
    model_name="gemini-2.5-flash",
    instruction_concurrency=8,
    instruction_temperature=0.7,
    prompt_config=PromptConfig(
        instruction_hint="Keep instructions concise and actionable.",
        paraphrase_hint="Return one alternative phrasing.",
    ),
    paraphrase_variations=1,           # optional paraphrasing
    paraphrase_temperature=0.9,
    exports=[
        {"target": "gemini", "output_path": "gemini_dataset.jsonl", "options": {"jsonl": True}},
        {"target": "openai", "output_path": "openai_dataset.jsonl", "options": {"jsonl": True}},
    ],
)

Key arguments: project_path, multi_project, instruction_concurrency, instruction_temperature, prompt_config, optional paraphrase_*, and exports with targets gemini or openai.

Document pipeline

Generate Q/A datasets from Markdown. You can point to a single file, a directory, or a list of files.

CLI (see examples/run_document_pipeline.py):

python examples/run_document_pipeline.py

Programmatic use:

from fine_tuning_dataset_preparation.document_dataset import run_document_pipeline
from fine_tuning_dataset_preparation.document_dataset.dataset import DocumentPromptConfig

run_document_pipeline(
    markdown_dir="docs",               # or markdown_path="file.md" or markdown_paths=["a.md", "b.md"]
    output_path="document_dataset.json",
    min_total_pairs=1,
    llm_provider="gemini",
    model_name="gemini-2.5-pro",
    prompt_config=DocumentPromptConfig(
        system_message="Use only the provided docs; end with attribution.",
        instructions=[
            "Use only supplied documentation fragments.",
            "If missing, say it is not specified.",
            "End every answer with the attribution line.",
        ],
        attribution="Information sourced from ACME Docs © 2025.",
    ),
    exports=[
        {"target": "qa_openai", "output_path": "doc_openai.jsonl", "options": {"jsonl": True}},
        {"target": "qa_gemini", "output_path": "doc_gemini.jsonl", "options": {"jsonl": True}},
        {"target": "qa_jsonl", "output_path": "doc_pairs.jsonl"},
    ],
)

Key arguments: markdown_path or markdown_dir or markdown_paths, min_total_pairs, prompt_config for tone/attribution, and exports with Q/A targets qa_openai, qa_gemini, qa_jsonl.

Exporters

All exporters live in fine_tuning_dataset_preparation/common/exporters. Use them directly or via the pipeline exports argument.

from fine_tuning_dataset_preparation.common.exporters import export_dataset

export_dataset(
    target="qa_openai",                 # gemini | openai | qa_openai | qa_gemini | qa_jsonl
    output_path="out.jsonl",
    pairs=[{"question": "...", "answer": "..."}],  # or instruction records when using instruction targets
    options={"jsonl": True},
)

Instruction targets: gemini, openai. Q/A targets: qa_openai, qa_gemini, qa_jsonl.

Project structure

  • fine_tuning_dataset_preparation/code_dataset: tree-sitter extraction, instruction generation, paraphrasing
  • fine_tuning_dataset_preparation/document_dataset: Markdown ingestion, Q/A generation, prompt helpers
  • fine_tuning_dataset_preparation/common: LLM utilities, text helpers, exporters
  • examples/: runnable scripts for code and document pipelines, plus export helper
  • tests/: pytest suite organized by domain

Testing

pytest
pytest --cov=. --cov-report=term-missing

Tips

  • Match provider extras to the model you choose; the pinned requirements already bring tree-sitter grammars.
  • Export your API key(s) before running examples to avoid partial or templated outputs.
  • Pick models by task: fast models (Gemini 2.5 Flash, small OpenAI tiers) for bulk coverage/paraphrases; higher-fidelity models (Gemini 2.5 Pro, GPT-4.1) for final passes or sensitive Q/A. Raise temperature (0.7–0.9) for paraphrasing; keep it lower (0.2–0.5) for deterministic instruction/Q/A generation.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fine_tuning_dataset_preparation-0.1.0.tar.gz (26.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

fine_tuning_dataset_preparation-0.1.0-py3-none-any.whl (36.3 kB view details)

Uploaded Python 3

File details

Details for the file fine_tuning_dataset_preparation-0.1.0.tar.gz.

File metadata

File hashes

Hashes for fine_tuning_dataset_preparation-0.1.0.tar.gz
Algorithm Hash digest
SHA256 6769ebb312887caa18aaf802a4dda83351974f287056c9805c1cb3d712a774fb
MD5 0a43e01ebeab250d82966f6ea3efe301
BLAKE2b-256 8309b9320b6a205d4ab44bd8160fdd3264d09703603b4dff5964d3b3a574876a

See more details on using hashes here.

File details

Details for the file fine_tuning_dataset_preparation-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for fine_tuning_dataset_preparation-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 3a07ff2355a2b943a41daadc2ea5b43e37ad4a3ba62978dc000bb162f17df8d2
MD5 4b0d80f92b4dd9d192a980aa324c68e5
BLAKE2b-256 9e470d146e4929332bfa72c4cecae81d01bb2563069245becb8218b8ec2e51d8

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page