GitHub issue-to-diff dataset pipeline for supervised fine-tuning

Project description

patch

patch is a Python pipeline for building supervised fine-tuning (SFT) datasets from real GitHub engineering history.

It collects closed GitHub issues and linked merged pull requests, extracts unified diffs, applies quality filters, formats examples into messages[] chat records, and publishes train/test splits to Hugging Face.

What this project does

  • Collects issue -> merged PR pairs from curated open-source repositories
  • Fetches review counts, changed files, and unified diffs
  • Applies quality filters to keep focused, learnable examples
  • Formats output for ChatML-style instruction fine-tuning
  • Merges and splits into deterministic train.jsonl / test.jsonl
  • Pushes a DatasetDict to Hugging Face Hub

Project layout

patch/
├── pyproject.toml
├── .env.example
├── Makefile
├── config/
│   └── repos.py
├── src/
│   └── patch/
│       ├── collect.py
│       ├── process.py
│       ├── merge.py
│       ├── push.py
│       ├── peek.py
│       ├── manifest.py
│       └── filters.py
├── data/
│   ├── raw/
│   ├── processed/
│   └── hf_upload/
└── scripts/
    ├── run_collect.py
    ├── run_process.py
    ├── run_merge.py
    ├── run_push.py
    └── run_peek.py

Requirements

  • Python 3.11+
  • uv
  • GitHub token with API access
  • Hugging Face account/token (or CLI login)

Setup

make sync
make init-env

Then edit .env with your values:

  • GITHUB_TOKEN
  • HF_REPO_ID
  • HF_TOKEN (optional if using hf auth login)
  • DATA_DIR (optional, defaults to ./data)
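How these values resolve can be sketched in Python. This is a hypothetical helper, not the project's actual config module; the required/optional split and the `./data` default follow the list above.

```python
from pathlib import Path

def load_settings(env: dict[str, str]) -> dict:
    """Resolve pipeline settings from an environment-style mapping.

    Sketch of the rules described above: GITHUB_TOKEN and HF_REPO_ID
    are required; HF_TOKEN falls back to cached CLI credentials;
    DATA_DIR defaults to ./data.
    """
    missing = [k for k in ("GITHUB_TOKEN", "HF_REPO_ID") if not env.get(k)]
    if missing:
        raise ValueError(f"missing required settings: {missing}")
    return {
        "github_token": env["GITHUB_TOKEN"],
        "hf_repo_id": env["HF_REPO_ID"],
        "hf_token": env.get("HF_TOKEN"),  # None -> use `hf auth login` credentials
        "data_dir": Path(env.get("DATA_DIR", "./data")),
    }
```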

Run the pipeline

All repos

make collect
make process
make merge
make push

Single repo

make collect-repo REPO=apache/arrow
make process-repo REPO=apache/arrow

One-command full run

make pipeline

Peek mode (GraphQL preview)

Use peek when you want to inspect GraphQL request structure and one sample issue/PR result without writing files.

make peek
  • Hardcoded target: pola-rs/polars
  • Prints a request example plus a sample normalized record to stdout
  • Does not write dataset data to disk
  • Includes full issue body, full PR body, changed files, and unified diff text in the sample

Output formats

Raw record (data/raw/*.jsonl)

Each line is a JSON object with fields such as:

  • repo, language, domain, license, collected_at
  • issue_number, issue_title, issue_body, issue_labels, issue_created_at
  • pr_number, pr_title, pr_body, pr_merged_at, base_branch, base_sha, merge_sha, closing_pr_confidence
  • diff, review_count, changed_files, additions, deletions, has_tests, test_files_changed
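The field list above can be expressed as a quick completeness check for parsed JSONL lines. Treating every field as required is an assumption of this sketch, not something the pipeline documents.

```python
# Field names taken from the raw-record list; whether any are optional
# is an assumption of this sketch.
REQUIRED_RAW_FIELDS = frozenset({
    "repo", "language", "domain", "license", "collected_at",
    "issue_number", "issue_title", "issue_body", "issue_labels", "issue_created_at",
    "pr_number", "pr_title", "pr_body", "pr_merged_at", "base_branch",
    "base_sha", "merge_sha", "closing_pr_confidence",
    "diff", "review_count", "changed_files", "additions", "deletions",
    "has_tests", "test_files_changed",
})

def missing_fields(record: dict) -> set[str]:
    """Report raw-record fields absent from one parsed JSONL line."""
    return set(REQUIRED_RAW_FIELDS) - set(record)
```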

Processed record (data/processed/*.jsonl)

Each line is formatted for instruction SFT:

  • messages: system, user, assistant
  • metadata: repo/language/domain identifiers plus issue/pr and provenance fields
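A minimal sketch of raw-to-processed formatting, assuming the issue text becomes the user turn and the diff becomes the assistant turn. The system prompt wording and metadata selection here are illustrative; the real process.py may differ.

```python
def to_chat_record(raw: dict) -> dict:
    """Format a raw issue/PR pair as a ChatML-style SFT example (sketch)."""
    return {
        "messages": [
            # Hypothetical system prompt; the project's actual wording is not shown.
            {"role": "system",
             "content": "You write a unified diff that resolves the GitHub issue."},
            {"role": "user",
             "content": f"{raw['issue_title']}\n\n{raw['issue_body']}"},
            {"role": "assistant", "content": raw["diff"]},
        ],
        "metadata": {
            "repo": raw["repo"],
            "language": raw["language"],
            "domain": raw["domain"],
            "issue_number": raw["issue_number"],
            "pr_number": raw["pr_number"],
        },
    }
```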

HF upload files (data/hf_upload/)

  • train.jsonl
  • test.jsonl
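One way the deterministic split mentioned above can work is to hash a stable key per record, so re-running merge never shuffles examples between splits. The key choice and test fraction here are assumptions, not the project's actual scheme.

```python
import hashlib

def split_name(repo: str, issue_number: int, test_fraction: float = 0.05) -> str:
    """Assign a record to train or test deterministically (sketch).

    Hashing (repo, issue_number) keeps the assignment stable across
    re-runs and independent of record order.
    """
    key = f"{repo}#{issue_number}".encode()
    # Map the first 8 digest bytes to a uniform float in [0, 1).
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"
```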

Quality filters

Records are kept only if they pass all checks:

  • issue body length >= 100 chars
  • changed diff lines between 5 and 500
  • review count >= 1
  • changed files <= 10
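The four gates above compose into a single predicate. In this sketch, "changed diff lines" is approximated as additions + deletions, and changed_files is assumed to be a count; the real filters.py may measure these differently.

```python
def passes_filters(record: dict) -> bool:
    """Keep a record only if it passes all four quality gates (sketch)."""
    diff_lines = record["additions"] + record["deletions"]  # assumed proxy
    return (
        len(record["issue_body"]) >= 100
        and 5 <= diff_lines <= 500
        and record["review_count"] >= 1
        and record["changed_files"] <= 10  # assumed to be a file count
    )
```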

Resumability and invariants

  • data/raw/ is append-only
  • collection is safe to re-run and deduplicates by issue number
  • data/manifest.json is the source of truth for per-repo progress
  • manifest writes are flushed during collection for crash safety
  • each stage runs independently from files on disk
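The append-only, dedup-by-issue-number behavior can be sketched as follows. This is a hypothetical helper operating on one raw JSONL file; the real collector also consults data/manifest.json, which this sketch omits.

```python
import json
from pathlib import Path

def append_new_records(path: Path, records: list[dict]) -> int:
    """Append only unseen issue numbers to a raw JSONL file (sketch).

    Re-running is safe: existing lines are never rewritten (append-only),
    and duplicates are skipped by issue_number.
    """
    seen: set[int] = set()
    if path.exists():
        with path.open() as f:
            seen = {json.loads(line)["issue_number"] for line in f if line.strip()}
    written = 0
    with path.open("a") as f:
        for rec in records:
            if rec["issue_number"] not in seen:
                f.write(json.dumps(rec) + "\n")
                seen.add(rec["issue_number"])
                written += 1
    return written
```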

Authentication notes

GitHub

  • GITHUB_TOKEN is required for collection and peek commands

Hugging Face

Push supports either auth path:

  1. Set HF_TOKEN in .env, or
  2. Run hf auth login and use cached CLI credentials

Helpful targets

make help
make check
  • make help: list all available commands
  • make check: compile Python files for quick syntax validation

Download files

Source distribution

  • patch_sft-0.1.0.tar.gz (125.1 kB)
  • Uploaded via: uv/0.7.7
  • SHA256: 7c20261c0e97301ca541ecf84a45b592a7c705572aa2dbc88e269d009c0767b4
  • MD5: cc1ff49ba26559387cdbb26dd833c74a
  • BLAKE2b-256: fa343929683cb032cec5a7ed60000c422baf6e99cb536b65d5c7b33ef33ec08f

Built distribution

  • patch_sft-0.1.0-py3-none-any.whl (21.8 kB)
  • Uploaded via: uv/0.7.7
  • SHA256: 6ddd1559b2e528f5e364a3f70f6b9d504fb3e97a82b57cb4e1669e2e74fa2959
  • MD5: 9303192c29ca14c59d574590f8cbea06
  • BLAKE2b-256: 0ca379fa49f7761f7f30b18d9a1d665d43dc8c8c5de66c303d1556ca6c9ae72c
