GitHub issue-to-diff dataset pipeline for supervised fine-tuning

Project description

patch

patch is a Python pipeline for building supervised fine-tuning (SFT) datasets from real GitHub engineering history.

It collects closed GitHub issues and linked merged pull requests, extracts unified diffs, applies quality filters, formats examples into messages[] chat records, and publishes train/test splits to Hugging Face.

What this project does

  • Collects issue -> merged PR pairs from curated open-source repositories
  • Fetches review counts, changed files, and unified diffs
  • Applies quality filters to keep focused, learnable examples
  • Formats output for ChatML-style instruction fine-tuning
  • Merges and splits into deterministic train.jsonl / test.jsonl
  • Pushes a DatasetDict to Hugging Face Hub

Project layout

patch/
├── pyproject.toml
├── .env.example
├── Makefile
├── config/
│   └── repos.py
├── src/
│   └── patch/
│       ├── collect.py
│       ├── process.py
│       ├── merge.py
│       ├── push.py
│       ├── peek.py
│       ├── manifest.py
│       └── filters.py
├── data/
│   ├── raw/
│   ├── processed/
│   └── hf_upload/
└── scripts/
    ├── run_collect.py
    ├── run_process.py
    ├── run_merge.py
    ├── run_push.py
    └── run_peek.py

Requirements

  • Python 3.11+
  • uv
  • GitHub token with API access
  • Hugging Face account/token (or CLI login)

Setup

make sync
make init-env

Then edit .env with your values:

  • GITHUB_TOKEN
  • HF_REPO_ID
  • HF_TOKEN (optional if using hf auth login)
  • DATA_DIR (optional, defaults to ./data)
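How these values resolve can be sketched in Python. This is a hypothetical helper, not the project's actual config module; the required/optional split and the `./data` default follow the list above.

```python
from pathlib import Path

def load_settings(env: dict[str, str]) -> dict:
    """Resolve pipeline settings from an environment-style mapping.

    Sketch of the rules described above: GITHUB_TOKEN and HF_REPO_ID
    are required; HF_TOKEN falls back to cached CLI credentials;
    DATA_DIR defaults to ./data.
    """
    missing = [k for k in ("GITHUB_TOKEN", "HF_REPO_ID") if not env.get(k)]
    if missing:
        raise ValueError(f"missing required settings: {missing}")
    return {
        "github_token": env["GITHUB_TOKEN"],
        "hf_repo_id": env["HF_REPO_ID"],
        "hf_token": env.get("HF_TOKEN"),  # None -> use `hf auth login` credentials
        "data_dir": Path(env.get("DATA_DIR", "./data")),
    }
```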

Run the pipeline

All repos

make collect
make process
make merge
make push

Single repo

make collect-repo REPO=apache/arrow
make process-repo REPO=apache/arrow

One-command full run

make pipeline

Peek mode (GraphQL preview)

Use peek when you want to inspect GraphQL request structure and one sample issue/PR result without writing files.

make peek
  • Hardcoded target: pola-rs/polars
  • Prints a request example plus a sample normalized record to stdout
  • Does not write dataset data to disk
  • Includes full issue body, full PR body, changed files, and unified diff text in the sample

Output formats

Raw record (data/raw/*.jsonl)

Each line is a JSON object with fields such as:

  • repo, language, domain, license, collected_at
  • issue_number, issue_title, issue_body, issue_labels, issue_created_at
  • pr_number, pr_title, pr_body, pr_merged_at, base_branch, base_sha, merge_sha, closing_pr_confidence
  • diff, review_count, changed_files, additions, deletions, has_tests, test_files_changed
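The field list above can be expressed as a quick completeness check for parsed JSONL lines. Treating every field as required is an assumption of this sketch, not something the pipeline documents.

```python
# Field names taken from the raw-record list; whether any are optional
# is an assumption of this sketch.
REQUIRED_RAW_FIELDS = frozenset({
    "repo", "language", "domain", "license", "collected_at",
    "issue_number", "issue_title", "issue_body", "issue_labels", "issue_created_at",
    "pr_number", "pr_title", "pr_body", "pr_merged_at", "base_branch",
    "base_sha", "merge_sha", "closing_pr_confidence",
    "diff", "review_count", "changed_files", "additions", "deletions",
    "has_tests", "test_files_changed",
})

def missing_fields(record: dict) -> set[str]:
    """Report raw-record fields absent from one parsed JSONL line."""
    return set(REQUIRED_RAW_FIELDS) - set(record)
```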

Processed record (data/processed/*.jsonl)

Each line is formatted for instruction SFT:

  • messages: system, user, assistant
  • metadata: repo/language/domain identifiers plus issue/pr and provenance fields
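A minimal sketch of raw-to-processed formatting, assuming the issue text becomes the user turn and the diff becomes the assistant turn. The system prompt wording and metadata selection here are illustrative; the real process.py may differ.

```python
def to_chat_record(raw: dict) -> dict:
    """Format a raw issue/PR pair as a ChatML-style SFT example (sketch)."""
    return {
        "messages": [
            # Hypothetical system prompt; the project's actual wording is not shown.
            {"role": "system",
             "content": "You write a unified diff that resolves the GitHub issue."},
            {"role": "user",
             "content": f"{raw['issue_title']}\n\n{raw['issue_body']}"},
            {"role": "assistant", "content": raw["diff"]},
        ],
        "metadata": {
            "repo": raw["repo"],
            "language": raw["language"],
            "domain": raw["domain"],
            "issue_number": raw["issue_number"],
            "pr_number": raw["pr_number"],
        },
    }
```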

HF upload files (data/hf_upload/)

  • train.jsonl
  • test.jsonl
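One way the deterministic split mentioned above can work is to hash a stable key per record, so re-running merge never shuffles examples between splits. The key choice and test fraction here are assumptions, not the project's actual scheme.

```python
import hashlib

def split_name(repo: str, issue_number: int, test_fraction: float = 0.05) -> str:
    """Assign a record to train or test deterministically (sketch).

    Hashing (repo, issue_number) keeps the assignment stable across
    re-runs and independent of record order.
    """
    key = f"{repo}#{issue_number}".encode()
    # Map the first 8 digest bytes to a uniform float in [0, 1).
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"
```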

Quality filters

Records are kept only if they pass all checks:

  • issue body length >= 100 chars
  • changed diff lines between 5 and 500
  • review count >= 1
  • changed files <= 10
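The four gates above compose into a single predicate. In this sketch, "changed diff lines" is approximated as additions + deletions, and changed_files is assumed to be a count; the real filters.py may measure these differently.

```python
def passes_filters(record: dict) -> bool:
    """Keep a record only if it passes all four quality gates (sketch)."""
    diff_lines = record["additions"] + record["deletions"]  # assumed proxy
    return (
        len(record["issue_body"]) >= 100
        and 5 <= diff_lines <= 500
        and record["review_count"] >= 1
        and record["changed_files"] <= 10  # assumed to be a file count
    )
```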

Resumability and invariants

  • data/raw/ is append-only
  • collection is safe to re-run and deduplicates by issue number
  • data/manifest.json is the source of truth for per-repo progress
  • manifest writes are flushed during collection for crash safety
  • each stage runs independently from files on disk
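The append-only, dedup-by-issue-number behavior can be sketched as follows. This is a hypothetical helper operating on one raw JSONL file; the real collector also consults data/manifest.json, which this sketch omits.

```python
import json
from pathlib import Path

def append_new_records(path: Path, records: list[dict]) -> int:
    """Append only unseen issue numbers to a raw JSONL file (sketch).

    Re-running is safe: existing lines are never rewritten (append-only),
    and duplicates are skipped by issue_number.
    """
    seen: set[int] = set()
    if path.exists():
        with path.open() as f:
            seen = {json.loads(line)["issue_number"] for line in f if line.strip()}
    written = 0
    with path.open("a") as f:
        for rec in records:
            if rec["issue_number"] not in seen:
                f.write(json.dumps(rec) + "\n")
                seen.add(rec["issue_number"])
                written += 1
    return written
```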

Authentication notes

GitHub

  • GITHUB_TOKEN is required for collection and peek commands

Hugging Face

Push supports either auth path:

  1. Set HF_TOKEN in .env, or
  2. Run hf auth login and use cached CLI credentials

Helpful targets

make help
make check
  • make help: list all available commands
  • make check: compile Python files for quick syntax validation

Download files

Source distribution

  • patch_sft-0.1.0.tar.gz (125.1 kB)
  • Uploaded via: uv/0.7.7
  • SHA256: 7c20261c0e97301ca541ecf84a45b592a7c705572aa2dbc88e269d009c0767b4
  • MD5: cc1ff49ba26559387cdbb26dd833c74a
  • BLAKE2b-256: fa343929683cb032cec5a7ed60000c422baf6e99cb536b65d5c7b33ef33ec08f

Built distribution

  • patch_sft-0.1.0-py3-none-any.whl (21.8 kB)
  • Uploaded via: uv/0.7.7
  • SHA256: 6ddd1559b2e528f5e364a3f70f6b9d504fb3e97a82b57cb4e1669e2e74fa2959
  • MD5: 9303192c29ca14c59d574590f8cbea06
  • BLAKE2b-256: 0ca379fa49f7761f7f30b18d9a1d665d43dc8c8c5de66c303d1556ca6c9ae72c
