GitHub issue-to-diff dataset pipeline for supervised fine-tuning
Project description
patch
patch is a Python pipeline for building supervised fine-tuning (SFT) datasets from real GitHub engineering history.
It collects closed GitHub issues and linked merged pull requests, extracts unified diffs, applies quality filters, formats examples into messages[] chat records, and publishes train/test splits to Hugging Face.
What this project does
- Collects issue -> merged PR pairs from curated open-source repositories
- Fetches review counts, changed files, and unified diffs
- Applies quality filters to keep focused, learnable examples
- Formats output for ChatML-style instruction fine-tuning
- Merges and splits into deterministic
train.jsonl/test.jsonl - Pushes a
DatasetDictto Hugging Face Hub
Project layout
patch/
├── pyproject.toml
├── .env.example
├── Makefile
├── config/
│ └── repos.py
├── src/
│ └── patch/
│ ├── collect.py
│ ├── process.py
│ ├── merge.py
│ ├── push.py
│ ├── peek.py
│ ├── manifest.py
│ └── filters.py
├── data/
│ ├── raw/
│ ├── processed/
│ └── hf_upload/
└── scripts/
├── run_collect.py
├── run_process.py
├── run_merge.py
├── run_push.py
└── run_peek.py
Requirements
- Python 3.11+
- uv
- GitHub token with API access
- Hugging Face account/token (or CLI login)
Setup
make sync
make init-env
Then edit .env with your values:
GITHUB_TOKENHF_REPO_IDHF_TOKEN(optional if usinghf auth login)DATA_DIR(optional, defaults to./data)
Run the pipeline
All repos
make collect
make process
make merge
make push
Single repo
make collect-repo REPO=apache/arrow
make process-repo REPO=apache/arrow
One-command full run
make pipeline
Peek mode (GraphQL preview)
Use peek when you want to inspect GraphQL request structure and one sample issue/PR result without writing files.
make peek
- Hardcoded target:
pola-rs/polars - Prints a request example plus a sample normalized record to stdout
- Does not write dataset data to disk
- Includes full issue body, full PR body, changed files, and unified diff text in the sample
Output formats
Raw record (data/raw/*.jsonl)
Each line is a JSON object with fields such as:
repo,language,domain,license,collected_atissue_number,issue_title,issue_body,issue_labels,issue_created_atpr_number,pr_title,pr_body,pr_merged_at,base_branch,base_sha,merge_sha,closing_pr_confidencediff,review_count,changed_files,additions,deletions,has_tests,test_files_changed
Processed record (data/processed/*.jsonl)
Each line is formatted for instruction SFT:
messages:system,user,assistantmetadata: repo/language/domain identifiers plus issue/pr and provenance fields
HF upload files (data/hf_upload/)
train.jsonltest.jsonl
Quality filters
Records are kept only if they pass all checks:
- issue body length >= 100 chars
- changed diff lines between 5 and 500
- review count >= 1
- changed files <= 10
Resumability and invariants
data/raw/is append-only- collection is safe to re-run and deduplicates by issue number
data/manifest.jsonis the source of truth for per-repo progress- manifest writes are flushed during collection for crash safety
- each stage runs independently from files on disk
Authentication notes
GitHub
GITHUB_TOKENis required for collection and peek commands
Hugging Face
Push supports either auth path:
- Set
HF_TOKENin.env, or - Run
hf auth loginand use cached CLI credentials
Helpful targets
make help
make check
make help: list all available commandsmake check: compile Python files for quick syntax validation
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file patch_sft-0.1.0.tar.gz.
File metadata
- Download URL: patch_sft-0.1.0.tar.gz
- Upload date:
- Size: 125.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
7c20261c0e97301ca541ecf84a45b592a7c705572aa2dbc88e269d009c0767b4
|
|
| MD5 |
cc1ff49ba26559387cdbb26dd833c74a
|
|
| BLAKE2b-256 |
fa343929683cb032cec5a7ed60000c422baf6e99cb536b65d5c7b33ef33ec08f
|
File details
Details for the file patch_sft-0.1.0-py3-none-any.whl.
File metadata
- Download URL: patch_sft-0.1.0-py3-none-any.whl
- Upload date:
- Size: 21.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
6ddd1559b2e528f5e364a3f70f6b9d504fb3e97a82b57cb4e1669e2e74fa2959
|
|
| MD5 |
9303192c29ca14c59d574590f8cbea06
|
|
| BLAKE2b-256 |
0ca379fa49f7761f7f30b18d9a1d665d43dc8c8c5de66c303d1556ca6c9ae72c
|