# patch-sft

GitHub issue-to-diff dataset pipeline for supervised fine-tuning.

patch-sft is a Python pipeline for building supervised fine-tuning (SFT) datasets from real GitHub engineering history. It collects closed GitHub issues and their linked merged pull requests, extracts unified diffs, applies quality filters, formats examples into `messages[]` chat records, and publishes train/test splits to the Hugging Face Hub.
## Using the published dataset

If you just want the pre-built dataset, no GitHub token or local pipeline is needed:

```python
import patch

ds = patch.load("your-hf-username/patch-sft")
print(ds)
# DatasetDict({
#     train: Dataset({...}),
#     test: Dataset({...}),
# })

df = ds["train"].to_pandas()
```
Or install and use it as a dependency in another project:

```toml
# pyproject.toml
dependencies = ["patch-sft>=0.1.0"]
```

```python
import patch

# Download from the Hugging Face Hub
ds = patch.load("your-hf-username/patch-sft")

# Train split only
train_ds = patch.load("your-hf-username/patch-sft", split="train")
```
## Running your own collection

To collect from GitHub and build your own dataset:

### Requirements

- Python 3.11+
- uv
- GitHub personal access token
- Hugging Face account/token (for pushing)
### Setup

```shell
git clone https://github.com/your-username/patch-sft
cd patch-sft
make sync
make init-env
```

Edit `.env` with your values:

```shell
GITHUB_TOKEN=ghp_...      # required for collection
HF_REPO_ID=yourname/patch-sft
HF_TOKEN=hf_...           # optional if using `huggingface-cli login`
DATA_DIR=./data           # optional, defaults to ./data
PYPI_API_TOKEN=pypi-...   # only needed for `make publish`
```
### Run the pipeline

```shell
# All repos
make collect
make process
make merge
make push

# Single repo
make collect-repo REPO=apache/arrow
make process-repo REPO=apache/arrow

# Full pipeline in one shot
make pipeline
```
### Programmatic collection

```python
import patch
from patch import RepoConfig

patch.collect(
    repos=[
        RepoConfig("apache", "arrow", "python", "data-engineering"),
        RepoConfig("fastapi", "fastapi", "python", "web"),
    ],
    token="ghp_...",  # or set the GITHUB_TOKEN env var
    data_dir="./data",
)
```
### CLI

After installation, the `patch-sft` command is available:

```shell
patch-sft collect [--repo owner/repo]
patch-sft process [--repo owner/repo]
patch-sft merge
patch-sft push
patch-sft peek
```
### Peek mode

Preview the GraphQL request structure and one sample record without writing any files:

```shell
make peek
# or
patch-sft peek
```

The target is hardcoded to pola-rs/polars. Peek prints an example request and a normalized sample record to stdout.
## Project layout

```
patch-sft/
├── pyproject.toml
├── .env.example
├── Makefile
├── src/
│   └── patch/
│       ├── __init__.py   # public API: collect(), load(), RepoConfig, REPOS
│       ├── cli.py        # patch-sft CLI entry point
│       ├── repos.py      # curated repo list and RepoConfig
│       ├── collect.py    # GitHub GraphQL + REST collection
│       ├── process.py    # quality filters and SFT formatting
│       ├── merge.py      # train/test split → Parquet
│       ├── push.py       # Hugging Face Hub upload
│       ├── peek.py       # single-sample GraphQL preview
│       ├── manifest.py   # per-repo progress tracking
│       └── filters.py    # record quality filters
└── data/
    ├── raw/          # per-repo JSONL (append-only)
    ├── processed/    # filtered and formatted JSONL
    └── hf_upload/    # train.parquet / test.parquet
```
## Output formats

### Raw record (`data/raw/*.jsonl`)

Each line is a JSON object with these fields:

- `repo`, `language`, `domain`, `license`, `collected_at`
- `issue_number`, `issue_title`, `issue_body`, `issue_labels`, `issue_created_at`
- `pr_number`, `pr_title`, `pr_body`, `pr_merged_at`, `base_branch`, `base_sha`, `merge_sha`, `closing_pr_confidence`
- `diff`, `review_count`, `changed_files`, `additions`, `deletions`, `has_tests`, `test_files_changed`
Failed diff fetches are written to `data/raw/*.errors.jsonl` and retried automatically on the next run.
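Since each line is an independent JSON object, the raw files can be streamed without loading everything into memory. A minimal sketch of such a reader (the helper name is illustrative, not part of the package API):

```python
import json

def iter_raw_records(path):
    """Yield one raw record per non-empty JSONL line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # tolerate blank lines
                yield json.loads(line)
```

The same pattern works for the `*.errors.jsonl` files, which share the line-per-object layout.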
### Processed record (`data/processed/*.jsonl`)

Formatted for instruction SFT:

- `messages`: `[{role: system}, {role: user}, {role: assistant}]`
- `metadata`: repo/language/domain identifiers plus issue/PR provenance fields
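To make the shape concrete, here is a hedged sketch of how a raw record could be turned into such a chat example. The system/user prompt wording and the `to_sft_record` name are illustrative assumptions, not the exact templates used by `process.py`:

```python
def to_sft_record(raw: dict) -> dict:
    """Convert one raw issue/PR record into a chat-format SFT example."""
    messages = [
        {"role": "system",
         "content": "You are a software engineer. Write a unified diff that resolves the issue."},
        {"role": "user",
         "content": f"# {raw['issue_title']}\n\n{raw['issue_body']}"},
        # The merged PR's diff is the supervision target
        {"role": "assistant", "content": raw["diff"]},
    ]
    metadata = {key: raw[key]
                for key in ("repo", "language", "domain", "issue_number", "pr_number")}
    return {"messages": messages, "metadata": metadata}
```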
### HF upload (`data/hf_upload/`)

- `train.parquet`
- `test.parquet`
## Quality filters

Records are kept only if they pass all of the following checks:

- Issue body length ≥ 100 characters
- Changed diff lines between 5 and 500
- At least 1 PR review
- Changed files ≤ 10
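The four checks above can be sketched as a single predicate over the raw-record fields. This is an illustrative reimplementation, not the code from `filters.py`; in particular, counting changed diff lines as `additions + deletions` is an assumption:

```python
def passes_quality_filters(record: dict) -> bool:
    """Return True only if a record clears every quality check."""
    diff_lines = record["additions"] + record["deletions"]  # assumed definition
    return (
        len(record["issue_body"]) >= 100   # issue body long enough to be actionable
        and 5 <= diff_lines <= 500         # diff neither trivial nor sprawling
        and record["review_count"] >= 1    # at least one PR review
        and record["changed_files"] <= 10  # focused change
    )
```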
## Resumability

- `data/raw/` is append-only; collection deduplicates by issue number
- `data/manifest.json` tracks per-repo progress and is flushed during collection
- Each pipeline stage is independent and reads from files on disk
- Re-running collection is safe: already-collected issues are skipped
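The dedup-by-issue-number idea amounts to scanning the append-only file for issue numbers already on disk before fetching anything new. A minimal sketch, assuming the raw JSONL layout described above (the helper name is hypothetical):

```python
import json

def seen_issue_numbers(raw_path) -> set:
    """Issue numbers already present in an append-only raw JSONL file."""
    seen = set()
    try:
        with open(raw_path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    seen.add(json.loads(line)["issue_number"])
    except FileNotFoundError:
        pass  # first run: nothing collected yet
    return seen
```

On a re-run, any issue whose number is in this set is skipped, which is what makes repeated `make collect` invocations safe.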
## Publishing to PyPI

```shell
make publish
```

Requires either `~/.pypirc` with a valid PyPI token (username `__token__`, password `pypi-...`) or `PYPI_API_TOKEN` set in your environment.
## Download files
### patch_sft-0.1.1.tar.gz (source distribution)

- Size: 125.6 kB
- Uploaded via: uv/0.7.7
- Uploaded using Trusted Publishing: No

| Algorithm | Hash digest |
|---|---|
| SHA256 | `fef6c8fe52b9ec9e0a04c5e52c5721b87f25bbd40d7ac5e4a8ad2e856fb4e56d` |
| MD5 | `6b0d6bcc46a88f06ac10886ffb864ccb` |
| BLAKE2b-256 | `f57332f2a8678a3179e007937fe85c49df3389e804495ad18591bc6c4ce4286a` |
### patch_sft-0.1.1-py3-none-any.whl (built distribution, Python 3)

- Size: 22.2 kB
- Uploaded via: uv/0.7.7
- Uploaded using Trusted Publishing: No

| Algorithm | Hash digest |
|---|---|
| SHA256 | `94c2d8163ffcb8a1bc3ed135d117320c46d83601b26efebb20f1be0faf864fbf` |
| MD5 | `1162864b9675862b40ef50716dd84dcf` |
| BLAKE2b-256 | `fa3ca5ed9776e5d245c4fff88dc28f19cb5a540ad0d0a83d63e14fa70d7b358e` |