
GitHub issue-to-diff dataset pipeline for supervised fine-tuning


patch-sft

patch-sft is a Python pipeline for building supervised fine-tuning (SFT) datasets from real GitHub engineering history.

It collects closed GitHub issues and linked merged pull requests, extracts unified diffs, applies quality filters, formats examples into messages[] chat records, and publishes train/test splits to Hugging Face.

Using the published dataset

If you just want the pre-built dataset, no GitHub token or local pipeline is needed:

import patch

ds = patch.load("your-hf-username/patch-sft")
print(ds)
# DatasetDict({
#     train: Dataset({...}),
#     test:  Dataset({...}),
# })

df = ds["train"].to_pandas()

Or install and use it as a dependency in another project:

# pyproject.toml
dependencies = ["patch-sft>=0.1.0"]

import patch

# Download from HuggingFace Hub
ds = patch.load("your-hf-username/patch-sft")

# Train split only
train_ds = patch.load("your-hf-username/patch-sft", split="train")

Running your own collection

To collect from GitHub and build your own dataset:

Requirements

  • Python 3.11+
  • uv
  • GitHub personal access token
  • Hugging Face account/token (for pushing)

Setup

git clone https://github.com/your-username/patch-sft
cd patch-sft
make sync
make init-env

Edit .env with your values:

GITHUB_TOKEN=ghp_...        # required for collection
HF_REPO_ID=yourname/patch-sft
HF_TOKEN=hf_...             # optional if using `huggingface-cli login`
DATA_DIR=./data             # optional, defaults to ./data
PYPI_API_TOKEN=pypi-...     # only needed for `make publish`
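As a sketch of how these variables might be consumed (the actual config loading inside patch-sft may differ), the pipeline can read them from the environment with the defaults noted above:

```python
import os

# Hypothetical config reader; the keys mirror the .env entries above,
# but patch-sft's real loading logic may be structured differently.
def read_config(env=os.environ):
    token = env.get("GITHUB_TOKEN")
    if not token:
        raise RuntimeError("GITHUB_TOKEN is required for collection")
    return {
        "github_token": token,
        "hf_repo_id": env.get("HF_REPO_ID"),
        "hf_token": env.get("HF_TOKEN"),          # optional with huggingface-cli login
        "data_dir": env.get("DATA_DIR", "./data"),  # optional, defaults to ./data
    }
```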

Run the pipeline

# All repos
make collect
make process
make merge
make push

# Single repo
make collect-repo REPO=apache/arrow
make process-repo REPO=apache/arrow

# Full pipeline in one shot
make pipeline

Programmatic collection

import patch
from patch import RepoConfig

patch.collect(
    repos=[
        RepoConfig("apache", "arrow", "python", "data-engineering"),
        RepoConfig("fastapi", "fastapi", "python", "web"),
    ],
    token="ghp_...",      # or set GITHUB_TOKEN env var
    data_dir="./data",
)

CLI

After installation, the patch-sft command is available:

patch-sft collect [--repo owner/repo]
patch-sft process [--repo owner/repo]
patch-sft merge
patch-sft push
patch-sft peek
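A minimal sketch of how a subcommand layout like this could be wired with argparse (the real cli.py may be organized differently):

```python
import argparse

# Hypothetical dispatcher mirroring the subcommands above; only collect
# and process accept the optional --repo owner/repo flag.
def build_parser():
    parser = argparse.ArgumentParser(prog="patch-sft")
    sub = parser.add_subparsers(dest="command", required=True)
    for name in ("collect", "process"):
        p = sub.add_parser(name)
        p.add_argument("--repo", help="limit the stage to a single owner/repo")
    for name in ("merge", "push", "peek"):
        sub.add_parser(name)
    return parser
```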

Peek mode

Preview the GraphQL request structure and one sample record without writing any files:

make peek
# or
patch-sft peek

The target is hardcoded to pola-rs/polars. It prints a request example and a normalized sample record to stdout.
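The kind of GraphQL request being previewed might look like the following. The query is illustrative, not patch-sft's exact field selection, though the `ClosedEvent`/`closer` pattern is how GitHub's GraphQL schema links a closed issue to its closing pull request:

```python
# Illustrative GraphQL payload for fetching closed issues with their
# closing PRs; the exact fields patch-sft requests are an assumption.
QUERY = """
query($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) {
    issues(states: CLOSED, first: 50) {
      nodes {
        number
        title
        timelineItems(itemTypes: CLOSED_EVENT, first: 1) {
          nodes { ... on ClosedEvent { closer { ... on PullRequest { number } } } }
        }
      }
    }
  }
}
"""

def build_request(owner, name):
    # Standard GraphQL-over-HTTP request body: query plus variables.
    return {"query": QUERY, "variables": {"owner": owner, "name": name}}
```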

Project layout

patch-sft/
├── pyproject.toml
├── .env.example
├── Makefile
├── src/
│   └── patch/
│       ├── __init__.py     # public API: collect(), load(), RepoConfig, REPOS
│       ├── cli.py          # patch-sft CLI entry point
│       ├── repos.py        # curated repo list and RepoConfig
│       ├── collect.py      # GitHub GraphQL + REST collection
│       ├── process.py      # quality filters and SFT formatting
│       ├── merge.py        # train/test split → Parquet
│       ├── push.py         # HuggingFace Hub upload
│       ├── peek.py         # single-sample GraphQL preview
│       ├── manifest.py     # per-repo progress tracking
│       └── filters.py      # record quality filters
└── data/
    ├── raw/                # per-repo JSONL (append-only)
    ├── processed/          # filtered and formatted JSONL
    └── hf_upload/          # train.parquet / test.parquet

Output formats

Raw record (data/raw/*.jsonl)

Each line is a JSON object:

  • repo, language, domain, license, collected_at
  • issue_number, issue_title, issue_body, issue_labels, issue_created_at
  • pr_number, pr_title, pr_body, pr_merged_at, base_branch, base_sha, merge_sha, closing_pr_confidence
  • diff, review_count, changed_files, additions, deletions, has_tests, test_files_changed

Failed diff fetches are written to data/raw/*.errors.jsonl and retried automatically on the next run.
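Put together, a raw record line might look like this. Every value below is made up for illustration, not taken from a real repository:

```python
import json

# Illustrative raw record matching the field list above; all values invented.
raw = {
    "repo": "example/project", "language": "python", "domain": "web",
    "license": "MIT", "collected_at": "2024-01-01T00:00:00Z",
    "issue_number": 123, "issue_title": "Fix crash on empty input",
    "issue_body": "Calling parse() with an empty string raises an exception.",
    "issue_labels": ["bug"], "issue_created_at": "2023-12-01T00:00:00Z",
    "pr_number": 124, "pr_title": "Handle empty input in parse()",
    "pr_body": "Closes #123", "pr_merged_at": "2023-12-02T00:00:00Z",
    "base_branch": "main", "base_sha": "abc123", "merge_sha": "def456",
    "closing_pr_confidence": 1.0,
    "diff": "--- a/parser.py\n+++ b/parser.py\n@@ -1,3 +1,4 @@\n",
    "review_count": 2, "changed_files": 1,
    "additions": 4, "deletions": 1,
    "has_tests": True, "test_files_changed": 1,
}
line = json.dumps(raw)  # one JSON object per JSONL line
```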

Processed record (data/processed/*.jsonl)

Formatted for instruction SFT:

  • messages: [{role: system}, {role: user}, {role: assistant}]
  • metadata: repo/language/domain identifiers plus issue/PR provenance fields
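A sketch of what that formatting step could look like. The system prompt wording and the exact metadata fields are assumptions, not process.py's actual output:

```python
# Hypothetical formatter from a raw record to a messages[] chat record;
# the real prompt text and metadata layout may differ.
def to_sft_record(raw):
    return {
        "messages": [
            {"role": "system",
             "content": "You are a software engineer. Produce a unified diff that resolves the issue."},
            {"role": "user",
             "content": f"Repository: {raw['repo']}\n\nIssue: {raw['issue_title']}\n\n{raw['issue_body']}"},
            {"role": "assistant", "content": raw["diff"]},
        ],
        "metadata": {
            "repo": raw["repo"], "language": raw["language"], "domain": raw["domain"],
            "issue_number": raw["issue_number"], "pr_number": raw["pr_number"],
        },
    }
```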

HF upload (data/hf_upload/)

  • train.parquet
  • test.parquet
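One way such a split could be produced deterministically is by hashing record identity instead of drawing random numbers, so re-runs assign every record to the same split. This is a sketch; merge.py may use a different scheme:

```python
import hashlib

# Hypothetical hash-based split: stable across runs because it keys on
# repo + issue number rather than a random seed.
def assign_split(record, test_fraction=0.1):
    key = f"{record['repo']}#{record['issue_number']}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "test" if bucket < test_fraction * 100 else "train"
```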

Quality filters

Records are kept only if they pass all checks:

  • Issue body length ≥ 100 characters
  • Changed diff lines between 5 and 500
  • At least 1 PR review
  • Changed files ≤ 10
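The four checks above can be sketched as a single predicate over the raw record fields, assuming "changed diff lines" means additions plus deletions:

```python
# Sketch of the quality gate; a record must pass every check to be kept.
def passes_filters(rec):
    changed_lines = rec["additions"] + rec["deletions"]
    return (
        len(rec["issue_body"]) >= 100     # issue body length >= 100 characters
        and 5 <= changed_lines <= 500     # changed diff lines between 5 and 500
        and rec["review_count"] >= 1      # at least 1 PR review
        and rec["changed_files"] <= 10    # changed files <= 10
    )
```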

Resumability

  • data/raw/ is append-only; collection deduplicates by issue number
  • data/manifest.json tracks per-repo progress and is flushed during collection
  • Each pipeline stage is independent and reads from files on disk
  • Re-running collection is safe — already-collected issues are skipped
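Deduplicating by issue number over an append-only JSONL file might be sketched as follows (a simplified model; the manifest-based bookkeeping in patch-sft is more involved):

```python
import json
import os

# Hypothetical dedupe helper: gather issue numbers already present in an
# append-only JSONL file so a re-run can skip them.
def seen_issue_numbers(path):
    seen = set()
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                if line.strip():
                    seen.add(json.loads(line)["issue_number"])
    return seen

def append_new(path, records):
    seen = seen_issue_numbers(path)
    with open(path, "a") as f:
        for rec in records:
            if rec["issue_number"] not in seen:
                f.write(json.dumps(rec) + "\n")
                seen.add(rec["issue_number"])
```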

Publishing to PyPI

make publish

Requires ~/.pypirc with a valid PyPI token (username __token__, password pypi-...), or PYPI_API_TOKEN set in your environment.
