
GitHub issue-to-diff dataset pipeline for supervised fine-tuning


patch-sft

patch-sft is a Python pipeline for building supervised fine-tuning (SFT) datasets from real GitHub engineering history.

It collects closed GitHub issues and linked merged pull requests, extracts unified diffs, applies quality filters, formats examples into messages[] chat records, and publishes train/test splits to Hugging Face.

Using the published dataset

If you just want the pre-built dataset, no GitHub token or local pipeline is needed:

import patch

ds = patch.load("your-hf-username/patch-sft")
print(ds)
# DatasetDict({
#     train: Dataset({...}),
#     test:  Dataset({...}),
# })

df = ds["train"].to_pandas()

Or install and use it as a dependency in another project:

# pyproject.toml
dependencies = ["patch-sft>=0.1.0"]

import patch

# Download from HuggingFace Hub
ds = patch.load("your-hf-username/patch-sft")

# Train split only
train_ds = patch.load("your-hf-username/patch-sft", split="train")

Running your own collection

To collect from GitHub and build your own dataset:

Requirements

  • Python 3.11+
  • uv
  • GitHub personal access token
  • Hugging Face account/token (for pushing)

Setup

git clone https://github.com/your-username/patch-sft
cd patch-sft
make sync
make init-env

Edit .env with your values:

GITHUB_TOKEN=ghp_...        # required for collection
HF_REPO_ID=yourname/patch-sft
HF_TOKEN=hf_...             # optional if using `huggingface-cli login`
DATA_DIR=./data             # optional, defaults to ./data
PYPI_API_TOKEN=pypi-...     # only needed for `make publish`
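As a sketch of how these variables might be consumed (the actual config loading inside patch-sft may differ), the pipeline can read them from the environment with the defaults noted above:

```python
import os

# Hypothetical config reader; the keys mirror the .env entries above,
# but patch-sft's real loading logic may be structured differently.
def read_config(env=os.environ):
    token = env.get("GITHUB_TOKEN")
    if not token:
        raise RuntimeError("GITHUB_TOKEN is required for collection")
    return {
        "github_token": token,
        "hf_repo_id": env.get("HF_REPO_ID"),
        "hf_token": env.get("HF_TOKEN"),          # optional with huggingface-cli login
        "data_dir": env.get("DATA_DIR", "./data"),  # optional, defaults to ./data
    }
```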

Run the pipeline

# All repos
make collect
make process
make merge
make push

# Single repo
make collect-repo REPO=apache/arrow
make process-repo REPO=apache/arrow

# Full pipeline in one shot
make pipeline

Programmatic collection

import patch
from patch import RepoConfig

patch.collect(
    repos=[
        RepoConfig("apache", "arrow", "python", "data-engineering"),
        RepoConfig("fastapi", "fastapi", "python", "web"),
    ],
    token="ghp_...",      # or set GITHUB_TOKEN env var
    data_dir="./data",
)

CLI

After installation, the patch-sft command is available:

patch-sft collect [--repo owner/repo]
patch-sft process [--repo owner/repo]
patch-sft merge
patch-sft push
patch-sft peek
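A minimal sketch of how a subcommand layout like this could be wired with argparse (the real cli.py may be organized differently):

```python
import argparse

# Hypothetical dispatcher mirroring the subcommands above; only collect
# and process accept the optional --repo owner/repo flag.
def build_parser():
    parser = argparse.ArgumentParser(prog="patch-sft")
    sub = parser.add_subparsers(dest="command", required=True)
    for name in ("collect", "process"):
        p = sub.add_parser(name)
        p.add_argument("--repo", help="limit the stage to a single owner/repo")
    for name in ("merge", "push", "peek"):
        sub.add_parser(name)
    return parser
```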

Peek mode

Preview the GraphQL request structure and one sample record without writing any files:

make peek
# or
patch-sft peek

The target is hardcoded to pola-rs/polars. It prints a request example and a normalized sample record to stdout.
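The kind of GraphQL request being previewed might look like the following. The query is illustrative, not patch-sft's exact field selection, though the `ClosedEvent`/`closer` pattern is how GitHub's GraphQL schema links a closed issue to its closing pull request:

```python
# Illustrative GraphQL payload for fetching closed issues with their
# closing PRs; the exact fields patch-sft requests are an assumption.
QUERY = """
query($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) {
    issues(states: CLOSED, first: 50) {
      nodes {
        number
        title
        timelineItems(itemTypes: CLOSED_EVENT, first: 1) {
          nodes { ... on ClosedEvent { closer { ... on PullRequest { number } } } }
        }
      }
    }
  }
}
"""

def build_request(owner, name):
    # Standard GraphQL-over-HTTP request body: query plus variables.
    return {"query": QUERY, "variables": {"owner": owner, "name": name}}
```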

Project layout

patch-sft/
├── pyproject.toml
├── .env.example
├── Makefile
├── src/
│   └── patch/
│       ├── __init__.py     # public API: collect(), load(), RepoConfig, REPOS
│       ├── cli.py          # patch-sft CLI entry point
│       ├── repos.py        # curated repo list and RepoConfig
│       ├── collect.py      # GitHub GraphQL + REST collection
│       ├── process.py      # quality filters and SFT formatting
│       ├── merge.py        # train/test split → Parquet
│       ├── push.py         # HuggingFace Hub upload
│       ├── peek.py         # single-sample GraphQL preview
│       ├── manifest.py     # per-repo progress tracking
│       └── filters.py      # record quality filters
└── data/
    ├── raw/                # per-repo JSONL (append-only)
    ├── processed/          # filtered and formatted JSONL
    └── hf_upload/          # train.parquet / test.parquet

Output formats

Raw record (data/raw/*.jsonl)

Each line is a JSON object:

  • repo, language, domain, license, collected_at
  • issue_number, issue_title, issue_body, issue_labels, issue_created_at
  • pr_number, pr_title, pr_body, pr_merged_at, base_branch, base_sha, merge_sha, closing_pr_confidence
  • diff, review_count, changed_files, additions, deletions, has_tests, test_files_changed

Failed diff fetches are written to data/raw/*.errors.jsonl and retried automatically on the next run.
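Put together, a raw record line might look like this. Every value below is made up for illustration, not taken from a real repository:

```python
import json

# Illustrative raw record matching the field list above; all values invented.
raw = {
    "repo": "example/project", "language": "python", "domain": "web",
    "license": "MIT", "collected_at": "2024-01-01T00:00:00Z",
    "issue_number": 123, "issue_title": "Fix crash on empty input",
    "issue_body": "Calling parse() with an empty string raises an exception.",
    "issue_labels": ["bug"], "issue_created_at": "2023-12-01T00:00:00Z",
    "pr_number": 124, "pr_title": "Handle empty input in parse()",
    "pr_body": "Closes #123", "pr_merged_at": "2023-12-02T00:00:00Z",
    "base_branch": "main", "base_sha": "abc123", "merge_sha": "def456",
    "closing_pr_confidence": 1.0,
    "diff": "--- a/parser.py\n+++ b/parser.py\n@@ -1,3 +1,4 @@\n",
    "review_count": 2, "changed_files": 1,
    "additions": 4, "deletions": 1,
    "has_tests": True, "test_files_changed": 1,
}
line = json.dumps(raw)  # one JSON object per JSONL line
```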

Processed record (data/processed/*.jsonl)

Formatted for instruction SFT:

  • messages: [{role: system}, {role: user}, {role: assistant}]
  • metadata: repo/language/domain identifiers plus issue/PR provenance fields
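A sketch of what that formatting step could look like. The system prompt wording and the exact metadata fields are assumptions, not process.py's actual output:

```python
# Hypothetical formatter from a raw record to a messages[] chat record;
# the real prompt text and metadata layout may differ.
def to_sft_record(raw):
    return {
        "messages": [
            {"role": "system",
             "content": "You are a software engineer. Produce a unified diff that resolves the issue."},
            {"role": "user",
             "content": f"Repository: {raw['repo']}\n\nIssue: {raw['issue_title']}\n\n{raw['issue_body']}"},
            {"role": "assistant", "content": raw["diff"]},
        ],
        "metadata": {
            "repo": raw["repo"], "language": raw["language"], "domain": raw["domain"],
            "issue_number": raw["issue_number"], "pr_number": raw["pr_number"],
        },
    }
```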

HF upload (data/hf_upload/)

  • train.parquet
  • test.parquet
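One way such a split could be produced deterministically is by hashing record identity instead of drawing random numbers, so re-runs assign every record to the same split. This is a sketch; merge.py may use a different scheme:

```python
import hashlib

# Hypothetical hash-based split: stable across runs because it keys on
# repo + issue number rather than a random seed.
def assign_split(record, test_fraction=0.1):
    key = f"{record['repo']}#{record['issue_number']}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "test" if bucket < test_fraction * 100 else "train"
```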

Quality filters

Records are kept only if they pass all checks:

  • Issue body length ≥ 100 characters
  • Changed diff lines between 5 and 500
  • At least 1 PR review
  • Changed files ≤ 10
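The four checks above can be sketched as a single predicate over the raw record fields, assuming "changed diff lines" means additions plus deletions:

```python
# Sketch of the quality gate; a record must pass every check to be kept.
def passes_filters(rec):
    changed_lines = rec["additions"] + rec["deletions"]
    return (
        len(rec["issue_body"]) >= 100     # issue body length >= 100 characters
        and 5 <= changed_lines <= 500     # changed diff lines between 5 and 500
        and rec["review_count"] >= 1      # at least 1 PR review
        and rec["changed_files"] <= 10    # changed files <= 10
    )
```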

Resumability

  • data/raw/ is append-only; collection deduplicates by issue number
  • data/manifest.json tracks per-repo progress and is flushed during collection
  • Each pipeline stage is independent and reads from files on disk
  • Re-running collection is safe — already-collected issues are skipped
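Deduplicating by issue number over an append-only JSONL file might be sketched as follows (a simplified model; the manifest-based bookkeeping in patch-sft is more involved):

```python
import json
import os

# Hypothetical dedupe helper: gather issue numbers already present in an
# append-only JSONL file so a re-run can skip them.
def seen_issue_numbers(path):
    seen = set()
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                if line.strip():
                    seen.add(json.loads(line)["issue_number"])
    return seen

def append_new(path, records):
    seen = seen_issue_numbers(path)
    with open(path, "a") as f:
        for rec in records:
            if rec["issue_number"] not in seen:
                f.write(json.dumps(rec) + "\n")
                seen.add(rec["issue_number"])
```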

Publishing to PyPI

make publish

Requires ~/.pypirc with a valid PyPI token (username __token__, password pypi-...), or PYPI_API_TOKEN set in your environment.
