Python toolchain for building and maintaining FormulaCode benchmark tasks.
Project description
FormulaCode is a continually updating benchmark for evaluating the holistic ability of LLM agents to optimize codebases. FormulaCode consists of two parts: a pipeline to construct performance optimization tasks, and an execution harness that connects a language model to our terminal sandbox. This repository contains the task generation pipeline.
fc-data is a Python package for automatically curating and managing FormulaCode tasks. After installation, fc-data is designed to run as a monthly CRON job that updates the FormulaCode dataset with new commits and repositories.
High level overview
graph LR
A --->|scrape| B
A2 <-->|sync| B
B -->|publish| C
B -->|publish| D
A[Github]
A2[Supabase]
B["`fc-data
(This repository)`"]
C[DockerHub]
D[HuggingFace]
Use cases
fc-data is designed primarily to enable continual dataset updates for FormulaCode. After installation, the monthly update is a single command:
$ pip install fc-data
$ fc-data --start-date 2026-02-01 --end-date 2026-03-01
This runs six stages in order: scrape repos, scrape commits, classify PRs, resolve packages, synthesize Docker images, and publish the Docker images to DockerHub and the PRs to HuggingFace. The dataset is versioned by month (e.g. formulacode@2026-03). On our servers, this command runs as a monthly CRON job.
However, this isn't the only use case for fc-data. We've designed fc-data to help you manage your own custom GitHub-centric benchmark. Each benchmark consists of tasks, each of which revolves around a GitHub issue (or pull request, which is just an issue with extra details). We include some helpful properties to start off:
from datasmith.github import PR, GitHubClient
from datasmith.utils import TokenPool
# Every task starts with a PR.
pr = PR(repository="astropy/astropy", issue_number=16222)
# PRs are frozen Pydantic v2 models — immutable after creation.
pr.merge_commit_sha # the merge commit sha
pr.base_sha # base branch commit
pr.cache_key # "astropy/astropy:16222" — used for Supabase caching
# Or fetch a fully-hydrated PR (tries Supabase first, then GitHub API):
pr = await PR.fetch("astropy/astropy", 16222)
pr.merge_commit_sha # now populated from the database or API
You can also fetch live data from GitHub using the async client directly:
pool = TokenPool() # reads GH_TOKENS env var, rotates tokens on rate-limit
gh = GitHubClient(pool)
# Fetch a PR from the GitHub API.
pr = await gh.get_pr("pandas-dev", "pandas", 16222)
# Fetch the diff as a string.
diff = await gh.get_diff("pandas-dev", "pandas", 16222)
# Fetch the timeline of events.
events = await gh.get_timeline("pandas-dev", "pandas", 16222)
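The rate-limit rotation that TokenPool performs can be sketched as follows. This is an illustrative stand-in, not datasmith's actual implementation; the class and method names here are hypothetical:

```python
from itertools import cycle

class RoundRobinPool:
    """Illustrative stand-in for TokenPool: rotate through tokens,
    skipping any that are currently rate-limited."""

    def __init__(self, tokens):
        self._tokens = list(tokens)
        self._cycle = cycle(self._tokens)
        self._exhausted = set()

    def next_token(self):
        # Walk the cycle at most once around, skipping exhausted tokens.
        for _ in range(len(self._tokens)):
            tok = next(self._cycle)
            if tok not in self._exhausted:
                return tok
        raise RuntimeError("all tokens are rate-limited")

    def mark_rate_limited(self, token):
        self._exhausted.add(token)

pool = RoundRobinPool(["ghp_aaa", "ghp_bbb"])
tok = pool.next_token()       # "ghp_aaa"
pool.mark_rate_limited(tok)   # a 403/429 from GitHub would trigger this
pool.next_token()             # "ghp_bbb"
```

The real TokenPool reads its tokens from the GH_TOKENS environment variable rather than taking them as an argument.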
Want to extract structured information from the PR? Use our built-in agents or define your own!
from datasmith.github import render_problem_statement, scrape_links
# Render a problem statement from the PR and its linked issues.
statement = render_problem_statement(pr, anonymize=True)
# You can also scrape for linked issues via BFS.
issues = await scrape_links(pr, gh.get_issue, depth=2, only_issues=True, limit=6)
# Then pass them into the renderer for richer context.
statement = render_problem_statement(pr, issues=issues, repo_description="pandas is a data analysis library")
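The depth-limited BFS behind scrape_links can be sketched synchronously. This is illustrative only: the real scrape_links is async and fetches issues through the GitHub client, while bfs_links and the toy link graph below are hypothetical:

```python
from collections import deque

def bfs_links(start, neighbors, depth=2, limit=6):
    """Breadth-first traversal of issue links, bounded by depth and
    by a cap on the total number of linked issues collected."""
    seen = {start}
    found = []
    queue = deque([(start, 0)])
    while queue and len(found) < limit:
        node, d = queue.popleft()
        if d == depth:  # don't expand past the depth bound
            continue
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                found.append(nxt)
                queue.append((nxt, d + 1))
                if len(found) == limit:
                    break
    return found

# Toy link graph: PR 16222 links issues 100 and 101; 100 links 102; ...
links = {16222: [100, 101], 100: [102], 101: [], 102: [103]}
bfs_links(16222, lambda n: links.get(n, []))  # [100, 101, 102]
```

With depth=2, issue 103 is never reached: it sits three hops from the starting PR.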
Don't like the current set of operations? Define your own!
# You can register custom hooks for dataset-specific operations.
from datasmith.github import HookRegistry
from dspy import ChainOfThought
summarizer = ChainOfThought("document -> summary")
def summarize(pr):
doc = render_problem_statement(pr, anonymize=True)
return summarizer(doc).summary
HookRegistry.register("summarize", summarize) # auto-wrapped with @supabase_cached
# Now use it:
pr = PR(repository="astropy/astropy", issue_number=16222)
HookRegistry.call("summarize", pr) # first call: hits LLM
HookRegistry.call("summarize", pr) # second call: reads from Supabase cache. No cost!
Almost all our supported operations can be run asynchronously. Here's how to run some FormulaCode-specific operations at scale:
from datasmith.runners import ClassifyPRsRunner
from datasmith.agents import PerfClassifier, ClassifyJudge
runner = ClassifyPRsRunner(PerfClassifier(), ClassifyJudge(), n_concurrent=64)
await runner.run(pr_items)
# Progress tracked in Supabase runner_progress table.
# Per-item failures logged in runner_failures — the runner never aborts.
By default, each operation is cached in Supabase so you don't keep hitting expensive hooks.
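The caching idea can be sketched in a few lines. This is a minimal, hypothetical version of what @supabase_cached does, keyed the same way as the hook_cache table's primary key (entity_key, hook_name, args_hash); a plain dict stands in for Supabase:

```python
import functools
import hashlib
import json

_cache = {}  # stands in for Supabase's hook_cache table

def cached_hook(hook_name):
    """Illustrative sketch: key each result by (entity_key, hook_name,
    args_hash) and skip the expensive call on a cache hit."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(entity_key, *args, **kwargs):
            args_hash = hashlib.sha256(
                json.dumps([args, kwargs], sort_keys=True, default=str).encode()
            ).hexdigest()
            key = (entity_key, hook_name, args_hash)
            if key not in _cache:
                _cache[key] = fn(entity_key, *args, **kwargs)  # miss: run the hook
            return _cache[key]
        return wrapper
    return decorator

calls = []

@cached_hook("summarize")
def summarize(entity_key):
    calls.append(entity_key)  # the expensive LLM call would go here
    return f"summary of {entity_key}"

summarize("astropy/astropy:16222")  # miss: runs the hook
summarize("astropy/astropy:16222")  # hit: served from the cache
len(calls)  # 1
```

Because the hash covers both positional and keyword arguments, calling the same hook with different arguments produces a distinct cache row rather than a stale hit.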
A pull request is useless if you cannot build a reproducible environment for it. fc-data supports building Docker images for any pull request using a three-tier hierarchy:
from datasmith.docker import ImageManager, MultiObjVerifier, SmokeVerifier, ProfileVerifier
mgr = ImageManager()
mgr.build_base_image() # formulacode/base:latest (uses the default Dockerfile.base)
mgr.build_repo_image("pandas-dev", "pandas")  # formulacode/pandas-dev-pandas:latest (looks up a Dockerfile.repo for pandas-dev/pandas in Supabase, falling back to the default Dockerfile.repo)
mgr.build_pr_image("pandas-dev", "pandas", 16222)  # formulacode/pandas-dev-pandas:16222 (looks up a Dockerfile.pr for pandas-dev/pandas#16222 in Supabase, falling back to the default Dockerfile.pr)
# Alternatively, supply a custom build context:
mgr.build_base_image(context="path/to/custom/context")
mgr.build_repo_image("pandas-dev", "pandas", context="path/to/custom/context")
mgr.build_pr_image("pandas-dev", "pandas", 16222, context="path/to/custom/context")
# Verify an image with a chain of verifiers — short-circuits on first failure.
verifier = MultiObjVerifier(verifiers=[
SmokeVerifier("pandas"), # can we import the package?
ProfileVerifier(timeout=300), # can we discover and run ASV benchmarks?
])
result = verifier.verify("formulacode/pandas-dev-pandas:16222")
# result.ok, result.rc, result.stdout, result.stderr, result.duration_s
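The short-circuit behavior can be sketched with plain functions. This is an illustrative model of how a verifier chain stops at the first failure, not MultiObjVerifier's actual code; Result, verify_chain, and the toy verifiers are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Result:
    ok: bool
    detail: str = ""

def verify_chain(image, verifiers):
    """Run each verifier in order; stop at the first failure so the
    cheap checks gate the expensive ones."""
    for v in verifiers:
        result = v(image)
        if not result.ok:
            return result  # short-circuit: later verifiers never run
    return Result(ok=True, detail="all verifiers passed")

ran = []

def smoke(image):
    ran.append("smoke")
    return Result(ok=False, detail="import failed")

def profile(image):
    ran.append("profile")
    return Result(ok=True)

verify_chain("formulacode/pandas-dev-pandas:16222", [smoke, profile])
ran  # ["smoke"] — profile was skipped
```

Ordering matters: putting the fast smoke test before the slow profiling run means a broken image fails in seconds instead of minutes.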
One of the main features of fc-data is the ability to automatically synthesize Docker containers for a pull request. The synthesizer is a state machine that checks Supabase for cached contexts, tries similar build scripts, then falls back to an installed CLI agent (Claude Code, Codex, or Gemini — auto-detected):
from datasmith.agents import Synthesizer
from datasmith.docker import MultiObjVerifier, SmokeVerifier, ProfileVerifier
from datasmith.docker.context import DockerContext
# The verifier chain validates each synthesis attempt.
verifier = MultiObjVerifier(verifiers=[
SmokeVerifier("pandas"), # can we import the package?
ProfileVerifier(timeout=300), # can we discover and run ASV benchmarks?
])
# Load a base Docker build context (Dockerfile + shell scripts) to iterate on.
base_context = DockerContext.from_directory("dataset/formulacode_verified/pandas-dev_pandas/abc123")
synth = Synthesizer(max_attempts=3)
ctx = synth.run(
owner="pandas-dev",
repo="pandas",
issue_number=16222,
pr_context="This PR optimizes groupby performance by ...",
verifier=verifier,
sha="abc123def456",
base_context=base_context,
env_payload='{"dependencies": ["numpy==1.26.0", "cython==3.0.0"]}',
python_version="3.10",
)
# Checking cache for pandas-dev/pandas@abc123def456... [MISS]
# Found 4 similar scripts from pandas-dev/pandas
# Attempt 1/4 with similar script... [FAIL]
# Launching claude agent sandbox in /tmp/synthesis-xxx...
# Sandbox synthesis succeeded [PASS]
# Saved context for pandas-dev/pandas@abc123def456
#
# On success, the DockerContext is persisted to Supabase's candidate_containers table.
# ctx is a DockerContext with the working build scripts, or None if all attempts failed.
If ALL attempts fail, the synthesizer logs every attempt (stderr, stdout, model, script used) to Supabase's build_attempts table and returns None. Failed PRs can be retried later — the logged attempts provide context for debugging or a future synthesis run.
This can be run asynchronously as well for multiple tasks (WARNING: Might be expensive!):
from datasmith.runners import SynthesizeImagesRunner
runner = SynthesizeImagesRunner(synth, verifier, n_concurrent=8)
await runner.run(pr_items)
# Returns None entries for PRs where synthesis failed.
How do we make a dataset out of this? Query Supabase directly and publish:
from datasmith.utils.db import get_client
from datasmith.publish import records_from_supabase, HuggingFacePublisher
# Query all verified, unpublished perf PRs from the last month.
records = records_from_supabase(start_date="2026-02-01", end_date="2026-03-01")
# Or query Supabase directly for more control.
sb = get_client()
rows = sb.table("pull_requests") \
.select("*") \
.eq("is_performance_commit", True) \
.not_.is_("container_name", "null") \
.execute()
# Publish to HuggingFace as a versioned Parquet dataset.
hf = HuggingFacePublisher()
hf.publish(records, version="formulacode@2026-03")
We define tasks using terminal-bench's formulacode adapter for evaluation:
from terminal_bench.adapters.formulacode import FormulaCodeAdapter
from terminal_bench.harness.harness import Harness
adapter = FormulaCodeAdapter(task_dir="fctasks/", force=True)
adapter.generate_task(pr.to_record())
run = Harness(
output_path="fcevals/",
dataset_path="dataset_path",
task_ids=[pr.to_record().task_id],
agent_configs=[
{"agent_name": "nop", "model_name": "nop"},
{"agent_name": "oracle", "model_name": "oracle"},
],
)
print(run.results[0].is_resolved) # Did the oracle get a speedup > 1.00 over baseline?
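The resolution criterion in that last line can be written out as a small sketch. These helpers are illustrative, not the harness's actual code; the timings are made up:

```python
def speedup(baseline_s: float, optimized_s: float) -> float:
    """Speedup of the optimized run over the baseline (>1.0 means faster)."""
    return baseline_s / optimized_s

def is_resolved(baseline_s: float, optimized_s: float) -> bool:
    # A task counts as resolved when the agent beats the baseline at all.
    return speedup(baseline_s, optimized_s) > 1.00

is_resolved(12.4, 9.8)   # True  — roughly a 1.27x speedup
is_resolved(12.4, 12.4)  # False — no improvement over baseline
```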
Database schema
There are six tables in Supabase (Postgres):
| Table | Primary key | Purpose |
|---|---|---|
| `repositories` | (owner, repo) | Scraped GitHub repos (language, stars, topics, description) |
| `pull_requests` | (owner, repo, issue_number) | PR metadata, classification, rendered problems, publish status |
| `hook_cache` | (entity_key, hook_name, args_hash) | Deterministic cache for @supabase_cached |
| `build_attempts` | id (serial) | Every Docker build attempt (model, script, ok, stderr/stdout tails) |
| `runner_progress` | runner_id | Per-runner progress (total, completed, failed) |
| `runner_failures` | id (serial) | Per-item failure details (error message, traceback) |
Installation
Install uv and Node.js (for Supabase CLI), then set up the development environment:
# Install uv
$ curl -LsSf https://astral.sh/uv/install.sh | sh
# Install npm (for Supabase CLI)
$ curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.7/install.sh | bash
$ nvm install --lts
$ nvm use --lts
# Install dev environment and pre-commit hooks
$ make install
Create a tokens.env file in the repo root:
# Supabase (required)
SUPABASE_URL=http://127.0.0.1:54321
SUPABASE_KEY=your-service-role-key
# GitHub (required — comma-separated for multiple tokens)
GH_TOKENS=github_pat_xxx,github_pat_yyy
# LLM backends (for classification and synthesis)
DSPY_MODEL=openai/gpt-oss-120b
DSPY_API_BASE=http://localhost:30000/v1
DSPY_API_KEY=local
DSPY_MAX_TOKENS=16000
# DockerHub (for publishing)
DOCKERHUB_USERNAME=formulacode
DOCKERHUB_TOKEN=dckr_pat_xxxxx
# HuggingFace (for dataset publishing)
HF_TOKEN_PATH=/path/to/huggingface/token
Supabase
Start the local Supabase instance and apply all migrations:
$ npx supabase start # starts Postgres, Auth, Storage, Studio, etc.
$ npx supabase migration up --local # apply migrations in supabase/migrations/
Common commands:
$ npx supabase status # show URLs, ports, and service health
$ npx supabase migration list --local # list applied / pending migrations
$ npx supabase db reset # wipe and recreate from migrations (destructive)
$ npx supabase stop # stop all containers
Studio is available at the URL printed by supabase status (default http://127.0.0.1:54323) — use it to browse tables, run SQL, and inspect data.
Running the preflight check verifies that all required variables are defined:
$ python -m datasmith.preflight
== Environment ==
[OK] SUPABASE_URL — http://127.0.0.1:54...
[OK] SUPABASE_KEY — ***
[OK] GH_TOKENS — 3 token(s)
[OK] HF_TOKEN — /path/to/huggingface/token
== Supabase ==
[OK] Connection
== Docker ==
[OK] Docker daemon
== GitHub ==
[OK] API access — remaining=4998
========================================
All checks passed!
After that works, run the tests locally. Every new feature MUST have a test:
$ make check # ruff lint + mypy type check
$ make test # pytest
Updating FormulaCode
The monthly update is a single command:
$ fc-data --start-date 2026-02-01 --end-date 2026-03-01
This runs six stages in order: scrape repos, scrape commits, classify PRs, resolve packages, synthesize Docker images, and publish to DockerHub + HuggingFace. Options:
$ fc-data --start-date 2026-02-01 --end-date 2026-03-01 --resume # skip completed stages
$ fc-data --start-date 2026-02-01 --end-date 2026-03-01 --stage 4 # run only package resolution
$ fc-data --start-date 2026-02-01 --end-date 2026-03-01 --dry-run # log without executing
$ fc-data --start-date 2026-02-01 --end-date 2026-03-01 --stage 5 \
--agent codex --n-concurrent 5 --tasks-per-repo 5 # synthesis with codex, capped
$ fc-data --start-date 2026-02-01 --end-date 2026-03-01 --stage 5 \
--force # re-run synthesis for all tasks
| Flag | Description |
|---|---|
| `--resume` | Skip stages already marked complete and resume from the next pending stage |
| `--stage N` | Run only stage N (1–6) |
| `--dry-run` | Log what each stage would do without executing |
| `--n-concurrent N` | Max concurrent items per runner stage |
| `--tasks-per-repo N` | Cap tasks per repository for stage 5 (synthesize_images) |
| `--agent {claude,codex,gemini}` | CLI agent for stage 5 synthesis (default: auto-detect first available) |
| `--force` | Re-run synthesis even for tasks that already have a container or cached context (stage 5 only) |
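The --stage/--resume semantics above can be sketched as a stage selector. This is an illustrative model under the assumption that completed stages are tracked as a set of names; stages_to_run and the exact stage identifiers are hypothetical:

```python
STAGES = ["scrape_repos", "scrape_commits", "classify_prs",
          "resolve_packages", "synthesize_images", "publish"]

def stages_to_run(completed, stage=None, resume=False):
    """--stage N runs exactly one stage; --resume skips stages already
    marked complete; otherwise all six stages run in order."""
    if stage is not None:
        return [STAGES[stage - 1]]
    if resume:
        return [s for s in STAGES if s not in completed]
    return list(STAGES)

stages_to_run(set(), stage=4)  # ["resolve_packages"]
stages_to_run({"scrape_repos", "scrape_commits"}, resume=True)
# ["classify_prs", "resolve_packages", "synthesize_images", "publish"]
```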
Dataset verification
Each task lives in dataset/formulacode_verified/<owner_repo>/<sha>/ with a multi-stage Dockerfile and shell build scripts. The verification loop:
$ python dataset/verify.py --task dataset/formulacode_verified/<owner_repo>/<sha>
# Check failure.json for errors -> edit docker_build_pkg.sh / docker_build_run.sh -> rerun
# Done when verification_success.json appears
Only modify docker_build_pkg.sh and docker_build_run.sh during verification fixes.
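The loop's exit conditions can be sketched as a state check over the marker files named above. This is illustrative only; verification_state is a hypothetical helper, not part of verify.py:

```python
import tempfile
from pathlib import Path

def verification_state(task_dir):
    """Done once verification_success.json appears; a failure.json means
    the build scripts need editing before the next run."""
    task = Path(task_dir)
    if (task / "verification_success.json").exists():
        return "done"
    if (task / "failure.json").exists():
        return "edit docker_build_pkg.sh / docker_build_run.sh, then rerun"
    return "run verify.py"

task = Path(tempfile.mkdtemp())
verification_state(task)  # "run verify.py"
(task / "failure.json").touch()
verification_state(task)  # "edit docker_build_pkg.sh / docker_build_run.sh, then rerun"
(task / "verification_success.json").touch()
verification_state(task)  # "done"
```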
$ python scratch/scripts/prepare_formulacode_dataset.py \
--input scratch/artifacts/pipeflush/perfonly_commits_master.parquet \
--output scratch/artifacts/pipeflush/perfonly_enriched.parquet \
--dockerhub-repository formulacode/all \
--upload-to-hf formulacode/formulacode-all \
--hf-verified-filter /path/to/valid_tasks.json
Requires HF_TOKEN in tokens.env. The upload creates default, verified, and per-month (YYYY-MM) configs on Hugging Face.
Evaluation
Evaluation is done in FormulaCode's fork of the terminal-bench evaluation framework.