TF-Bench

Evaluating Program Semantics Reasoning with Type Inference in System F

Figure: evaluation workflow

If you find this work useful, please cite us as:

@inproceedings{he2025tfbench,
    author = {He, Yifeng and Yang, Luning and Gonzalo, Christopher and Chen, Hao},
    title = {Evaluating Program Semantics Reasoning with Type Inference in System F},
    booktitle = {Neural Information Processing Systems (NeurIPS)},
    date = {2025-11-30/2025-12-07},
    address = {San Diego, CA, USA},
}

Development

Python

We use Python 3.11. We recommend using uv to manage your Python dependencies.

cd TF-Bench
uv sync # create a virtual environment and install dependencies

Haskell

To run evaluation, you need GHC (the Glasgow Haskell Compiler) installed. We recommend using ghcup to install.

curl --proto '=https' --tlsv1.2 -sSf https://get-ghcup.haskell.org | sh

Due to the GHC dependency, the evaluation module currently supports only Linux and macOS. Our evaluation requires the TypeOperators and ImpredicativeTypes language extensions, so GHC version >= 9.2.1 is required. Our evaluation used GHC 9.6.7.
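
If you want to verify your toolchain programmatically, the version requirement comes down to a tuple comparison. This is an illustrative sketch; the `ghc_ok` helper is not part of TF-Bench:

```python
def ghc_ok(version: str, minimum: str = "9.2.1") -> bool:
    """Return True if a dotted GHC version string meets the minimum requirement."""
    def parse(v: str) -> tuple[int, ...]:
        return tuple(int(part) for part in v.split("."))
    return parse(version) >= parse(minimum)

ghc_ok("9.6.7")  # the GHC we used satisfies the requirement
ghc_ok("9.0.2")  # too old for ImpredicativeTypes as redesigned in 9.2.1
```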

Building TF-Bench from scratch (optional)

TF-Bench (base)

This script will build the benchmark (Prelude with NL) from the raw data.

uv run scripts/preprocess_benchmark.py -o tfb.json

TF-Bench (pure)

git clone https://github.com/SecurityLab-UCD/alpharewrite.git
cd alpharewrite

stack build
stack exec alpharewrite-exe 1 ../tfb.json > ../tfb.pure.json

cd ..

For details, please check out the README of alpharewrite.

Download pre-built benchmark

You can also load the pre-built TF-Bench from HuggingFace datasets.

from datasets import load_dataset

split = "pure" # or "base"
dataset = load_dataset("SecLabUCD/TF-Bench", split=split)

Or through our provided package.

from tfbench import load_tfb_from_hf

split = "pure" # or "base"
dataset = load_tfb_from_hf(split)

Using as an application

git clone https://github.com/SecurityLab-UCD/TF-Bench.git
cd TF-Bench
uv sync

Please have your API key ready in .env.

Proprietary models

We use each provider's official SDK to access their models. You can check the pre-supported models in the tfbench.lm module.

from tfbench.lm import supported_models
print(supported_models)

To run a single model (this runs both the base and pure splits):

uv run main.py -m gpt-5-2025-08-07

Open-weights models with Ollama

We originally used Ollama to manage and run the open-weights models reported in the Appendix, but have since switched to vLLM for its better performance and SDK design. The Ollama option is still available but no longer maintained; we recommend using vLLM instead.

curl -fsSL https://ollama.com/install.sh | sh # install ollama, you need sudo for this
ollama serve # start your own instance instead of a system service

NOTE: Ollama version 0.9.0 or later is required to enable the thinking parsers. We used 0.11.7 for our experiments.

> ollama --version
ollama version is 0.11.7

Run the benchmark.

uv run src/main.py -m llama3:8b

Running any model on HuggingFace Hub

We also support running any model that is on HuggingFace Hub out-of-the-box. We provide an example using Qwen3.

uv run src/main.py Qwen/Qwen3-4B-Instruct-2507 # or other models

Note that our main.py uses a pre-defined model router, which routes all unrecognized model names to HuggingFace. We use the </think> token to parse the thinking process; if your model delimits its thinking differently, please see the next section.
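
The `</think>` parsing convention mentioned above boils down to splitting the raw response at the closing token. The sketch below is an illustrative stand-in, not the package's actual parser:

```python
def split_thinking(raw: str) -> tuple[str, str]:
    """Split a raw model response into (reasoning, answer) at the
    closing </think> token; responses without the token are treated
    as answer-only."""
    if "</think>" in raw:
        reasoning, _, answer = raw.partition("</think>")
        return reasoning.removeprefix("<think>").strip(), answer.strip()
    return "", raw.strip()

reasoning, answer = split_thinking("<think>apply f, then g</think>a -> b")
```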

Running your own model

To support your customized model, pass the path to your HuggingFace-compatible checkpoint to our main.py.

uv run src/main.py <path to your checkpoint>

Using as a package

Our package is also available on PyPI.

uv add tfbench

Or install it directly with pip:

pip install tfbench

Proprietary model checkpoints that are not currently supported

Our supported-models list is used to route each model name to the correct SDK. Even if a newly released model is not yet in the list, you can still use it by specifying the SDK client directly. We take OpenAI GPT-4.1 as an example here.

from tfbench.lm import OpenAIResponses
from tfbench import run_one_model

model = "gpt-4.1"
split = "pure"
client = OpenAIResponses(model_name=model, pure=split == "pure", effort=None)
eval_result = run_one_model(client, pure=split == "pure", effort=None)

Supporting other customized models

You can implement your own LM subclass.

from tfbench.lm._types import LM, LMAnswer

class YourLM(LM):
    def __init__(self, model_name: str, pure: bool = False):
        """initialize your model"""
        super().__init__(model_name=model_name, pure=pure)
        ...

    def _gen(self, prompt: str) -> LMAnswer:
        """your generation logic here"""
        # `content` and `thinking_content` are placeholders for your
        # model's answer and its reasoning trace
        return LMAnswer(answer=content, reasoning_steps=thinking_content)
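
As a concrete toy example, here is a subclass that answers every prompt with a fixed type. The `LM` and `LMAnswer` definitions below are simplified stand-ins so the sketch runs without tfbench installed; in real use, import them from tfbench.lm._types as shown above:

```python
from dataclasses import dataclass

@dataclass
class LMAnswer:  # stand-in for tfbench.lm._types.LMAnswer
    answer: str
    reasoning_steps: str

class LM:  # stand-in for tfbench.lm._types.LM
    def __init__(self, model_name: str, pure: bool = False):
        self.model_name = model_name
        self.pure = pure

class ConstantLM(LM):
    """Toy model that answers every prompt with the identity type."""

    def _gen(self, prompt: str) -> LMAnswer:
        return LMAnswer(answer="a -> a", reasoning_steps="always the identity")

ans = ConstantLM("constant")._gen("id :: ?")
```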
