
A tool for generating synthetic function call datasets for Large Language Models (LLMs).


🛠️ openllm-func-call-synthesizer


Lightweight toolkit to synthesize function-call datasets and convert them to formats compatible with OpenAI-style function-call training and downstream tooling (including Llama Factory compatible exports).


✨ Features

  • 📝 Generate synthetic function call datasets for LLM training and evaluation
  • ⚙️ Flexible configuration via YAML and Hydra
  • 💻 CLI interface powered by Typer & Rich
  • 🔧 Utility functions for dataset manipulation
  • 🔄 Extensible and easy to integrate into your own pipeline
  • 🌐 Supports multiple LLM backends (OpenAI, Google, etc.)
  • 📊 Export formats: JSONL, CSV, Parquet, LlamaFactory-compatible

🛠 Installation

Prerequisites

  • Python 3.12+ (to match the environment used by the project)

  • API credentials for any LLM backend (set via environment variables or .env file)

    • Example: OPENAI_API_KEY
    • See .env.example for reference
  • 🔌 MCP Server (Required)

    This project relies on an MCP server to provide tool/function metadata.

    Before running the synthesizer, you must start an MCP server.

    ▶ Start the example MCP server

    An example MCP server is included in the repository:

    python examples/mcp_example_server/server.py

    This will start a local MCP server that the synthesizer can connect to.

    Make sure your configuration (e.g. mcp_servers.transport) matches the server address.

    ⚠ Important

    • The synthesizer will fail if no MCP server is available.
    • Ensure the server is running before executing:

python -m apps.main

    • If you see connection errors, verify:
      • The server is running
      • The transport URL in your config is correct
      • Network/firewall settings allow local connections
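
The server-address requirement above can be sketched as a minimal Hydra config entry. The field layout here is an assumption based on the `mcp_servers.transport` key mentioned in this section; the authoritative schema lives in examples/conf/synthesizer/default.yaml.

```yaml
# Hypothetical sketch of an mcp_servers block; match the transport
# URL to the address printed by the example MCP server on startup.
mcp_servers:
  - name: example
    transport: http://localhost:8000/mcp
```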

Install from PyPI

pip install openllm-func-call-synthesizer
# or using uv
uv add openllm-func-call-synthesizer

Install from source

git clone https://github.com/diqiuzhuanzhuan/openllm-func-call-synthesizer.git
cd openllm-func-call-synthesizer
uv sync

Don't have uv installed? You can install it with a single command:

curl -LsSf https://astral.sh/uv/install.sh | sh

⚡ Quickstart

Run the synthesizer with default config:

python -m apps.main

Enable only query generation:

python -m apps.main synthesizer.query_generation.enable=True

Enable function-call generation with custom name:

python -m apps.main synthesizer.function_call_generation.enable=True synthesizer.function_call_generation.name=function_call_gpt_4o

Override languages dynamically:

python -m apps.main synthesizer.query_generation.languages=[English,Spanish]

📂 Outputs

  • Generated datasets are written under the data/ directory
  • Each run produces:
    • train.jsonl
    • output.csv
    • output.parquet
  • The llama_factory step creates a LlamaFactory-compatible train.jsonl
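
The JSONL outputs are plain newline-delimited JSON, so they can be inspected with a few lines of Python. The record fields below (query, function_call) are illustrative; the actual columns depend on which pipeline stages are enabled.

```python
import json
from pathlib import Path

# Illustrative record; real rows carry whatever fields your pipeline
# emits (e.g. query, functions, function_call, score).
sample = {
    "query": "What's the weather in Paris?",
    "function_call": {"name": "get_weather", "arguments": {"city": "Paris"}},
}

# Write a one-row JSONL file, then read it back line by line.
path = Path("train.jsonl")
path.write_text(json.dumps(sample) + "\n", encoding="utf-8")

rows = [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines()]
print(rows[0]["function_call"]["name"])  # -> get_weather
```

The same pattern works for any of the generated train.jsonl files.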

🧪 Testing

Run the test suite:

pytest -q

📝 Configuration Highlights

Configuration file: examples/conf/synthesizer/default.yaml

  • mcp_servers — MCP server(s) to query for available tools
  • choose_part_tools — filter toolset to a subset
  • query_generation — generate seed queries from function docs
  • function_call_generation — generate function-call pairs from queries
  • critic — optional scoring/critique step
  • llama_factory — export to LlamaFactory-compatible dataset
  • verl — export to a verl-compatible dataset

See docs for full field descriptions.

Default pipeline walk-through

The provided examples/conf/synthesizer/default.yaml wires every stage together:

  • MCP bootstrap: points to a local ugreen_mcp server on http://localhost:8000/mcp; leave it running before launching the synth job or queries will fail.
  • Tool filtering: choose_part_tools: false keeps the full toolset; set it to a list (e.g. ["search_photos"]) to restrict generations to specific tools.
  • Query generation: reads examples/function_docs.json, emits multilingual prompts (English/Chinese/Japanese/German) under data/function_query via parallel OpenAI + Google model pools, each with generous TPM throttles for high-throughput runs.
  • Function-call synthesis: consumes the query dataset, calls gpt-4o through the OpenAI backend, and writes data/function_call_gpt_4o/*.jsonl (set max_num to limit volume or switch output_format).
  • Critic pass: re-scores every call with gpt-5-mini-2025-08-07, expecting query/prompt/function_call/functions/answer fields and emitting a scored dataset named function_call_gpt_4o_critiqued_by_gpt_5_mini_2025_08_07.
  • Downstream exports: both llama_factory and verl blocks draw from the critic output, keep only rows with score >= 8, and materialize ready-to-train JSONL files plus optional train/val splits.
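
The stage wiring described above can be sketched as a trimmed-down YAML fragment. Every key and value here is an assumption reconstructed from the walk-through; consult examples/conf/synthesizer/default.yaml for the authoritative schema and full field set.

```yaml
# Hypothetical outline of the stages discussed above.
query_generation:
  enable: true
  languages: [English, Chinese, Japanese, German]
  output_dir: data/function_query
function_call_generation:
  enable: true
  name: function_call_gpt_4o
  model: gpt-4o
critic:
  enable: true
  model: gpt-5-mini-2025-08-07
llama_factory:
  enable: true
  min_score: 8   # keep only rows the critic scored >= 8
```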

Feel free to copy the default file, tweak model lists or directories, and pass it via python -m apps.main synthesizer=@your_config.yaml for customized runs. For the full set of options, refer to examples/conf/synthesizer/default.yaml.

🐚 Parallel Runner

Helper script: bin/run_pipeline.sh

  • Launch multiple synthesizer runs in parallel
  • Requires .venv virtual environment
  • Example usage:
chmod +x bin/run_pipeline.sh
bin/run_pipeline.sh default other
  • Logs are printed to console; returns non-zero if any run fails
  • Can also run manually using:
python -m apps.main synthesizer=default &
python -m apps.main synthesizer=other &
wait

Contributing

Contributions are welcome! Please refer to CONTRIBUTING.md for details.

License

MIT License. See LICENSE for details.

🌟 Star History

Star History Chart
