
Tracing, evaluation, and training utilities for LLM applications.


freesolo

freesolo is a Python tracing and evaluation package for LLM apps.

It is built for the lowest-friction integration possible:

  1. Install the package
  2. Set FREESOLO_API_KEY
  3. Wrap your OpenAI, Anthropic, Gemini, or OpenAI-compatible client
  4. Run traces and evaluations from the package APIs

Current provider support

freesolo currently supports automatic client instrumentation for:

  • OpenAI
  • Anthropic
  • Gemini
  • OpenAI-compatible clients via wrap(...) / wrap_provider(...)

Install

Install the package plus the provider client you use:

pip install freesolo openai

or

pip install freesolo anthropic

or

pip install freesolo google-genai

Environment

  • FREESOLO_API_KEY
  • FREESOLO_BASE_URL (optional, defaults to https://api.freesolo.co)

export FREESOLO_API_KEY=fslo_...
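
As a rough illustration of how these two variables fit together, the sketch below resolves them from a mapping and falls back to the documented default base URL. The helper name resolve_freesolo_settings is hypothetical and is not part of the freesolo package. It is only useful if you want to validate your environment before starting a run.

```python
import os
from typing import Mapping, Optional

DEFAULT_BASE_URL = "https://api.freesolo.co"  # default stated in this README


def resolve_freesolo_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical helper (not a freesolo API): read the key and base URL."""
    env = os.environ if env is None else env
    api_key = env.get("FREESOLO_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("FREESOLO_API_KEY is not set")
    # Assumption: an empty FREESOLO_BASE_URL is treated like an unset one.
    base_url = env.get("FREESOLO_BASE_URL", "").strip() or DEFAULT_BASE_URL
    return {"api_key": api_key, "base_url": base_url}
```

Calling resolve_freesolo_settings() with no argument reads from os.environ. Passing a dictionary makes the check easy to exercise in tests.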

Quickstart

from openai import OpenAI
from freesolo import wrap

client = wrap(OpenAI())

result = client.responses.create(
    model="gpt-4.1-mini",
    instructions="Reply in plain text.",
    input=[
        {
            "role": "user",
            "content": [{"type": "input_text", "text": "How do I reset my password?"}],
        }
    ],
)

print(result.output_text or "")

OpenRouter Quickstart

from openai import OpenAI
from freesolo import wrap

client = wrap(
    OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="YOUR_OPENROUTER_API_KEY",
    )
)

response = client.chat.completions.create(
    model="openai/gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "Reply in plain text."},
        {"role": "user", "content": "Write a one-sentence launch blurb."},
    ],
    max_tokens=120,
)

print(response.choices[0].message.content or "")

Gemini Quickstart

from google import genai
from freesolo import instrument_gemini

client = instrument_gemini(genai.Client())

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Write a one-sentence release note for traced Gemini support.",
)

print(response.text or "")

Group Multiple Model Calls

For agentic or long-horizon tasks, strongly prefer wrapping the whole task in start_trace(...) so all of the model calls land in one trace.

For a single one-off OpenAI, Anthropic, or Gemini request, you can skip start_trace(...): the instrumented request creates its own trace automatically.

from anthropic import Anthropic
from freesolo import instrument_anthropic, start_trace

client = instrument_anthropic(Anthropic())

with start_trace("support-agent-run"):
    first = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=64,
        messages=[{"role": "user", "content": "Say hello"}],
    )
    second = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=64,
        messages=[{"role": "user", "content": "Say goodbye"}],
    )
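
To picture why the context manager matters, here is a toy model of grouping. Calls made inside the block attach to the active trace, and a call made outside it becomes its own one-call trace. This is not freesolo's implementation. The names toy_trace, record_call, and finished_traces are hypothetical, and no model is called.

```python
import contextvars
from contextlib import contextmanager

# Toy stand-ins; none of these names come from freesolo.
_current = contextvars.ContextVar("current_trace", default=None)
finished_traces = []


@contextmanager
def toy_trace(title):
    """Open a trace; calls recorded inside the block attach to it."""
    trace = {"title": title, "calls": []}
    token = _current.set(trace)
    try:
        yield trace
    finally:
        _current.reset(token)
        finished_traces.append(trace)


def record_call(prompt):
    """Attach a call to the active trace, or create a one-call trace."""
    active = _current.get()
    if active is not None:
        active["calls"].append(prompt)
    else:
        finished_traces.append({"title": None, "calls": [prompt]})


with toy_trace("support-agent-run"):
    record_call("Say hello")
    record_call("Say goodbye")

record_call("one-off request")  # no active trace: stored on its own
```

After this runs, finished_traces holds two entries: one titled trace with both grouped calls, and one untitled trace for the one-off request.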

What Gets Stored

  • Trace title if you explicitly pass it to start_trace("...")
  • Trace metadata if you explicitly pass it to start_trace(..., metadata=...)
  • Input payloads with system_prompt, user_prompt, and images
  • Output payloads as plain text
  • Token usage when available
  • Image inputs with inline previews for the trace UI

Notes

  • You do not need @trace() for ordinary LLM tracing.
  • A single instrumented OpenAI, Anthropic, or Gemini request creates a trace automatically.
  • For OpenAI-compatible providers like OpenRouter, prefer wrap(...) instead of provider-specific helpers.
  • For agentic or long-horizon workflows, wrap the run in start_trace("descriptive-title") so that planning, retries, and follow-up calls stay grouped in one trace.
  • Delivery is best-effort by default. Trace ingestion failures do not break your app.
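
The best-effort behaviour in the last note follows a common pattern: attempt delivery, log any failure, and never re-raise into application code. The sketch below illustrates that pattern in isolation. It is not freesolo's delivery code; deliver_best_effort and failing_send are hypothetical names.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger("toy_tracing")


def deliver_best_effort(
    send: Callable[[Dict[str, Any]], None],
    payload: Dict[str, Any],
) -> bool:
    """Try to send a trace payload; log and swallow any failure."""
    try:
        send(payload)
        return True
    except Exception:  # deliberately broad: tracing must not break the caller
        logger.warning("trace delivery failed; continuing", exc_info=True)
        return False


def failing_send(payload: Dict[str, Any]) -> None:
    """Simulate an unreachable ingestion endpoint."""
    raise ConnectionError("ingestion endpoint unreachable")


delivered = deliver_best_effort(failing_send, {"title": "support-agent-run"})
```

Here delivered is False and no exception reaches the caller, which is the property the note describes.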

Evaluations

freesolo also includes a small evaluation API for CI jobs, GitHub bots, and eval scripts. All evaluation runs require FREESOLO_API_KEY or an explicit api_key.

Evaluation data is a list of plain dictionaries. There is no separate Example class to construct.

Define scorers by subclassing CustomScorer and returning BinaryResponse or NumericResponse. Scorers run in your process, and Freesolo uploads the final results with your API key. Pass scorer objects, not strings.

from typing import Any

from freesolo.evaluation import BinaryResponse, CustomScorer, EvaluationClient


class ExactMatch(CustomScorer[BinaryResponse]):
    async def score(self, row: dict[str, Any]) -> BinaryResponse:
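        actual = str(row.get("actual_output", "")).strip()
        expected = str(row.get("expected_output", "")).strip()
        # An empty actual_output never counts as a match.
        matched = bool(actual) and actual == expected
        return BinaryResponse(
            value=matched,
            reason=(
                "actual_output matched expected_output"
                if matched
                else "actual_output did not match expected_output"
            ),
        )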


client = EvaluationClient()

results = client.run(
    name="support-agent-correctness",
    data=[
        {
            "input": "What is the capital of France?",
            "actual_output": "Paris",
            "expected_output": "Paris",
        }
    ],
    scorers=[ExactMatch()],
)

print(results[0].success)
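
Because evaluation data is a list of plain dictionaries, rows can come from any source that yields dictionaries. As one possible approach, the sketch below parses JSON Lines text into such a list. The helper name rows_from_jsonl is hypothetical and is not part of freesolo.

```python
import json
from typing import Any, Dict, List


def rows_from_jsonl(text: str) -> List[Dict[str, Any]]:
    """Parse one JSON object per line, skipping blank lines."""
    rows: List[Dict[str, Any]] = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        row = json.loads(line)
        if not isinstance(row, dict):
            raise ValueError(f"line {line_number}: expected a JSON object")
        rows.append(row)
    return rows


sample = (
    '{"input": "What is the capital of France?", '
    '"actual_output": "Paris", "expected_output": "Paris"}\n'
    "\n"
)
rows = rows_from_jsonl(sample)
```

The resulting rows list has the same shape as the data argument in the example above. To load a file, read its text first (for example with pathlib.Path(...).read_text()).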

Tinker Deployment

freesolo.utils.deployment is a thin proxy for the Modal deployment server. It posts a Tinker checkpoint URL to the pinned Modal /deployments endpoint and returns the server JSON response.

from freesolo.utils.deployment import deploy_tinker_checkpoint

result = deploy_tinker_checkpoint(
    "tinker://<run_id>/sampler_weights/final",
    base_model="Qwen/Qwen3.5-35B-A3B",
)

print(result["repoId"])

Environment-driven evaluations

For training contracts, you can use the same Environment adapter for evals, SFT, and GRPO. run_environment loads examples, builds prompt messages, calls your model callback, scores the response through the environment, and uploads the same scorers_data shape used by the eval DB.

from typing import Any

from openai import OpenAI

from freesolo.environments import (
    Environment,
    EnvironmentGeneration,
    RewardMetric,
    RewardResult,
    TaskExample,
)
from freesolo.evaluation import EvaluationClient


class ContractEnvironment(Environment):
    def build_prompt_messages(
        self,
        example: TaskExample,
        contract_text: str,
    ):
        return [
            {"role": "system", "content": contract_text},
            {"role": "user", "content": example.task},
        ]

    def score_response(
        self,
        example: TaskExample,
        response_text: str,
    ) -> RewardResult:
        passed = response_text.strip() == str(example.expected_output).strip()
        return RewardResult(
            name="exact_match",
            score=1.0 if passed else 0.0,
            success=passed,
            threshold=1.0,
            reason="matched expected output" if passed else "mismatch",
            return_type="binary",
            metrics=(
                RewardMetric(
                    name="canonical_match",
                    score=1.0 if passed else 0.0,
                    success=passed,
                    threshold=1.0,
                ),
            ),
        )


model = OpenAI()


def generate(messages: list[dict[str, str]], example: TaskExample):
    response = model.chat.completions.create(
        model="gpt-4.1-mini",
        messages=messages,
    )
    return EnvironmentGeneration(
        response_text=response.choices[0].message.content or "",
        total_tokens=response.usage.total_tokens if response.usage else None,
    )


results = EvaluationClient().run_environment(
    name="contract-eval",
    source="eval.jsonl",
    contract_path="TRAINING_CONTRACT.md",
    environment=ContractEnvironment(),
    generate=generate,
)
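
The flow described above (load examples, build prompt messages, call the model callback, score the response) can be pictured as a plain-Python loop. This is a toy model, not the implementation of run_environment. ToyExample, ToyEnvironment, run_toy_environment, and stub_generate are hypothetical names, the callback returns a bare string rather than an EnvironmentGeneration, and nothing is uploaded.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ToyExample:
    task: str
    expected_output: str


class ToyEnvironment:
    def build_prompt_messages(
        self, example: ToyExample, contract_text: str
    ) -> List[Dict[str, str]]:
        return [
            {"role": "system", "content": contract_text},
            {"role": "user", "content": example.task},
        ]

    def score_response(
        self, example: ToyExample, response_text: str
    ) -> Dict[str, object]:
        passed = response_text.strip() == example.expected_output.strip()
        return {
            "name": "exact_match",
            "score": 1.0 if passed else 0.0,
            "success": passed,
        }


def run_toy_environment(
    examples: List[ToyExample],
    contract_text: str,
    environment: ToyEnvironment,
    generate: Callable[[List[Dict[str, str]], ToyExample], str],
) -> List[Dict[str, object]]:
    """Build messages, call the model callback, and score each example."""
    scored = []
    for example in examples:
        messages = environment.build_prompt_messages(example, contract_text)
        response_text = generate(messages, example)
        scored.append(environment.score_response(example, response_text))
    return scored


def stub_generate(messages: List[Dict[str, str]], example: ToyExample) -> str:
    # Stand-in for a real model call: it only knows one answer.
    return "Paris" if "France" in messages[-1]["content"] else "unknown"


toy_results = run_toy_environment(
    [
        ToyExample("Capital of France?", "Paris"),
        ToyExample("Capital of Peru?", "Lima"),
    ],
    "Answer with a single word.",
    ToyEnvironment(),
    stub_generate,
)
```

The first example passes and the second fails. Since the README says the same adapter serves evals, SFT, and GRPO, the sketch keeps prompt building and scoring on the environment object, so the loop needs no task-specific logic.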

RewardResult is the top-level scorer entry stored in eval_tasks.scorers_data. Its fields are:

  • name: scorer name shown in the UI.
  • score: numeric reward value.
  • success: pass/fail. If omitted, Freesolo derives it from threshold, then from whether score > 0.
  • threshold, value, reason, error, return_type: scorer display and pass/fail context.
  • latency_ms, total_tokens: optional per-response usage metadata.
  • metadata: JSON object for scorer-specific details.
  • metrics: optional RewardMetric components, also JSON-only, with name, score, value, success, threshold, weight, reason, and metadata.
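
The fallback order for success can be sketched as a small function. This only illustrates the rule stated above and is not freesolo's code. The name derive_success is hypothetical, and treating a score equal to the threshold as a pass is an assumption.

```python
from typing import Optional


def derive_success(
    score: float,
    threshold: Optional[float] = None,
    success: Optional[bool] = None,
) -> bool:
    """Resolve pass/fail: explicit success, then threshold, then score > 0."""
    if success is not None:
        return success
    if threshold is not None:
        # Assumption: a score equal to the threshold counts as a pass.
        return score >= threshold
    return score > 0
```

For example, derive_success(0.5, threshold=0.8) is False, while derive_success(0.5) is True because no threshold is set and the score is positive.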

Custom scorer:

from typing import Any

from freesolo.evaluation import BinaryResponse, CustomScorer, EvaluationClient


class NoEmptyAnswer(CustomScorer[BinaryResponse]):
    async def score(self, row: dict[str, Any]) -> BinaryResponse:
        ok = bool(str(row.get("actual_output", "")).strip())
        reason = "actual_output is non-empty" if ok else "actual_output is empty"
        return BinaryResponse(value=ok, reason=reason)


results = EvaluationClient().run(
    name="support-agent-non-empty",
    data=[{"actual_output": "hello"}],
    scorers=[NoEmptyAnswer()],
)

LLM-as-judge is also a custom scorer. The scorer can call your judge model and return a NumericResponse; Freesolo stores the eval run and score output with your FREESOLO_API_KEY. This example uses OPENAI_API_KEY for the judge model call and FREESOLO_API_KEY for eval upload.

import json
from typing import Any

from openai import OpenAI

from freesolo import instrument_openai
from freesolo.evaluation import CustomScorer, EvaluationClient, NumericResponse


class CorrectnessJudge(CustomScorer[NumericResponse]):
    name = "correctness_llm_judge"
    threshold = 0.8

    def __init__(self, client: OpenAI) -> None:
        self.client = client

    async def score(self, row: dict[str, Any]) -> NumericResponse:
        response = self.client.responses.create(
            model="gpt-4.1-mini",
            instructions=(
                "Grade correctness from 0.0 to 1.0. "
                "Return JSON only: {\"score\": 0.0, \"reason\": \"...\"}"
            ),
            input=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "input_text",
                            "text": json.dumps(
                                {
                                    "input": row.get("input", ""),
                                    "actual_output": row.get("actual_output", ""),
                                    "expected_output": row.get("expected_output", ""),
                                }
                            ),
                        }
                    ],
                }
            ],
        )

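        # An empty reply falls back to "{}", so read the score with a default
        # instead of indexing, which would raise KeyError.
        parsed = json.loads(response.output_text or "{}")
        return NumericResponse(
            value=float(parsed.get("score", 0.0)),
            reason=str(parsed.get("reason", "")),
        )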


judge_client = instrument_openai(OpenAI())

results = EvaluationClient().run(
    name="support-agent-correctness",
    data=[
        {
            "input": "What is the capital of France?",
            "actual_output": "Paris is the capital of France.",
            "expected_output": "Paris",
        }
    ],
    scorers=[CorrectnessJudge(judge_client)],
)
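
Judge models do not always return valid JSON, so a scorer like the one above may want a defensive parsing step. The sketch below is one possible approach. The helper name parse_judge_output is hypothetical, and falling back to a score of 0.0 on unusable output is an assumption about what you want, not freesolo behaviour.

```python
import json
from typing import Tuple


def parse_judge_output(text: str) -> Tuple[float, str]:
    """Return (score, reason); fall back to 0.0 when the reply is unusable."""
    try:
        parsed = json.loads(text)
        score = float(parsed["score"])
    except (ValueError, KeyError, TypeError):
        # Covers invalid JSON, a missing "score" key, and non-object replies.
        return 0.0, "judge returned unparseable output"
    # Clamp to the 0.0 to 1.0 range that the judge prompt asks for.
    score = max(0.0, min(1.0, score))
    return score, str(parsed.get("reason", ""))
```

Inside a scorer, the two returned values map directly onto the value and reason arguments of NumericResponse.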

Hosted scorers are also available out of the box and use OpenRouter by default:

  • ReferenceCorrectnessScorer
  • RubricScorer
  • GroundednessScorer
  • InstructionFollowingScorer
  • PairwisePreferenceScorer

from freesolo.evaluation import HostedJudgeClient, ReferenceCorrectnessScorer

judge = HostedJudgeClient(api_key="YOUR_OPENROUTER_API_KEY")

scorer = ReferenceCorrectnessScorer(client=judge)

Tracing is available through namespaced helpers:

from freesolo.tracing import start_trace

with start_trace("support-agent-run"):
    ...
