Tracing, evaluation, and training utilities for LLM applications.

freesolo

freesolo is a Python tracing and evaluation package for LLM apps.

It is built for the lowest-friction integration possible:

  1. Install the package
  2. Set FREESOLO_API_KEY
  3. Configure the tracer
  4. Run traces and evaluations from the package APIs

Install

Install the package:

pip install freesolo

Environment

  • FREESOLO_API_KEY
  • FREESOLO_BASE_URL (optional, defaults to https://api.freesolo.co)

export FREESOLO_API_KEY=fslo_...
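
The two variables above can be resolved in application code before configuring anything else. The following is a minimal sketch, not part of the package: resolve_freesolo_settings is a hypothetical helper, and failing fast on a missing key is an assumption about how an application might want to behave.

```python
import os
from typing import Mapping, Optional

DEFAULT_BASE_URL = "https://api.freesolo.co"  # default stated in the list above


def resolve_freesolo_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical helper: read both variables and apply the documented default."""
    env = os.environ if env is None else env
    api_key = env.get("FREESOLO_API_KEY", "").strip()
    if not api_key:
        # Assumption: a missing key should stop the program early.
        raise RuntimeError("FREESOLO_API_KEY is not set")
    base_url = env.get("FREESOLO_BASE_URL", "").strip() or DEFAULT_BASE_URL
    # Drop a trailing slash so paths can be appended consistently.
    return {"api_key": api_key, "base_url": base_url.rstrip("/")}
```

Passing a plain dictionary instead of os.environ keeps the helper easy to test.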

Quickstart

from freesolo.tracing import configure_tracer, get_tracer

configure_tracer(service_name="my-llm-app")
tracer = get_tracer()

with tracer.start_as_current_span(
    "model.call",
    attributes={
        "gen_ai.system": "openai",
        "gen_ai.request.model": "gpt-5.5",
        "freesolo.input": {"prompt": "How do I reset my password?"},
    },
) as span:
    result = "Reset it from account settings."
    span.set_attribute("freesolo.output", result)

Runnable Examples

Copy-pasteable examples live in examples/:

  • tracing_manual_span.py: configure OpenTelemetry and send one application span.
  • evaluation_custom_scorer.py: run custom binary and numeric eval scorers.
  • evaluation_from_files.py: run evals from a concrete dataset and environment.
  • environment.py: example environment used by evals, training, and GEPA.
  • support_dataset.py: example dataset paths and loaders used by evals, SFT, GRPO, and GEPA.
  • gepa_prompt_example.py: run the Freesolo GEPA adapter over the example dataset.
  • training_sft_grpo.py: start SFT or GRPO training runs from package APIs.

From a repo checkout:

cd freesolo-sdk
export PYTHONPATH="$PWD/pypi"
uv run python examples/evaluation_custom_scorer.py --local

Public API

The root freesolo module intentionally exports no functions. Import from the subpackages below; lower-level modules may be importable, but they are implementation helpers unless they appear here or in an example.

Imports and their use cases:

  • freesolo.tracing.configure_tracer, get_tracer, force_flush, shutdown: send OpenTelemetry traces from an application to Freesolo.
  • freesolo.evaluation.EvaluationClient: run custom-scorer evals or environment evals and upload results to Freesolo.
  • freesolo.evaluation.run_local_evaluation: run custom scorers locally without uploading results.
  • freesolo.evaluation.CustomScorer, BinaryResponse, NumericResponse: define local scorer logic for eval rows.
  • freesolo.evaluation.HostedJudgeClient and hosted scorer classes: use hosted LLM-as-judge scorers with OpenRouter-compatible credentials.
  • freesolo.datasets.TaskExample, Dataset, load_dataset: load task examples and construct labeled conversations for evals or training.
  • freesolo.environments.Environment, RewardResult, RewardMetric, GrpoConfig, EnvironmentGeneration: define task behavior once for evals, GEPA, SFT, and GRPO.
  • freesolo.training.SftConfig, TrainGrpoOptions, train_sft, train_grpo: start SFT or GRPO training from package APIs.
  • freesolo.gepa.GEPASetup, GEPAConfig, DefaultReflectionAgent, attach_gepa, optimize_gepa: optimize prompts through the GEPA adapter using the same environment and dataset abstractions.
  • freesolo.contracts.load_contract_text, extract_contract_spec, load_contract_spec, build_oracle_messages: read contract markdown and build oracle prompt messages.
  • freesolo.utils.oracle.generate_ground_truth_records: generate ground-truth JSONL records from source examples using a contract, environment, and oracle model.
  • freesolo.utils.upload.upload_tinker_checkpoint_to_huggingface: upload a Tinker checkpoint to a private Hugging Face model repo.

What Gets Stored

  • Native OTLP traces and spans
  • Resource attributes like service.name
  • Span names, timings, parent span ids, status, and errors
  • Common model attributes such as gen_ai.system, gen_ai.request.model, and token counts
  • Optional freesolo.input and freesolo.output span attributes
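
As an illustration of the attributes listed above, here is a small sketch that assembles them for one model call. model_call_attributes is a hypothetical helper, not a package API. The token-count keys follow the OpenTelemetry GenAI semantic conventions; whether Freesolo reads those exact keys is an assumption.

```python
from typing import Any, Optional


def model_call_attributes(
    system: str,
    model: str,
    prompt: str,
    input_tokens: Optional[int] = None,
    output_tokens: Optional[int] = None,
) -> dict[str, Any]:
    """Hypothetical helper: build the attribute dict for a model-call span."""
    attributes: dict[str, Any] = {
        "gen_ai.system": system,
        "gen_ai.request.model": model,
        "freesolo.input": {"prompt": prompt},  # same shape as the Quickstart
    }
    # Assumed key names, taken from the OpenTelemetry GenAI conventions.
    if input_tokens is not None:
        attributes["gen_ai.usage.input_tokens"] = input_tokens
    if output_tokens is not None:
        attributes["gen_ai.usage.output_tokens"] = output_tokens
    return attributes
```

The returned dictionary can be passed as attributes= to tracer.start_as_current_span, as in the Quickstart.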

Notes

  • Tracing uses native OpenTelemetry protobuf export to /api/traces/ingest.
  • Configure third-party OpenTelemetry instrumentors against the provider returned by configure_tracer(...).
  • Delivery is handled by the OpenTelemetry span processor you configure.

Evaluations

freesolo also includes a small evaluation API for CI jobs, GitHub bots, and eval scripts. All evaluation runs require FREESOLO_API_KEY or an explicit api_key.

Evaluation data is a list of plain dictionaries. There is no separate Example class to construct.
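
Because rows are plain dictionaries, they can come from any source. The sketch below reads them from JSONL text using only the standard library. rows_from_jsonl is a hypothetical helper, and the one-object-per-line file layout is an assumption rather than a package requirement.

```python
import json
from typing import Any, Iterable


def rows_from_jsonl(lines: Iterable[str]) -> list[dict[str, Any]]:
    """Hypothetical helper: turn JSONL lines into a list of plain dict rows."""
    rows: list[dict[str, Any]] = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        if not isinstance(row, dict):
            raise ValueError(f"line {number}: expected a JSON object")
        rows.append(row)
    return rows
```

An open file handle works as the lines argument, and the resulting list can be passed as data= to an evaluation run.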

Define scorers by subclassing CustomScorer and returning BinaryResponse or NumericResponse. Scorers run in your process, and Freesolo uploads the final results with your API key. Pass scorer objects, not strings.

from typing import Any

from freesolo.evaluation import BinaryResponse, CustomScorer, EvaluationClient


class ExactMatch(CustomScorer[BinaryResponse]):
    async def score(self, row: dict[str, Any]) -> BinaryResponse:
        actual = str(row.get("actual_output", "")).strip()
        expected = str(row.get("expected_output", "")).strip()
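        matched = actual == expected and bool(actual)
        return BinaryResponse(
            value=matched,
            # Report the outcome that actually occurred, not only the passing case.
            reason=(
                "actual_output matched expected_output"
                if matched
                else "actual_output did not match expected_output"
            ),
        )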


client = EvaluationClient()

results = client.run(
    name="support-agent-correctness",
    data=[
        {
            "input": "What is the capital of France?",
            "actual_output": "Paris",
            "expected_output": "Paris",
        }
    ],
    scorers=[ExactMatch()],
)

print(results[0].success)
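
For CI jobs it is often useful to reduce a run to a single number. The following sketch assumes only that each result exposes a success attribute, as the print call above suggests. pass_rate is a hypothetical helper, and the SimpleNamespace objects are stand-ins for real results.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def pass_rate(results: Iterable[Any]) -> float:
    """Hypothetical helper: fraction of results whose success is truthy."""
    items = list(results)
    if not items:
        return 0.0  # assumption: an empty run counts as 0.0 rather than an error
    passed = sum(1 for item in items if getattr(item, "success", False))
    return passed / len(items)


# Stand-in objects; real results would come from client.run(...).
fake_results = [SimpleNamespace(success=True), SimpleNamespace(success=False)]
print(pass_rate(fake_results))  # 0.5
```

A CI script could compare this value against a minimum and exit non-zero when it falls below.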

Tinker Hugging Face Upload

freesolo.utils.upload posts a Tinker checkpoint URL to the Freesolo upload service and returns the Hugging Face upload response.

from freesolo.utils.upload import upload_tinker_checkpoint_to_huggingface

result = upload_tinker_checkpoint_to_huggingface(
    "tinker://<run_id>/sampler_weights/final",
    base_model="Qwen/Qwen3.5-35B-A3B",
)

print(result["repoId"])

Environment-driven evaluations

For training contracts, Environment describes task behavior for evals and GRPO/RL: prompt construction, response normalization, and reward scoring. Dataset loading and labeled conversation construction live in freesolo.datasets. run_environment loads task examples, calls your model callback, scores each response through the environment, and uploads the results in the same scorers_data shape used by the eval DB.

from typing import Any

from openai import OpenAI

from freesolo.datasets import TaskExample
from freesolo.environments import (
    Environment,
    EnvironmentGeneration,
    RewardMetric,
    RewardResult,
)
from freesolo.evaluation import EvaluationClient


class PromptEnvironment(Environment):
    def build_prompt_messages(
        self,
        example: TaskExample,
        prompt_text: str,
    ):
        return [
            {"role": "system", "content": prompt_text},
            {"role": "user", "content": example.task},
        ]

    def score_response(
        self,
        example: TaskExample,
        response_text: str,
    ) -> RewardResult:
        passed = response_text.strip() == str(example.expected_output).strip()
        return RewardResult(
            name="exact_match",
            score=1.0 if passed else 0.0,
            success=passed,
            threshold=1.0,
            reason="matched expected output" if passed else "mismatch",
            return_type="binary",
            metrics=(
                RewardMetric(
                    name="canonical_match",
                    score=1.0 if passed else 0.0,
                    success=passed,
                    threshold=1.0,
                ),
            ),
        )


model = OpenAI()


def generate(messages: list[dict[str, str]], example: TaskExample):
    response = model.chat.completions.create(
        model="gpt-4.1-mini",
        messages=messages,
    )
    return EnvironmentGeneration(
        response_text=response.choices[0].message.content or "",
        total_tokens=response.usage.total_tokens if response.usage else None,
    )


results = EvaluationClient().run_environment(
    name="contract-eval",
    source="eval.jsonl",
    contract_path="TRAINING_CONTRACT.md",
    environment=PromptEnvironment(),
    generate=generate,
)
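
To check prompt and scoring logic without a model call or an upload, the same flow can be imitated offline. This is a conceptual sketch only, not how run_environment is implemented. Example, run_examples, and stub_generate are hypothetical stand-ins that use no freesolo imports.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:  # stand-in for TaskExample, with only the fields used here
    task: str
    expected_output: str


def run_examples(
    examples: list[Example],
    prompt_text: str,
    generate: Callable[[list[dict[str, str]], Example], str],
) -> list[dict]:
    """Build messages, call the model callback, then score by exact match."""
    records = []
    for example in examples:
        messages = [
            {"role": "system", "content": prompt_text},
            {"role": "user", "content": example.task},
        ]
        response_text = generate(messages, example)
        passed = response_text.strip() == example.expected_output.strip()
        records.append(
            {"name": "exact_match", "score": 1.0 if passed else 0.0, "success": passed}
        )
    return records


def stub_generate(messages: list[dict[str, str]], example: Example) -> str:
    # Offline stub: always answers "Paris" instead of calling a model.
    return "Paris"
```

Swapping stub_generate for a real model callback leaves the rest of the loop unchanged.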

RewardResult is the top-level scorer entry stored in eval_tasks.scorers_data. Its fields are:

  • name: scorer name shown in the UI.
  • score: numeric reward value.
  • success: pass/fail. If omitted, Freesolo derives it from threshold, then from whether score > 0.
  • threshold, value, reason, error, return_type: scorer display and pass/fail context.
  • latency_ms, total_tokens: optional per-response usage metadata.
  • metadata: JSON object for scorer-specific details.
  • metrics: optional RewardMetric components, also JSON-only, with name, score, value, success, threshold, weight, reason, and metadata.
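
The fallback order described for success can be sketched as follows. derive_success is a hypothetical function written for illustration. The inclusive comparison score >= threshold is an assumption, since the text above does not say whether the threshold itself passes.

```python
from typing import Optional


def derive_success(
    score: float,
    threshold: Optional[float] = None,
    success: Optional[bool] = None,
) -> bool:
    """Illustrative fallback: explicit success, then threshold, then score > 0."""
    if success is not None:
        return success
    if threshold is not None:
        return score >= threshold  # assumption: meeting the threshold passes
    return score > 0
```

Under these assumptions, a score of 0.5 passes with no threshold but fails against a threshold of 0.8.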

Custom scorer:

from typing import Any

from freesolo.evaluation import BinaryResponse, CustomScorer, EvaluationClient


class NoEmptyAnswer(CustomScorer[BinaryResponse]):
    async def score(self, row: dict[str, Any]) -> BinaryResponse:
        ok = bool(str(row.get("actual_output", "")).strip())
        return BinaryResponse(
            value=ok,
            reason="actual_output is non-empty" if ok else "actual_output is empty",
        )


results = EvaluationClient().run(
    name="support-agent-non-empty",
    data=[{"actual_output": "hello"}],
    scorers=[NoEmptyAnswer()],
)

LLM-as-judge is also a custom scorer. The scorer can call your judge model and return a NumericResponse; Freesolo stores the eval run and score output with your FREESOLO_API_KEY. This example uses OPENAI_API_KEY for the judge model call and FREESOLO_API_KEY for eval upload.

import json
from typing import Any

from openai import OpenAI

from freesolo.evaluation import CustomScorer, EvaluationClient, NumericResponse


class CorrectnessJudge(CustomScorer[NumericResponse]):
    name = "correctness_llm_judge"
    threshold = 0.8

    def __init__(self, client: OpenAI) -> None:
        self.client = client

    async def score(self, row: dict[str, Any]) -> NumericResponse:
        response = self.client.responses.create(
            model="gpt-4.1-mini",
            instructions=(
                "Grade correctness from 0.0 to 1.0. "
                "Return JSON only: {\"score\": 0.0, \"reason\": \"...\"}"
            ),
            input=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "input_text",
                            "text": json.dumps(
                                {
                                    "input": row.get("input", ""),
                                    "actual_output": row.get("actual_output", ""),
                                    "expected_output": row.get("expected_output", ""),
                                }
                            ),
                        }
                    ],
                }
            ],
        )

        parsed = json.loads(response.output_text or "{}")
        return NumericResponse(
            value=float(parsed["score"]),
            reason=str(parsed.get("reason", "")),
        )


judge_client = OpenAI()

results = EvaluationClient().run(
    name="support-agent-correctness",
    data=[
        {
            "input": "What is the capital of France?",
            "actual_output": "Paris is the capital of France.",
            "expected_output": "Paris",
        }
    ],
    scorers=[CorrectnessJudge(judge_client)],
)
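
Judge models do not always return well-formed JSON, so a scorer like the one above may want a defensive parsing step. The sketch below is one possible approach using only the standard library. parse_judge_output is a hypothetical helper, and mapping every malformed reply to a score of 0.0 is a design assumption, not package behavior.

```python
import json


def parse_judge_output(text: str) -> tuple[float, str]:
    """Hypothetical helper: extract (score, reason) from a judge's JSON reply."""
    try:
        parsed = json.loads(text or "{}")
    except json.JSONDecodeError:
        return 0.0, "judge returned invalid JSON"
    if not isinstance(parsed, dict) or "score" not in parsed:
        return 0.0, "judge response had no score"
    try:
        score = float(parsed["score"])
    except (TypeError, ValueError):
        return 0.0, "judge score was not numeric"
    # Clamp to the 0.0-1.0 range requested in the judge instructions.
    score = min(1.0, max(0.0, score))
    return score, str(parsed.get("reason", ""))
```

The returned pair maps directly onto the value and reason arguments of NumericResponse.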

Hosted scorers are also available out of the box and use OpenRouter by default:

  • ReferenceCorrectnessScorer
  • RubricScorer
  • GroundednessScorer
  • InstructionFollowingScorer
  • PairwisePreferenceScorer

from freesolo.evaluation import HostedJudgeClient, ReferenceCorrectnessScorer

judge = HostedJudgeClient(api_key="YOUR_OPENROUTER_API_KEY")

scorer = ReferenceCorrectnessScorer(client=judge)

Tracing is available through the OpenTelemetry helpers in freesolo.tracing.
