sik-llm-eval

sik-llm-eval is a simple, yet flexible, framework primarily designed for evaluating Large Language Models (LLMs) on custom use cases.

This framework lets you easily create test cases, ranging from simple matching/regex checks to tests that extract Python code blocks from a response, execute them, and report the percentage of blocks that run successfully. You can also write your own custom tests.

Get started with examples found in the examples folder.

sik-llm-eval is a fork of anaconda/llm-eval. I was the original author and principal contributor to the initial codebase while it was developed at Anaconda (last commit on June 12, 2025).


Using sik-llm-eval

In this framework, there are two fundamental concepts:

  • Eval: An Eval represents a single test scenario. Each Eval defines an input to an LLM or agent, along with "checks" that evaluate the agent's response against the criteria specified in each check. Users can also create custom checks.
  • Candidate: A Candidate is a lightweight wrapper around an LLM or agent that standardizes the agent's inputs and outputs to match those of the Eval. In other words, different models might expect inputs (and return responses) in different formats; a Candidate acts as an adapter for those models so that Evals can be defined in one format, regardless of the various formats expected by various models.
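The adapter idea behind a Candidate can be illustrated with a conceptual sketch (the names `toy_model` and `toy_candidate` are made up for illustration, not sik-llm-eval's actual classes): the Eval-side format stays fixed while the adapter translates to whatever the underlying model expects.

```python
# Conceptual sketch of the Candidate idea; not sik-llm-eval's actual API.

# Hypothetical model client that expects a single prompt string.
def toy_model(prompt: str) -> str:
    return f"echo: {prompt}"

# A Candidate-style adapter: accepts Eval-format messages
# (a list of {"role", "content"} dicts) and returns a plain string,
# regardless of what the underlying model expects.
def toy_candidate(messages: list[dict]) -> str:
    prompt = "\n".join(m["content"] for m in messages if m["role"] == "user")
    return toy_model(prompt)

print(toy_candidate([{"role": "user", "content": "hello"}]))  # echo: hello
```

A different model client (say, one taking keyword arguments) would get its own adapter, while Evals remain unchanged.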

Examples

Running Evals/Candidates from YAML files

You can define Candidates and Evals using YAML files. Here's an example YAML file for a ChatGPT Candidate:

model: gpt-4o-mini
candidate_type: OPENAI
metadata:
  name: OpenAI GPT-4o-mini
parameters:
  temperature: 0.01
  max_tokens: 4096
  seed: 42

Here's an example of a YAML file that defines an Eval, focusing on generating a Fibonacci sequence function and corresponding assertion statements:

metadata:
  name: Fibonacci Sequence
input:
  - role: user
    content: Create a Python function called `fib` that takes an integer `n` and returns the `n`th number in the Fibonacci sequence. Use type hints and docstrings.
checks:
  - check_type: REGEX
    pattern: "def fib\\([a-zA-Z_]+\\: int\\) -> int\\:"
  - check_type: PYTHON_CODE_BLOCK_TESTS
    code_setup: import re
    code_tests:
    - |
      def verify_fib_returns_expected_value(code_blocks: list[str]) -> bool:
          return fib(10) == 55
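The REGEX check above simply searches the raw response text for the pattern; you can try the pattern directly in plain Python:

```python
import re

# The same pattern as in the Eval YAML above.
pattern = r"def fib\([a-zA-Z_]+\: int\) -> int\:"

print(bool(re.search(pattern, "def fib(n: int) -> int:")))  # True (type-hinted signature)
print(bool(re.search(pattern, "def fib(n):")))              # False (no type hints)
```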

The Eval above defines two types of checks, including a PYTHON_CODE_BLOCK_TESTS check, which executes the code blocks generated by the LLM in an isolated environment and tracks how many of them execute successfully. It also lets you define custom tests that directly exercise the variables and functions created by the code blocks (in the same isolated environment).
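The mechanics can be sketched as follows (a conceptual illustration, not sik-llm-eval's actual implementation): extract fenced code blocks from the response, exec them in a shared namespace, then run a verification function against the names they defined.

```python
# Conceptual sketch of a PYTHON_CODE_BLOCK_TESTS-style check;
# not sik-llm-eval's actual implementation.
import re

FENCE = "`" * 3  # avoids literal nested fences in this example
response = (
    "Here is the function:\n\n"
    f"{FENCE}python\n"
    "def fib(n: int) -> int:\n"
    '    """Return the nth Fibonacci number."""\n'
    "    a, b = 0, 1\n"
    "    for _ in range(n):\n"
    "        a, b = b, a + b\n"
    "    return a\n"
    f"{FENCE}\n"
)

# Extract the fenced Python code blocks from the response text.
code_blocks = re.findall(rf"{FENCE}python\n(.*?){FENCE}", response, re.DOTALL)

# Execute all blocks in one shared namespace (the "isolated environment").
namespace = {}
for block in code_blocks:
    exec(block, namespace)

# A custom test can now call the function the code blocks defined.
print(namespace["fib"](10) == 55)  # True
```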

The following code loads the Eval and Candidate from above (with the addition of a ChatGPT 4.0 Candidate and an Eval testing the creation of a "mask_emails" function). You can then run these Evals against Candidates with an EvalHarness:

from sik_llm_eval.eval import EvalHarness

eval_harness = EvalHarness()
eval_harness.add_eval_from_yaml('examples/evals/simple_example.yaml')
eval_harness.add_eval_from_yaml('examples/evals/mask_emails.yaml')
eval_harness.add_candidate_from_yaml('examples/candidates/openai_4o-mini.yaml')
eval_harness.add_candidate_from_yaml('examples/candidates/openai_4.0.yaml')
results = eval_harness()

for candidate_results in results:
    for result in candidate_results:
        print(f"Num Checks: {result.num_checks}")
        print(f"Num Passed: {result.num_successful_checks}")
        print(f"Percent Passed: {result.perc_successful_checks:.1%}")
        print(result.response)

results contains a list of lists of EvalResults. Each item in the outer list corresponds to a single Candidate and contains a list of EvalResults for all Evals run against the Candidate. In our example, results is [[EvalResult, EvalResult], [EvalResult, EvalResult]] where the first list corresponds to results of the Evals associated with the first Candidate (ChatGPT 4o-mini) and the second list corresponds to results of the Evals associated with the second Candidate (ChatGPT 4.0).
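The nested structure is convenient to summarize per Candidate. Here's a sketch using simple stand-in objects in place of real EvalResult instances (the attribute names mirror those shown above; the numbers are made up):

```python
from dataclasses import dataclass

@dataclass
class FakeResult:  # hypothetical stand-in for sik_llm_eval's EvalResult
    num_checks: int
    num_successful_checks: int

# Same shape as the harness output: one inner list per Candidate.
results = [
    [FakeResult(2, 2), FakeResult(3, 1)],  # candidate 0
    [FakeResult(2, 1), FakeResult(3, 3)],  # candidate 1
]

for i, candidate_results in enumerate(results):
    total = sum(r.num_checks for r in candidate_results)
    passed = sum(r.num_successful_checks for r in candidate_results)
    print(f"candidate {i}: {passed}/{total} checks passed ({passed / total:.1%})")
```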

Note that you can load multiple YAML files in a directory using add_evals_from_yamls and add_candidates_from_yamls:

...
eval_harness.add_evals_from_yamls('examples/evals/*.yaml')
eval_harness.add_candidates_from_yamls('examples/candidates/*.yaml')
...

Installing

uv add sik-llm-eval or pip install sik-llm-eval

Environment Variables

The following environment variables are required for using the built-in OpenAI and Anthropic Candidates:

  • OPENAI_API_KEY: This environment variable and API key are required for using OpenAIChat and OpenAICandidate.
  • ANTHROPIC_API_KEY: This environment variable and API key are required for using AnthropicChat and AnthropicCandidate.
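A quick preflight check (generic Python, not part of the library) can surface a missing key before any Evals run:

```python
import os

# Adjust to the providers you actually use.
required = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")
missing = [name for name in required if not os.environ.get(name)]
print("missing:", missing)  # empty list means all required keys are set
```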

Contributing

If you would like to contribute to sik-llm-eval, please fork the repository and submit a pull request.

See Makefile for building environment and running tests.
