sik-llm-eval
sik-llm-eval is a simple, yet flexible, framework primarily designed for evaluating Large Language Models (LLMs) on custom use cases.
This framework allows you to easily create test cases, ranging from simple tests based on matching/regex, to tests that extract and execute the Python code blocks generated in a response and determine the percentage of code blocks that successfully execute. It also allows you to create your own custom tests.
Get started with examples found in the examples folder.
sik-llm-eval is a fork of anaconda/llm-eval. I was the original author and principal contributor to the initial codebase while it was developed at Anaconda (last commit on June 12, 2025).
Using sik-llm-eval
In this framework, there are two fundamental concepts:
- Eval: An Eval represents a single test scenario. Each Eval defines an input to an LLM or agent, and "checks" which evaluate the agent's response against the criteria specified in the check. Users can also create custom checks.
- Candidate: A Candidate is a lightweight wrapper around an LLM or agent that is used to standardize the inputs and outputs of the agent with the inputs and outputs associated with the Eval. In other words, different models might expect inputs to be formatted in different ways (and might return responses formatted in different ways); a Candidate is an adapter for those models so that Evals can be defined in one format, regardless of the formats expected by the various models.
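The relationship between the two concepts can be sketched in plain Python. Note that the class and function names below are illustrative stand-ins, not the library's actual API:

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical stand-ins for the framework's two core concepts (illustration only).

@dataclass
class Eval:
    """A single test scenario: an input plus one or more checks."""
    input: str
    checks: list[Callable[[str], bool]] = field(default_factory=list)

@dataclass
class Candidate:
    """Adapts a model-specific callable to the Eval's standard input format."""
    model_fn: Callable[[str], str]

    def __call__(self, prompt: str) -> str:
        return self.model_fn(prompt)

def run(eval_: Eval, candidate: Candidate) -> list[bool]:
    """Send the Eval's input to the Candidate; apply each check to the response."""
    response = candidate(eval_.input)
    return [check(response) for check in eval_.checks]

# A trivial "model" that returns a canned answer.
candidate = Candidate(model_fn=lambda prompt: "def fib(n: int) -> int: ...")
eval_ = Eval(input="Write a fib function.", checks=[lambda r: "def fib" in r])
print(run(eval_, candidate))  # [True]
```

The point of the Candidate layer is that `run` never needs to know which model it is talking to; swapping models means swapping the wrapper, not the Evals.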
Examples
Running Evals/Candidates from YAML files
You can define Candidates and Evals using YAML files. Here's an example YAML file for a ChatGPT Candidate:
model: gpt-4o-mini
candidate_type: OPENAI
metadata:
name: OpenAI GPT-4o-mini
parameters:
temperature: 0.01
max_tokens: 4096
seed: 42
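Conceptually, a Candidate definition like the one above supplies the model name and call parameters for each request. The sketch below shows one plausible way such a parsed config could be merged into keyword arguments for an OpenAI-style chat-completion call; the forwarding logic is an assumption for illustration, not the framework's actual code:

```python
# Mirrors the YAML above as a parsed Python dict.
candidate_config = {
    "model": "gpt-4o-mini",
    "candidate_type": "OPENAI",
    "metadata": {"name": "OpenAI GPT-4o-mini"},
    "parameters": {"temperature": 0.01, "max_tokens": 4096, "seed": 42},
}

def build_request(config: dict, messages: list[dict]) -> dict:
    """Merge the model name and parameters into kwargs for a chat-completion call.

    Hypothetical helper: shows how `parameters` could be passed through verbatim.
    """
    return {"model": config["model"], "messages": messages, **config["parameters"]}

request = build_request(candidate_config, [{"role": "user", "content": "Hello"}])
print(request["model"], request["temperature"])  # gpt-4o-mini 0.01
```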
Here's an example of a YAML file that defines an Eval, focusing on generating a Fibonacci sequence function and corresponding assertion statements:
metadata:
name: Fibonacci Sequence
input:
- role: user
content: Create a Python function called `fib` that takes an integer `n` and returns the `n`th number in the Fibonacci sequence. Use type hints and docstrings.
checks:
- check_type: REGEX
pattern: "def fib\\([a-zA-Z_]+\\: int\\) -> int\\:"
- check_type: PYTHON_CODE_BLOCK_TESTS
code_setup: import re
code_tests:
- |
  def verify_fib_returns_55_for_n_10(code_blocks: list[str]) -> bool:
      return fib(10) == 55
The Eval above defines various types of checks, including a PYTHON_CODE_BLOCK_TESTS check, which executes the code blocks generated by the LLM in an isolated environment and tracks the number of code blocks that successfully execute. It also allows you to define custom tests to directly test the variables or functions created by the code blocks (in the same isolated environment).
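A minimal sketch of what such a check does conceptually (this is not the library's implementation): extract the fenced Python blocks from the response, execute each in a shared namespace, and report the fraction that run without raising:

```python
import re

def extract_code_blocks(response: str) -> list[str]:
    """Pull the contents of ```python fenced blocks out of a markdown response."""
    return re.findall(r"```python\n(.*?)```", response, flags=re.DOTALL)

def run_code_blocks(code_blocks: list[str]) -> float:
    """Execute each block in a shared namespace; return the fraction that succeed."""
    namespace: dict = {}
    successes = 0
    for block in code_blocks:
        try:
            exec(block, namespace)  # isolated from the caller, shared across blocks
            successes += 1
        except Exception:
            pass
    return successes / len(code_blocks) if code_blocks else 0.0

response = (
    "Here you go:\n"
    "```python\n"
    "def fib(n: int) -> int:\n"
    "    return n if n < 2 else fib(n - 1) + fib(n - 2)\n"
    "```"
)
blocks = extract_code_blocks(response)
print(run_code_blocks(blocks))  # 1.0
```

Because the blocks share one namespace, custom tests like the one in the YAML above can call functions (e.g. `fib`) that earlier blocks defined.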
The following code loads the Eval and Candidate from above (with the addition of a ChatGPT 4.0 Candidate and an Eval testing the creation of a "mask_emails" function). You can then run these Evals against Candidates with an EvalHarness:
from sik_llm_eval.eval import EvalHarness
eval_harness = EvalHarness()
eval_harness.add_eval_from_yaml('examples/evals/simple_example.yaml')
eval_harness.add_eval_from_yaml('examples/evals/mask_emails.yaml')
eval_harness.add_candidate_from_yaml('examples/candidates/openai_4o-mini.yaml')
eval_harness.add_candidate_from_yaml('examples/candidates/openai_4.0.yaml')
results = eval_harness()
result = results[0][0]  # first Candidate, first Eval
print(f"Num Checks: {result.num_checks}")
print(f"Num Passed: {result.num_successful_checks}")
print(f"Percent Passed: {result.perc_successful_checks:.1%}")
print(result.response)
results contains a list of lists of EvalResults. Each item in the outer list corresponds to a single Candidate and contains a list of EvalResults for all Evals run against the Candidate. In our example, results is [[EvalResult, EvalResult], [EvalResult, EvalResult]] where the first list corresponds to results of the Evals associated with the first Candidate (ChatGPT 4o-mini) and the second list corresponds to results of the Evals associated with the second Candidate (ChatGPT 4.0).
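Summarizing that nested structure might look like the loop below. The `num_checks` and `num_successful_checks` attributes come from the snippet above; the stand-in class exists only so the sketch runs on its own:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:  # stand-in with just the attributes used here
    num_checks: int
    num_successful_checks: int

# results[i][j] = j-th Eval's result for the i-th Candidate.
results = [
    [EvalResult(3, 3), EvalResult(2, 1)],  # first Candidate
    [EvalResult(3, 2), EvalResult(2, 2)],  # second Candidate
]

for i, candidate_results in enumerate(results):
    passed = sum(r.num_successful_checks for r in candidate_results)
    total = sum(r.num_checks for r in candidate_results)
    print(f"Candidate {i}: {passed}/{total} checks passed")
```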
Note that you can load multiple YAML files in a directory using add_evals_from_yamls and add_candidates_from_yamls:
...
eval_harness.add_evals_from_yamls('examples/evals/*.yaml')
eval_harness.add_candidates_from_yamls('examples/candidates/*.yaml')
...
Installing
uv add sik-llm-eval or pip install sik-llm-eval
Environment Variables
The following environment variables are required for using the built-in OpenAI and Anthropic Candidates:
- OPENAI_API_KEY: This environment variable and API key are required for using OpenAIChat and OpenAICandidate.
- ANTHROPIC_API_KEY: This environment variable and API key are required for using AnthropicChat and AnthropicCandidate.
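For example, they can be set in your shell before running Evals (the values below are placeholders):

```shell
# Replace the placeholder values with your actual API keys.
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
```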
Contributing
If you would like to contribute to sik-llm-eval, please fork the repository and submit a pull request.
See Makefile for building environment and running tests.