sik-llm-eval

sik-llm-eval is a simple, yet flexible, framework primarily designed for evaluating Large Language Models (LLMs) on custom use cases.

This framework lets you easily create test cases, ranging from simple matching/regex checks to tests that extract Python code blocks from a response, execute them, and report the percentage of blocks that run successfully. You can also write your own custom tests.

Get started with examples found in the examples folder.

sik-llm-eval is a fork of anaconda/llm-eval. I was the original author and principal contributor to the initial codebase while it was developed at Anaconda (last commit on June 12, 2025).


Using sik-llm-eval

In this framework, there are two fundamental concepts:

  • Eval: An Eval represents a single test scenario. Each Eval defines an input to an LLM or agent, along with "checks" that evaluate the agent's response against the criteria specified in each check. Users can also create custom checks.
  • Candidate: A Candidate is a lightweight wrapper around an LLM or agent that standardizes the agent's inputs and outputs to match those of the Eval. In other words, different models might expect inputs (and return responses) in different formats; a Candidate acts as an adapter for those models so that Evals can be defined in one format, regardless of the various formats expected by various models.
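The adapter idea behind a Candidate can be illustrated with a conceptual sketch (the names `toy_model` and `toy_candidate` are made up for illustration, not sik-llm-eval's actual classes): the Eval-side format stays fixed while the adapter translates to whatever the underlying model expects.

```python
# Conceptual sketch of the Candidate idea; not sik-llm-eval's actual API.

# Hypothetical model client that expects a single prompt string.
def toy_model(prompt: str) -> str:
    return f"echo: {prompt}"

# A Candidate-style adapter: accepts Eval-format messages
# (a list of {"role", "content"} dicts) and returns a plain string,
# regardless of what the underlying model expects.
def toy_candidate(messages: list[dict]) -> str:
    prompt = "\n".join(m["content"] for m in messages if m["role"] == "user")
    return toy_model(prompt)

print(toy_candidate([{"role": "user", "content": "hello"}]))  # echo: hello
```

A different model client (say, one taking keyword arguments) would get its own adapter, while Evals remain unchanged.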

Examples

Running Evals/Candidates from YAML files

You can define Candidates and Evals using YAML files. Here's an example YAML file for a ChatGPT Candidate:

model: gpt-4o-mini
candidate_type: OPENAI
metadata:
  name: OpenAI GPT-4o-mini
parameters:
  temperature: 0.01
  max_tokens: 4096
  seed: 42

Here's an example of a YAML file that defines an Eval, focusing on generating a Fibonacci sequence function and corresponding assertion statements:

metadata:
  name: Fibonacci Sequence
input:
  - role: user
    content: Create a Python function called `fib` that takes an integer `n` and returns the `n`th number in the Fibonacci sequence. Use type hints and docstrings.
checks:
  - check_type: REGEX
    pattern: "def fib\\([a-zA-Z_]+\\: int\\) -> int\\:"
  - check_type: PYTHON_CODE_BLOCK_TESTS
    code_setup: import re
    code_tests:
    - |
      def verify_fib_returns_expected_value(code_blocks: list[str]) -> bool:
          return fib(10) == 55
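The REGEX check above simply searches the raw response text for the pattern; you can try the pattern directly in plain Python:

```python
import re

# The same pattern as in the Eval YAML above.
pattern = r"def fib\([a-zA-Z_]+\: int\) -> int\:"

print(bool(re.search(pattern, "def fib(n: int) -> int:")))  # True (type-hinted signature)
print(bool(re.search(pattern, "def fib(n):")))              # False (no type hints)
```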

The Eval above defines two types of checks, including a PYTHON_CODE_BLOCK_TESTS check, which executes the code blocks generated by the LLM in an isolated environment and tracks how many of them execute successfully. It also lets you define custom tests that directly exercise the variables and functions created by the code blocks (in the same isolated environment).
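The mechanics can be sketched as follows (a conceptual illustration, not sik-llm-eval's actual implementation): extract fenced code blocks from the response, exec them in a shared namespace, then run a verification function against the names they defined.

```python
# Conceptual sketch of a PYTHON_CODE_BLOCK_TESTS-style check;
# not sik-llm-eval's actual implementation.
import re

FENCE = "`" * 3  # avoids literal nested fences in this example
response = (
    "Here is the function:\n\n"
    f"{FENCE}python\n"
    "def fib(n: int) -> int:\n"
    '    """Return the nth Fibonacci number."""\n'
    "    a, b = 0, 1\n"
    "    for _ in range(n):\n"
    "        a, b = b, a + b\n"
    "    return a\n"
    f"{FENCE}\n"
)

# Extract the fenced Python code blocks from the response text.
code_blocks = re.findall(rf"{FENCE}python\n(.*?){FENCE}", response, re.DOTALL)

# Execute all blocks in one shared namespace (the "isolated environment").
namespace = {}
for block in code_blocks:
    exec(block, namespace)

# A custom test can now call the function the code blocks defined.
print(namespace["fib"](10) == 55)  # True
```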

The following code loads the Eval and Candidate from above (with the addition of a ChatGPT 4.0 Candidate and an Eval testing the creation of a "mask_emails" function). You can then run these Evals against Candidates with an EvalHarness:

from sik_llm_eval.eval import EvalHarness

eval_harness = EvalHarness()
eval_harness.add_eval_from_yaml('examples/evals/simple_example.yaml')
eval_harness.add_eval_from_yaml('examples/evals/mask_emails.yaml')
eval_harness.add_candidate_from_yaml('examples/candidates/openai_4o-mini.yaml')
eval_harness.add_candidate_from_yaml('examples/candidates/openai_4.0.yaml')
results = eval_harness()

for candidate_results in results:
    for result in candidate_results:
        print(f"Num Checks: {result.num_checks}")
        print(f"Num Passed: {result.num_successful_checks}")
        print(f"Percent Passed: {result.perc_successful_checks:.1%}")
        print(result.response)

results contains a list of lists of EvalResults. Each item in the outer list corresponds to a single Candidate and contains a list of EvalResults for all Evals run against the Candidate. In our example, results is [[EvalResult, EvalResult], [EvalResult, EvalResult]] where the first list corresponds to results of the Evals associated with the first Candidate (ChatGPT 4o-mini) and the second list corresponds to results of the Evals associated with the second Candidate (ChatGPT 4.0).
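The nested structure is convenient to summarize per Candidate. Here's a sketch using simple stand-in objects in place of real EvalResult instances (the attribute names mirror those shown above; the numbers are made up):

```python
from dataclasses import dataclass

@dataclass
class FakeResult:  # hypothetical stand-in for sik_llm_eval's EvalResult
    num_checks: int
    num_successful_checks: int

# Same shape as the harness output: one inner list per Candidate.
results = [
    [FakeResult(2, 2), FakeResult(3, 1)],  # candidate 0
    [FakeResult(2, 1), FakeResult(3, 3)],  # candidate 1
]

for i, candidate_results in enumerate(results):
    total = sum(r.num_checks for r in candidate_results)
    passed = sum(r.num_successful_checks for r in candidate_results)
    print(f"candidate {i}: {passed}/{total} checks passed ({passed / total:.1%})")
```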

Note that you can load multiple YAML files in a directory using add_evals_from_yamls and add_candidates_from_yamls:

...
eval_harness.add_evals_from_yamls('examples/evals/*.yaml')
eval_harness.add_candidates_from_yamls('examples/candidates/*.yaml')
...

Installing

uv add sik-llm-eval or pip install sik-llm-eval

Environment Variables

The following environment variables are required for using the built-in OpenAI and Anthropic Candidates:

  • OPENAI_API_KEY: This environment variable and API key are required for using OpenAIChat and OpenAICandidate.
  • ANTHROPIC_API_KEY: This environment variable and API key are required for using AnthropicChat and AnthropicCandidate.
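A quick preflight check (generic Python, not part of the library) can surface a missing key before any Evals run:

```python
import os

# Adjust to the providers you actually use.
required = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")
missing = [name for name in required if not os.environ.get(name)]
print("missing:", missing)  # empty list means all required keys are set
```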

Contributing

If you would like to contribute to sik-llm-eval, please fork the repository and submit a pull request.

See Makefile for building environment and running tests.
