# semtest

LLM semantic testing and benchmarking framework.

semtest enables semantic testing of LLM responses and benchmarking against result expectations.
## Functionality

semtest supports the semantic benchmarking process through the following:

- Embedding vector generation against the expected result set for a given input -> output expectation
- Execution of inputs against a given model
- Collection of LLM responses for the given input
- Analysis of the embedding vector distance between each LLM response and the expectation (sketched conceptually below)
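Conceptually, the distance analysis step embeds both the expectation and each LLM response, then compares the vectors with the configured comparator (cosine similarity in the examples below). The following is an illustrative sketch using the OpenAI Python SDK directly, not semtest's internal implementation:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (higher means semantically closer)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_score(expectation: str, response: str, model: str = "text-embedding-3-large") -> float:
    """Embed both strings and score how close the response is to the expectation."""
    data = client.embeddings.create(model=model, input=[expectation, response]).data
    return cosine_similarity(np.array(data[0].embedding), np.array(data[1].embedding))
```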
## Core Requirements

Executing benchmarks requires OpenAI configuration variables defined in a `.env` file in order to generate the required embeddings:

```
OPENAI_API_KEY="..."
BASE_URL="https://api.openai.com/v1"
DEFAULT_EMBEDDING_MODEL="text-embedding-3-large"
```

The latter two default to the values specified above.
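If you want to confirm these values are visible to your process before running benchmarks, a quick sanity check might look like the following (a minimal sketch; it assumes the optional `python-dotenv` package, which is not necessarily how semtest itself loads the file):

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv (assumption, not bundled with semtest)

load_dotenv()  # read the .env file from the current working directory

assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY must be set to generate embeddings"
print(os.getenv("BASE_URL", "https://api.openai.com/v1"))              # optional, default shown
print(os.getenv("DEFAULT_EMBEDDING_MODEL", "text-embedding-3-large"))  # optional, default shown
```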
## Benchmarking in direct (non-framework) mode

A full example can be found in `examples/benchmarking_example.ipynb`.

To utilize this benchmarking, the `@semtest.benchmark` decorator can be applied manually to execute a given function and obtain a series of results and distances from the expected semantics.
### Example

```python
import semtest

expected_semantics = "A dog is in the background of the photograph"

@semtest.benchmark(
    semantic_expectation=expected_semantics,
    iterations=3
)
def mock_prompt_benchmark():
    """Mock example of benchmarking functionality."""
    # intermediary logic ...
    mocked_llm_response = query_llm(...)  # example LLM autocomplete response

    return mocked_llm_response  # return LLM string response for benchmarking

res: semtest.Benchmark = mock_prompt_benchmark()  # manually executed benchmark (non-framework mode)

print(res.benchmarks())
```
### Output

```json
{
  "func": "mock_prompt_benchmark_prompt_2",
  "iterations": 3,
  "comparator": "cosine_similarity",
  "expectation_input": "A dog is in the background of the photograph",
  "benchmarks": {
    "responses": [
      "In the background of the photograph there is a furry animal",
      "In the foreground there is a human, and I see a dog in the background of the photograph",
      "There are two dogs in the background of the image"
    ],
    "exceptions": [],
    "semantic_distances": [
      0.7213289868782804,
      0.7029974291193597,
      0.7529891407174136
    ],
    "mean_semantic_distance": 0.7257718522383513,
    "median_semantic_distance": 0.7213289868782804
  }
}
```
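Since direct mode hands you the `semtest.Benchmark` object, a run can be gated on its aggregate score. The snippet below is an illustrative sketch: the 0.7 threshold is arbitrary, and it assumes `benchmarks()` yields the structure printed above (with the cosine-similarity comparator, where higher values indicate closer semantics):

```python
import json

result = mock_prompt_benchmark()  # Benchmark object from the decorated function above
report = result.benchmarks()

# Assumption: benchmarks() yields the structure shown above; parse it first if it is a JSON string.
if isinstance(report, str):
    report = json.loads(report)

mean_score = report["benchmarks"]["mean_semantic_distance"]
if mean_score < 0.7:  # illustrative threshold, not part of semtest
    raise AssertionError(f"Responses drifted from the expectation (mean score {mean_score:.3f})")
```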
## Benchmarking in framework mode

Framework mode allows you to execute a series of prepared tests from a directory, similar to other testing frameworks (pytest, etc.). Framework mode follows the same rules as the direct execution mode above, but with a few modifications, since the engine executes your tests for you (you do not call the benchmarks directly).

Running in framework mode is done with the following command: `semtest <your_directory>`

Due to its automated nature, output is currently standardized to the CLI, where a dataframe of results is generated and printed. Additional options for data retrieval will be added later.
Framework mode requires:

- A test directory containing `.py` files with your `semtest.benchmark` definitions
- Each benchmark returning the LLM response string you want to gauge (or a modified version of it)

See the `example_benchmarks` directory for an example of structuring your semantic benchmarks. The example is runnable with `semtest example_benchmarks`; a hypothetical benchmark file is sketched below.
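As an illustration of that structure, a benchmark file inside such a directory might look like the following (a hypothetical sketch: the file name, prompt, and `query_llm` helper are placeholders mirroring the direct-mode example above):

```python
# example_benchmarks/dog_photo_benchmark.py  (hypothetical file name)
import semtest

@semtest.benchmark(
    semantic_expectation="A dog is in the background of the photograph",
    iterations=3,
)
def benchmark_dog_photo_prompt():
    """Framework mode discovers this definition and executes it."""
    response = query_llm("Describe the attached photograph.")  # your own LLM call (placeholder)
    return response  # return the LLM response string to be scored against the expectation
```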
### Benchmark report

More granular benchmark-level output details are available within the CLI interface.
### Caveats

- Framework mode does not currently support fixtures
- Relative imports are not supported within test directories, as every file is treated as a top-level module
## Ongoing features
- Fixture support for framework mode
- Support for Azure OpenAI embeddings
- Support for multiple result output formats (non-CLI)
- Allow for parameterization of benchmarks with multiple I/O expectations
- Implement LLM response schema validation via Pydantic