
semtest

LLM semantic testing and benchmarking framework

Enables semantic testing of LLM responses and benchmarking against result expectations.

Functionality

semtest supports the semantic benchmarking process through the following steps:

  1. Embedding vector generation for the expected result of a given input -> output expectation
  2. Execution of the inputs against a given model
  3. Collection of the LLM responses for each input
  4. Analysis of the embedding-vector distance between each LLM response and the expectation (sketched below)
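
For intuition, the following is a minimal, illustrative sketch of step 4 only; it is not semtest's internal implementation. It embeds both the expectation and a response with the OpenAI embeddings API and compares them with cosine similarity. The embed and cosine_similarity helpers are names introduced here purely for illustration.

import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str, model: str = "text-embedding-3-large") -> np.ndarray:
    """Generate an embedding vector for a piece of text."""
    response = client.embeddings.create(model=model, input=text)
    return np.array(response.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (closer to 1.0 = more similar)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

expectation = embed("A dog is in the background of the photograph")
response = embed("In the background of the photograph there is a furry animal")

print(cosine_similarity(expectation, response))  # higher means semantically closer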

Core Requirements

Executing benchmarks requires OpenAI configuration variables, defined in a .env file, in order to generate the required embeddings:

  • OPENAI_API_KEY="..."
  • BASE_URL="https://api.openai.com/v1"
  • DEFAULT_EMBEDDING_MODEL="text-embedding-3-large"

The latter two default to the values shown above.
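
As a quick sanity check that these variables are visible to your process, you can load and inspect them yourself, for example with the python-dotenv package. This is only an illustration; semtest's own configuration loading may differ.

import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from a local .env file into the environment

for name, default in [
    ("OPENAI_API_KEY", None),
    ("BASE_URL", "https://api.openai.com/v1"),
    ("DEFAULT_EMBEDDING_MODEL", "text-embedding-3-large"),
]:
    value = os.getenv(name) or default
    print(f"{name}: {'set' if value else 'MISSING'}")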

Benchmarking in direct (non-framework) mode

A full example can be found in examples/benchmarking_example.ipynb.

To use this benchmarking, the @semtest.benchmark decorator can be applied manually to execute a given function and obtain a series of responses along with their semantic distances from the expectation.

Example

import semtest

expected_semantics = "A dog is in the background of the photograph"

@semtest.benchmark(
    semantic_expectation=expected_semantics,
    iterations=3
)
def mock_prompt_benchmark():
    """Mock example of benchmarking functionality."""

    # intermediary logic ...

    mocked_llm_response = query_llm(...)  # example LLM completion response

    return mocked_llm_response  # return LLM string response for benchmarking

res: semtest.Benchmark = mock_prompt_benchmark()  # manually executed benchmark (non-framework mode)

print(res.benchmarks())

Output

{
  "func": "mock_prompt_benchmark_prompt_2",
  "iterations": 3,
  "comparator": "cosine_similarity",
  "expectation_input": "A dog is in the background of the photograph",
  "benchmarks": {
    "responses": [
      "In the background of the photograph there is a furry animal",
      "In the foreground there is a human, and I see a dog in the background of the photograph",
      "There are two dogs in the background of the image"
    ],
    "exceptions": [],
    "semantic_distances": [
      0.7213289868782804,
      0.7029974291193597,
      0.7529891407174136
    ],
    "mean_semantic_distance": 0.7257718522383513,
    "median_semantic_distance": 0.7213289868782804
  }
}
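
A result like the one above can be folded into a simple pass/fail check. The sketch below assumes res.benchmarks() returns the dictionary shown above (if it returns a JSON string, it is parsed first); the 0.7 threshold is an arbitrary example, not a semtest default.

import json

report = res.benchmarks()
if isinstance(report, str):  # defensively handle a serialized report
    report = json.loads(report)

THRESHOLD = 0.7  # hypothetical minimum acceptable mean cosine similarity

assert not report["benchmarks"]["exceptions"], "benchmark iterations raised exceptions"
assert report["benchmarks"]["mean_semantic_distance"] >= THRESHOLD, (
    f"mean semantic distance {report['benchmarks']['mean_semantic_distance']:.3f} "
    f"is below the {THRESHOLD} threshold"
)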

Benchmarking in framework mode

Framework mode allows you to execute a series of prepared tests from a directory, similar to other testing frameworks (pytest, etc.). Framework mode follows the same rules as direct execution mode above, with a few modifications, since the engine executes your tests for you (you do not call the benchmarks directly).

Running in framework mode is done with the following command: semtest <your_directory>

Due to its automated nature, output is currently standardized to the CLI, where a results dataframe is generated and printed. Additional options for data retrieval will be added later.

Framework mode requires:

  • A test directory with .py files containing your semtest.benchmark definitions
  • Each benchmark should return the LLM response string you want to gauge (or a modified version of it)

See the example_benchmarks directory for an example of how to structure your semantic benchmarks. The example is runnable with semtest example_benchmarks.
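
For illustration, a file inside such a directory might look like the sketch below. The file name, the query_llm placeholder, and the benchmark name are all hypothetical; the decorator usage mirrors the direct-mode example above.

# example_benchmarks/photo_caption_benchmarks.py  (hypothetical file name)
import semtest

def query_llm(prompt: str) -> str:
    """Placeholder for your real LLM call."""
    return "In the background of the photograph there is a furry animal"

@semtest.benchmark(
    semantic_expectation="A dog is in the background of the photograph",
    iterations=3,
)
def caption_mentions_background_dog():
    """Executed by the semtest engine; returns the LLM response string to benchmark."""
    return query_llm("Describe the photograph.")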

Benchmark report: (screenshot of the CLI benchmark report)

More granular benchmark-level output details are available within the CLI interface.

Caveats:

  • Framework mode does not currently support fixtures
  • Relative imports are not supported within test directories, since every file is treated as a top-level module

Ongoing features

  • Fixture support for framework mode
  • Support for Azure OpenAI embeddings
  • Support for multiple result output formats (non-CLI)
  • Allow for parameterization of benchmarks with multiple I/O expectations
  • Implement LLM response schema validation via Pydantic

