
LLM semantic testing and benchmarking framework

Project description

semtest

Enables semantic testing of LLM responses and benchmarking against result expectations.

PyPI

Install the latest version of semtest from PyPI: pip install semtest

Functionality

semtest supports the semantic benchmarking process through the following:

  1. Generation of embedding vectors for the expected result of a given input -> output expectation
  2. Execution of inputs against a given model
  3. Collection of LLM responses for the given input
  4. Analysis of the embedding-vector difference between the LLM response and the expectation
  5. Coming soon: generic comparator implementation
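
To make step 4 concrete, here is a minimal sketch of a cosine-similarity comparison between two embedding vectors. It is not semtest's internal implementation; the function name and the toy three-dimensional vectors are assumptions for illustration only (real embedding vectors have far more dimensions).

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" standing in for the expectation and a response.
expectation_vec = [1.0, 0.0, 1.0]
response_vec = [1.0, 1.0, 0.0]
score = cosine_similarity(expectation_vec, response_vec)
print(f"similarity: {score:.3f}")
```

A score close to 1 means the two vectors point in nearly the same direction; a score close to 0 means they are nearly orthogonal.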

Core Requirements

Executing benchmarks requires OpenAI configuration variables, defined in a .env file, in order to generate the required embeddings:

  • OPENAI_API_KEY="..."
  • BASE_URL="https://api.openai.com/v1"
  • DEFAULT_EMBEDDING_MODEL="text-embedding-3-large"

The latter two default to the values shown above.
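
The defaulting behaviour can be sketched as follows. This is an illustration under stated assumptions, not semtest's own configuration code: load_settings and DEFAULTS are hypothetical names, and the sketch reads from an already-populated mapping rather than parsing the .env file itself.

```python
import os
from typing import Dict, Mapping

# Defaults taken from the values listed above.
DEFAULTS = {
    "BASE_URL": "https://api.openai.com/v1",
    "DEFAULT_EMBEDDING_MODEL": "text-embedding-3-large",
}


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Resolve the three settings, requiring only OPENAI_API_KEY (hypothetical helper)."""
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    settings = {"OPENAI_API_KEY": api_key}
    for name, default in DEFAULTS.items():
        # Fall back to the documented default when the variable is absent.
        settings[name] = env.get(name, default)
    return settings


# Only the API key is supplied here, so the other two fall back to their defaults.
example = load_settings({"OPENAI_API_KEY": "sk-placeholder"})
print(example["BASE_URL"], example["DEFAULT_EMBEDDING_MODEL"])
```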

Benchmarking in direct (non-framework) mode

Full example of this can be found in examples/benchmarking_example.ipynb.

To use benchmarking in this mode, apply the @semtest.benchmark decorator to a function and call that function yourself. The call returns a series of responses together with their distances from the expected semantics.

Example

import semtest


cosine_similarity = semtest.CosineSimilarity(
    semantic_expectation="A dog is in the background of the photograph"
)

# Benchmark definition; the decorated function is executed `iterations` times
@semtest.benchmark(
    comparator=cosine_similarity,
    iterations=3
)
def mock_prompt_benchmark():
    """Mock example of benchmarking functionality."""

    # intermediary logic ...

    mocked_llm_response = query_llm(...)  # example llm autocomplete response

    return mocked_llm_response  # return LLM string response for benchmarking

res: semtest.Benchmark = mock_prompt_benchmark()  # manually executed benchmark (non-framework mode)

print(res.benchmarks())

Output

{
  "func": "mock_prompt_benchmark",
  "iterations": 3,
  "comparator": "cosine_similarity",
  "expectation_input": "A dog is in the background of the photograph",
  "benchmarks": {
    "responses": [
      "In the background of the photograph there is a furry animal",
      "In the foreground there is a human, and I see a dog in the background of the photograph",
      "There are two dogs in the background of the image"
    ],
    "exceptions": [],
    "semantic_distances": [
      0.7213289868782804,
      0.7029974291193597,
      0.7529891407174136
    ],
    "mean_semantic_distance": 0.7257718522383513,
    "median_semantic_distance": 0.7213289868782804
  }
}
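
As a sanity check on the summary fields, the mean and median can be recomputed from the per-iteration values. The sketch below works on a trimmed copy of the printed JSON above; it does not call semtest and makes no assumption about the return type of res.benchmarks().

```python
import json
import statistics

# Trimmed copy of the report shown above (only the numeric fields are kept).
report_json = """
{
  "func": "mock_prompt_benchmark",
  "iterations": 3,
  "benchmarks": {
    "semantic_distances": [
      0.7213289868782804,
      0.7029974291193597,
      0.7529891407174136
    ],
    "mean_semantic_distance": 0.7257718522383513,
    "median_semantic_distance": 0.7213289868782804
  }
}
"""

report = json.loads(report_json)
distances = report["benchmarks"]["semantic_distances"]

# Recompute the summary statistics from the per-iteration values.
mean_distance = statistics.fmean(distances)
median_distance = statistics.median(distances)

print("mean:", mean_distance)
print("median:", median_distance)
```

The recomputed mean may differ from the reported one in the last floating-point digit, so compare with a tolerance rather than for exact equality.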

Benchmarking in framework mode

Framework mode allows you to execute a series of prepared tests from a directory, similar to other testing frameworks (pytest, etc.). It follows the same rules as direct execution mode, with one key difference: the engine executes your tests, so you do not call the benchmarks directly.

Running in framework mode is done with the following command: semtest <your_directory>.

Because framework mode is automated, outputs are currently generated as a dataframe and printed to the CLI. Additional options for data retrieval will be added later.

Framework mode requires:

  • A test directory with .py files containing your semtest.benchmark definitions
  • Each benchmark should return the llm response string you want to gauge (or a modified/parsed version of it)

See the example_benchmarks directory for an example of how to structure your semantic benchmarks. The example can be run with semtest example_benchmarks.

Benchmark report: [image: Benchmark Report]

More granular benchmark-level output details are available in the CLI.

Caveats:

  • Framework mode does not currently support fixtures
  • Relative imports within test directories are not supported, because every file is treated as a top-level module
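
The relative-import caveat can be reproduced with the standard library alone. The sketch below shows one common way a runner might load each .py file as a top-level module; it is an assumption for illustration and not necessarily how semtest discovers tests. The helper name load_as_top_level and the file names are hypothetical.

```python
import importlib.util
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple


def load_as_top_level(directory: Path) -> Tuple[List[str], Dict[str, str]]:
    """Import every .py file in `directory` as a module with no parent package."""
    loaded: List[str] = []
    failed: Dict[str, str] = {}
    for path in sorted(directory.glob("*.py")):
        # The module name is just the file stem, so the module has no package.
        spec = importlib.util.spec_from_file_location(path.stem, path)
        if spec is None or spec.loader is None:
            failed[path.name] = "no loader found"
            continue
        module = importlib.util.module_from_spec(spec)
        try:
            spec.loader.exec_module(module)
        except ImportError as exc:
            failed[path.name] = str(exc)
        else:
            loaded.append(path.name)
    return loaded, failed


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "helpers.py").write_text("VALUE = 1\n")
    (tmp_dir / "bench_ok.py").write_text("NAME = 'ok'\n")
    # A relative import needs a parent package, which a top-level module lacks.
    (tmp_dir / "bench_relative.py").write_text("from .helpers import VALUE\n")
    loaded, failed = load_as_top_level(tmp_dir)

print("loaded:", loaded)
print("failed:", failed)
```

Under this loading scheme, the file that uses a relative import fails with an ImportError while the other two load normally.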

Ongoing features

  • Implement flexible comparators:
    • response schema validation (SchemaValidator) & ground truth k/v validator (GroundTruthValidator)
  • Fixture support for framework mode
  • Support for multiple result output formats (non-CLI)
  • Allow for parameterization of benchmarks with multiple I/O expectations
