LLM semantic testing and benchmarking framework
Project description
semtest
Enables semantic testing of LLM responses and benchmarking against result expectations.
Functionality
semtest supports the semantic benchmarking process through the following:
- Embedding vector generation for the expected result set of a given input -> output expectation
- Execution of inputs against a given model
- Collection of LLM responses for the given input
- Analysis of the embedding vector distance between the LLM response and the expectation
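For intuition, the comparison step measures how closely the embedding of an LLM response aligns with the embedding of the expectation. The sketch below is illustrative only, not semtest's internal code, and the vectors are made up and truncated; it shows a cosine similarity computation of the kind reported as the comparator in the benchmark output further down:

```python
import numpy as np


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    a_vec, b_vec = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a_vec, b_vec) / (np.linalg.norm(a_vec) * np.linalg.norm(b_vec)))


# Hypothetical (truncated) embeddings of an LLM response and the expected semantics
response_embedding = [0.12, -0.03, 0.55]
expectation_embedding = [0.10, -0.01, 0.60]

print(cosine_similarity(response_embedding, expectation_embedding))
```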
Core Requirements
Executing benchmarks requires OpenAI configuration variables defined in a .env file in order to generate the required embeddings:
```
OPENAI_API_KEY="..."
BASE_URL="https://api.openai.com/v1"
DEFAULT_EMBEDDING_MODEL="text-embedding-3-large"
```
The latter two default to the values shown above.
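These settings are used to generate embeddings via the OpenAI API. As a rough illustration of what the configuration enables (using the openai client directly; this is not semtest's internal code), an embedding request looks roughly like this:

```python
import os

from openai import OpenAI  # provided by the `openai` package (v1+ client)

# OPENAI_API_KEY, BASE_URL and DEFAULT_EMBEDDING_MODEL loaded from .env beforehand
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ.get("BASE_URL", "https://api.openai.com/v1"),
)

response = client.embeddings.create(
    model=os.environ.get("DEFAULT_EMBEDDING_MODEL", "text-embedding-3-large"),
    input="A dog is in the background of the photograph",
)
embedding_vector = response.data[0].embedding  # list[float]
```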
Benchmarking in direct (non-framework) mode
A full example of this can be found in examples/benchmarking_example.ipynb.
To use this form of benchmarking, apply the @semtest.benchmark decorator manually to execute a given function and obtain a series of results and their distances from the expected semantics.
Example
```python
import semtest

expected_semantics = "A dog is in the background of the photograph"

@semtest.benchmark(
    semantic_expectation=expected_semantics,
    iterations=3
)
def mock_prompt_benchmark():
    """Mock example of benchmarking functionality."""
    # intermediary logic ...
    mocked_llm_response = query_llm(...)  # example LLM autocomplete response

    return mocked_llm_response  # return LLM string response for benchmarking

res: semtest.Benchmark = mock_prompt_benchmark()  # manually executed benchmark (non-framework mode)

print(res.benchmarks())
```
Output
```json
{
  "func": "mock_prompt_benchmark_prompt_2",
  "iterations": 3,
  "comparator": "cosine_similarity",
  "expectation_input": "A dog is in the background of the photograph",
  "benchmarks": {
    "responses": [
      "In the background of the photograph there is a furry animal",
      "In the foreground there is a human, and I see a dog in the background of the photograph",
      "There are two dogs in the background of the image"
    ],
    "exceptions": [],
    "semantic_distances": [
      0.7213289868782804,
      0.7029974291193597,
      0.7529891407174136
    ],
    "mean_semantic_distance": 0.7257718522383513,
    "median_semantic_distance": 0.7213289868782804
  }
}
```
Benchmarking in framework mode
Framework mode allows you to execute a series of prepared tests from a directory, similar to other testing frameworks (pytest, etc.). Framework mode follows the same rules as direct execution mode above, with a few modifications, since the engine executes your tests for you (you do not call the benchmarks directly).
Running in framework mode is done with the following command: semtest <your_directory>
Due to its automated nature, output is currently standardized to the CLI, where a dataframe is generated and printed. Additional options for data retrieval will be added later.
Framework mode requires:
- A test directory with .py files containing your semtest.benchmark definitions
- Each benchmark should return the LLM response string you want to gauge (or a modified version of it)
See the example_benchmarks directory for an example of structuring your semantic benchmarks. The example is runnable with semtest example_benchmarks.
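For reference, a framework-mode benchmark file might look like the following minimal sketch (the file name and the query_llm stand-in are hypothetical; in practice you would call your own model):

```python
# benchmarks/photo_prompts.py -- a file discovered and executed by `semtest benchmarks`
import semtest


def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for your own LLM call."""
    return "There is a dog in the background of the photograph"


@semtest.benchmark(
    semantic_expectation="A dog is in the background of the photograph",
    iterations=3,
)
def benchmark_photo_description():
    # Return the raw LLM response string so semtest can embed and score it
    return query_llm("Describe the attached photograph.")
```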
Benchmark report:
More granular benchmark-level output details are available within the CLI interface.
Caveats:
- Framework mode does not currently support fixtures
- Relative imports are not supported within test directories, since every file is treated as a top-level module
Ongoing features
- Fixture support for framework mode
- Support for Azure OpenAI embeddings
- Support for multiple result output formats (non-CLI)
- Allow for parameterization of benchmarks with multiple I/O expectations
- Implement LLM response schema validation via Pydantic
Project details
Download files
Download the file for your platform.
Source Distribution
Built Distribution
File details
Details for the file semtest-0.0.0.tar.gz.
File metadata
- Download URL: semtest-0.0.0.tar.gz
- Upload date:
- Size: 12.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.9.20
File hashes
Algorithm | Hash digest
---|---
SHA256 | fabcdfbd2e36d3763eeb5a1494f12b3f2b58f04cf22083c399b67e1649542075
MD5 | c45aef7bcc3119059f12142da27510a4
BLAKE2b-256 | 48da033a5413e2f031eaad476d2108a91c2d945637e34b1febed886c4e0fb8b3
File details
Details for the file semtest-0.0.0-py3-none-any.whl.
File metadata
- Download URL: semtest-0.0.0-py3-none-any.whl
- Upload date:
- Size: 17.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.9.20
File hashes
Algorithm | Hash digest
---|---
SHA256 | 98d326d1fb85c12235ebc8ad829101260a464ed64bf8e91984adefa1ef315051
MD5 | 3029d371da9271f3eff5082449e0e2d3
BLAKE2b-256 | 2a181ea42706c3197d4d9fa342ffe082275044fba1d6bc0f0d51fdb7bceb562d