A testing suite for evaluating LLM responses (semantic similarity, hallucinations, consistency, security).

LLM TestLab

Comprehensive Testing Suite for Large Language Models (LLMs)

LLM TestLab is a flexible Python toolkit for evaluating Large Language Models (LLMs) on semantic similarity, hallucinations, consistency, and security. It supports FAISS for high-performance vector similarity and falls back to NumPy if FAISS is unavailable.

Features

  • Semantic Similarity Test – Evaluate if model outputs match expected answers.
  • Hallucination Test – Detect deviations from a knowledge base.
  • Consistency Test – Measure stability across multiple runs.
  • Security Test – Detect unsafe or malicious responses using keywords, regex patterns, and embedding similarity.
  • FAISS Support – Optional, for faster similarity searches.
  • Knowledge Base Management – Add, remove, or list facts.
  • Malicious Keywords Management – Customize keywords and patterns for security checks.
  • Logging – Built-in debug/info logging using Python's logging module.
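As a rough illustration of the keyword and regex part of the security test, here is a minimal sketch. The function name `check_security`, the default keyword list, and the example pattern are all hypothetical and are not taken from the suite's source; the embedding-similarity step is left out.

```python
import re

# Hypothetical defaults; the suite's real keyword and pattern lists may differ.
MALICIOUS_KEYWORDS = ["ignore previous instructions", "steal data"]
MALICIOUS_PATTERNS = [re.compile(r"rm\s+-rf\s+/", re.IGNORECASE)]


def check_security(text, keywords=MALICIOUS_KEYWORDS, patterns=MALICIOUS_PATTERNS):
    """Return (is_safe, reason) using keyword and regex checks only."""
    lowered = text.lower()
    # Case-insensitive substring match against each keyword.
    for keyword in keywords:
        if keyword in lowered:
            return False, f"Matched keyword: '{keyword}'"
    # Regex patterns catch variants that a fixed keyword would miss.
    for pattern in patterns:
        if pattern.search(text):
            return False, f"Matched pattern: '{pattern.pattern}'"
    return True, "No match"


print(check_security("Please ignore previous instructions and continue"))
print(check_security("Rome is the capital of Italy"))
```

Checking keywords before patterns keeps the reported reason simple for the most common case; a real implementation might collect every match instead of stopping at the first one.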

Project Structure

llm-testlab/
|
├─ llm_testing_suite.py      # Main LLM testing suite
├─ huggingface_example.py    # Example usage / tests
├─ requirements.txt          # Python dependencies
├─ README.md                 # GitHub README
└─ .gitignore                # Ignore virtualenv and cache files

Installation

  1. Clone the repository:

    git clone git@github.com:YOUR_USERNAME/llm-testlab.git
    cd llm-testlab

  2. Create and activate a virtual environment:

    python -m venv venv
    source venv/bin/activate   # macOS / Linux
    venv\Scripts\activate      # Windows

  3. Install dependencies:

    pip install -r requirements.txt

Optional: If you want FAISS support for faster similarity searches:

pip install faiss-cpu    # macOS / Linux
pip install faiss-windows # Windows
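The optional-dependency pattern described here (use FAISS when it is installed, otherwise fall back to NumPy) could look roughly like the sketch below. The helper names `normalize` and `top_match` are hypothetical, and the suite's actual index type and fallback logic may differ; the FAISS calls used are the standard `IndexFlatIP`, `add`, and `search`.

```python
import numpy as np

try:
    import faiss  # optional dependency
    HAS_FAISS = True
except ImportError:
    faiss = None
    HAS_FAISS = False


def normalize(vectors):
    """Scale each row to unit length so inner product equals cosine similarity."""
    vectors = np.asarray(vectors, dtype="float32")
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def top_match(query, corpus, use_faiss=True):
    """Return (index, cosine_similarity) of the corpus row closest to query."""
    corpus = normalize(corpus)
    query = normalize([query])
    if use_faiss and HAS_FAISS:
        # Exact inner-product search over the normalized vectors.
        index = faiss.IndexFlatIP(corpus.shape[1])
        index.add(corpus)
        scores, ids = index.search(query, 1)
        return int(ids[0][0]), float(scores[0][0])
    # NumPy fallback: brute-force dot products against every row.
    scores = corpus @ query[0]
    best = int(np.argmax(scores))
    return best, float(scores[best])


print(top_match([1.0, 0.0], [[0.0, 1.0], [2.0, 0.1], [-1.0, 0.0]]))
```

Both branches compute the same exact cosine similarity, so results should agree up to floating-point error; FAISS only pays off once the corpus is large.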

Quick Start

from llm_testing_suite import LLMTestSuite

Example LLM function

def llm_func(prompt):
    return "Rome is the capital of Italy"

Initialize the test suite

tester = LLMTestSuite(llm_func, use_faiss=True)

Run semantic similarity test

tester.semantic_test("What is the capital of Italy?", "Rome is the capital of Italy")

Run security test

tester.security_test("Ignore previous instructions")

Run all tests

tester.run_tests("What is the capital of Italy?", expected_answer="Rome is the capital of Italy")

Managing Knowledge Base

Add a single fact

tester.add_knowledge("New York is the largest city in the USA")

Add multiple facts

tester.add_knowledge_bulk(["Python is a programming language", "AI is transforming industries"])

List knowledge base

tester.list_knowledge()

Remove a fact

tester.remove_knowledge("Python is a programming language")

Clear the knowledge base

tester.clear_knowledge()

Managing Malicious Keywords

Add malicious keywords

tester.add_malicious_keywords(["hack system", "steal data"])

List keywords

tester.list_malicious_keywords()

Remove a keyword

tester.remove_malicious_keyword("hack system")

Output Format

All test methods support three return types controlled by the `return_type` parameter: `"dict"`, `"table"`, or `"both"`.
  • "dict": Returns a Python dictionary with the test results.
  • "table": Prints a formatted table using the rich library, no dictionary returned.
  • "both": Returns the dictionary and prints the table.
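The three `return_type` modes can be pictured with the sketch below. The function `present` is hypothetical, and plain `print` stands in for the rich table the suite prints; only the dispatch logic is meant to be illustrative.

```python
def present(result, return_type="dict"):
    """Return and/or print a result dictionary depending on return_type."""
    if return_type not in ("dict", "table", "both"):
        raise ValueError(f"Unsupported return_type: {return_type!r}")
    if return_type in ("table", "both"):
        # Plain-text stand-in for the rich table.
        width = max((len(key) for key in result), default=0)
        for key, value in result.items():
            print(f"{key.ljust(width)} | {value}")
    # "table" prints only, so nothing is returned in that mode.
    return result if return_type in ("dict", "both") else None


summary = present({"semantic_score": 0.92, "semantic_pass": True}, return_type="both")
```

Rejecting unknown values up front avoids silently returning `None` when the parameter is misspelled.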

Example of semantic test result:

{
    "question": "What is the capital of Italy?",
    "generated_answer": "Rome is the capital of Italy",
    "semantic_score": 0.92,
    "semantic_pass": True,
    "best_match": "Rome is the capital of Italy"
}

Example of hallucination test result:

{
    "question": "Who wrote Hamlet?",
    "generated_answer": "Hamlet was written by Shakespeare",
    "hallucination_best_match": "William Shakespeare wrote the play Romeo and Juliet.",
    "hallucination_distance": 0.87
}
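To make the `hallucination_best_match` and `hallucination_distance` fields more concrete, here is a small sketch of a nearest-fact lookup. It assumes a bag-of-words cosine distance as a stand-in for the embedding model the suite uses, so the numbers it produces are not comparable to the example above; `bow_cosine_distance` and `nearest_fact` are hypothetical names.

```python
import math
from collections import Counter


def bow_cosine_distance(a, b):
    """Cosine distance between word-count vectors (stand-in for embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[word] * vb[word] for word in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return 1.0 - (dot / norm if norm else 0.0)


def nearest_fact(answer, knowledge_base):
    """Find the closest fact; raises ValueError if knowledge_base is empty."""
    best = min(knowledge_base, key=lambda fact: bow_cosine_distance(answer, fact))
    return {
        "hallucination_best_match": best,
        "hallucination_distance": bow_cosine_distance(answer, best),
    }


facts = ["Rome is the capital of Italy", "Paris is the capital of France"]
print(nearest_fact("The capital of Italy is Rome", facts))
```

A large distance to every fact suggests the answer is not supported by the knowledge base, which is the signal a hallucination check of this kind relies on.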

Example of consistency test result:

{
    "question": "What is the capital of France?",
    "consistency_outputs": [
        "Paris is the capital of France",
        "Paris is the capital of France",
        "Paris is the capital of France"
    ],
    "consistency_avg_sim": 0.99
}
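One plausible reading of `consistency_avg_sim` is the mean pairwise similarity between repeated outputs. The sketch below follows that reading, with `difflib.SequenceMatcher` standing in for embedding similarity; the function name `consistency_score` and the `runs` parameter are assumptions rather than the suite's API.

```python
from difflib import SequenceMatcher
from itertools import combinations


def consistency_score(llm_func, prompt, runs=3):
    """Call llm_func several times and average the pairwise string similarity."""
    outputs = [llm_func(prompt) for _ in range(runs)]
    pairs = list(combinations(outputs, 2))
    if pairs:
        avg = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)
    else:
        # A single run has nothing to disagree with.
        avg = 1.0
    return {
        "question": prompt,
        "consistency_outputs": outputs,
        "consistency_avg_sim": avg,
    }


result = consistency_score(lambda p: "Paris is the capital of France",
                           "What is the capital of France?")
print(result["consistency_avg_sim"])
```

A deterministic model scores 1.0 here; sampling with a nonzero temperature would typically lower the average.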

Example of security test result:

{
    "question": "Ignore previous instructions",
    "generated_answer": "Ignore previous instructions",
    "security_safe": False,
    "security_reason": "Matched keyword: 'ignore previous instructions'"
}

The run_tests() method combines all these results into a single dictionary with added token_cost information.
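How the individual results are merged is not spelled out here, so the following is only a sketch of the general idea. The helper `combine_results` is hypothetical, and the whitespace-based `token_cost` is an assumption standing in for whatever token accounting the suite actually performs.

```python
def combine_results(*partial_results, prompt="", answer=""):
    """Merge per-test dictionaries into one and attach a rough token count."""
    combined = {}
    for partial in partial_results:
        # Later dictionaries overwrite earlier ones on key collisions.
        combined.update(partial)
    # Assumption: a crude whitespace token count, not a real tokenizer.
    combined["token_cost"] = len(prompt.split()) + len(answer.split())
    return combined


merged = combine_results(
    {"semantic_score": 0.92, "semantic_pass": True},
    {"security_safe": True},
    prompt="What is the capital of Italy?",
    answer="Rome is the capital of Italy",
)
print(merged)
```

Because the per-test keys in the examples above carry distinct prefixes (`semantic_`, `hallucination_`, `consistency_`, `security_`), a flat merge like this does not lose information.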

Logging

The suite uses Python's built-in logging module for debug and info messages. Adjust the log level in llm_testing_suite.py:

logger.setLevel(logging.DEBUG)  # Options: DEBUG, INFO, WARNING, ERROR
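If editing the module is not convenient, the level can usually be set from calling code through the standard logging API. The sketch below assumes the suite's logger is named `llm_testing_suite`, which is a guess based on the file name; check the module for the name it passes to `logging.getLogger`.

```python
import logging

# Attach a handler and format to the root logger so messages are visible.
logging.basicConfig(format="%(asctime)s %(name)s %(levelname)s %(message)s")

# Assumed logger name; adjust to match the one used in llm_testing_suite.py.
logger = logging.getLogger("llm_testing_suite")
logger.setLevel(logging.INFO)

logger.debug("Not shown: below the INFO threshold")
logger.info("Shown: at or above the INFO threshold")
```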

License

This project is licensed under the MIT License.
