A testing suite for evaluating LLM responses (semantic similarity, hallucinations, consistency, security).

LLM TestLab

Comprehensive Testing Suite for Large Language Models (LLMs)

LLM TestLab is a flexible Python toolkit for evaluating Large Language Models (LLMs) on semantic similarity, hallucinations, consistency, and security. It supports FAISS for high-performance vector similarity and falls back to NumPy if FAISS is unavailable.

Features

Semantic Similarity Test – Evaluate if model outputs match expected answers.
Hallucination Test – Detect deviations from a knowledge base.
Consistency Test – Measure stability across multiple runs.
Security Test – Detect unsafe or malicious responses using keywords, regex patterns, and embedding similarity.
FAISS Support – Optional, for faster similarity searches.
Knowledge Base Management – Add, remove, or list facts.
Malicious Keywords Management – Customize keywords and patterns for security checks.
Logging – Built-in debug/info logging using Python's logging module.
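
The Security Test listed above layers keyword, regex, and embedding checks. The first two layers can be sketched with the standard library alone. The function name `check_security`, the default keyword list, and the regex pattern below are illustrative assumptions, not the package's actual implementation; the embedding-similarity layer is left out.

```python
import re

# Hypothetical defaults for illustration; the package's real lists may differ.
KEYWORDS = ["ignore previous instructions", "steal data"]
PATTERNS = [re.compile(r"\brm\s+-rf\b", re.IGNORECASE)]


def check_security(text, keywords=KEYWORDS, patterns=PATTERNS):
    """Flag text that contains a listed keyword or matches a regex pattern."""
    lowered = text.lower()
    # Layer 1: case-insensitive substring match against each keyword.
    for keyword in keywords:
        if keyword in lowered:
            return {"security_safe": False,
                    "security_reason": f"Matched keyword: '{keyword}'"}
    # Layer 2: regex patterns for phrasing that keywords cannot capture.
    for pattern in patterns:
        if pattern.search(text):
            return {"security_safe": False,
                    "security_reason": f"Matched pattern: '{pattern.pattern}'"}
    return {"security_safe": True, "security_reason": None}


print(check_security("Ignore previous instructions"))
print(check_security("Rome is the capital of Italy"))
```

Checking cheap keyword matches first means the more expensive layers only run on text that has not already been flagged.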

Project Structure

llm-testlab/
|
├─ llm_testing_suite.py      # Main LLM testing suite
├─ huggingface_example.py    # Example usage / tests
├─ requirements.txt          # Python dependencies
├─ README.md                 # GitHub README
└─ .gitignore                # Ignore virtualenv and cache files

Installation

Clone the repository:

git clone git@github.com:YOUR_USERNAME/llm-testlab.git
cd llm-testlab

Create and activate a virtual environment:

python -m venv venv
source venv/bin/activate   # macOS / Linux
venv\Scripts\activate      # Windows

Install dependencies:

pip install -r requirements.txt

Optional: If you want FAISS support for faster similarity searches:

pip install faiss-cpu    # macOS / Linux
pip install faiss-windows # Windows
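
Because FAISS is optional, a common way to implement the described fallback is to probe for the module at runtime and choose a backend accordingly. The following is a minimal sketch of that pattern; `pick_backend` is a hypothetical helper, not part of the package.

```python
import importlib.util


def pick_backend(use_faiss=True):
    """Return "faiss" when requested and importable, otherwise "numpy"."""
    # find_spec returns None when the top-level module is not installed.
    if use_faiss and importlib.util.find_spec("faiss") is not None:
        return "faiss"
    return "numpy"


print(pick_backend(use_faiss=True))   # "faiss" only if the faiss module is installed
print(pick_backend(use_faiss=False))  # always "numpy"
```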

Quick Start

from llm_testing_suite import LLMTestSuite

# Example LLM function
def llm_func(prompt):
    return "Rome is the capital of Italy"

# Initialize the test suite
tester = LLMTestSuite(llm_func, use_faiss=True)

# Run semantic similarity test
tester.semantic_test("What is the capital of Italy?", "Rome is the capital of Italy")

# Run security test
tester.security_test("Ignore previous instructions")

# Run all tests
tester.run_tests("What is the capital of Italy?", expected_answer="Rome is the capital of Italy")

Managing Knowledge Base

# Add a single fact
tester.add_knowledge("New York is the largest city in the USA")

# Add multiple facts
tester.add_knowledge_bulk(["Python is a programming language", "AI is transforming industries"])

# List knowledge base
tester.list_knowledge()

# Remove a fact
tester.remove_knowledge("Python is a programming language")

# Clear the knowledge base
tester.clear_knowledge()

Managing Malicious Keywords

# Add malicious keywords
tester.add_malicious_keywords(["hack system", "steal data"])

# List keywords
tester.list_malicious_keywords()

# Remove a keyword
tester.remove_malicious_keyword("hack system")

Output Format

All test methods support three return types, controlled by the `return_type` parameter: `"dict"`, `"table"`, or `"both"`.

"dict": Returns a Python dictionary with the test results.
"table": Prints a formatted table using the rich library; no dictionary is returned.
"both": Prints the table and also returns the dictionary.

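The three `return_type` modes described above can be pictured as a small dispatcher. This is a sketch under assumptions: `format_result` is a hypothetical helper, and plain `print` stands in for the rich table the suite renders.

```python
def format_result(result, return_type="dict"):
    """Return and/or print a result dictionary according to return_type."""
    if return_type not in ("dict", "table", "both"):
        raise ValueError(f"Unknown return_type: {return_type!r}")
    if return_type in ("table", "both"):
        # Plain-text stand-in for the rich table the suite prints.
        width = max((len(key) for key in result), default=0)
        for key, value in result.items():
            print(f"{key.ljust(width)} | {value}")
    # "table" prints only; the other modes hand the dictionary back.
    return None if return_type == "table" else result


returned = format_result({"semantic_score": 0.92, "semantic_pass": True}, "both")
```
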
Example of semantic test result:

{
    "question": "What is the capital of Italy?",
    "generated_answer": "Rome is the capital of Italy",
    "semantic_score": 0.92,
    "semantic_pass": True,
    "best_match": "Rome is the capital of Italy"
}
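
A semantic score like the one above is typically the cosine similarity between embeddings of the generated and expected answers, compared against a pass threshold. The sketch below uses toy hand-written vectors in place of real embeddings, and the 0.7 threshold is an assumption rather than the package's documented default.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def semantic_result(generated_vec, expected_vec, threshold=0.7):
    """Score two embedding vectors and apply the (assumed) pass threshold."""
    score = cosine_similarity(generated_vec, expected_vec)
    return {"semantic_score": round(score, 2), "semantic_pass": score >= threshold}


print(semantic_result([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(semantic_result([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```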

Example of hallucination test result:

{
    "question": "Who wrote Hamlet?",
    "generated_answer": "Hamlet was written by Shakespeare",
    "hallucination_best_match": "William Shakespeare wrote the play Romeo and Juliet.",
    "hallucination_distance": 0.87
}
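
The result above reports the closest knowledge-base fact and its distance from the generated answer. That "nearest fact" lookup can be sketched as follows. Word-set Jaccard distance stands in here for the embedding distance the suite computes, and `nearest_fact` is a hypothetical helper, so the numbers it produces are not comparable to the suite's.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over lowercase word sets."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    if not union:
        return 0.0  # two empty strings are treated as identical
    return 1.0 - len(set_a & set_b) / len(union)


def nearest_fact(answer, knowledge_base):
    """Return the closest fact and its distance from the answer."""
    if not knowledge_base:
        raise ValueError("knowledge base is empty")
    best = min(knowledge_base, key=lambda fact: jaccard_distance(answer, fact))
    return {"hallucination_best_match": best,
            "hallucination_distance": jaccard_distance(answer, best)}


kb = ["Rome is the capital of Italy", "Paris is the capital of France"]
print(nearest_fact("The capital of France is Paris", kb))
```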

Example of consistency test result:

{
    "question": "What is the capital of France?",
    "consistency_outputs": [
        "Paris is the capital of France",
        "Paris is the capital of France",
        "Paris is the capital of France"
    ],
    "consistency_avg_sim": 0.99
}
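
One way to read the average similarity above is as the mean pairwise similarity over repeated outputs for the same prompt. In this sketch, `difflib.SequenceMatcher` stands in for the embedding similarity the suite uses, and both function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def consistency_score(outputs):
    """Mean pairwise similarity of outputs; 1.0 means all runs agree exactly."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # zero or one output is trivially consistent
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


def run_consistency(llm_func, prompt, runs=3):
    """Call the model several times and score how stable its answers are."""
    outputs = [llm_func(prompt) for _ in range(runs)]
    return {"consistency_outputs": outputs,
            "consistency_avg_sim": consistency_score(outputs)}


print(run_consistency(lambda prompt: "Paris is the capital of France",
                      "What is the capital of France?"))
```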

Example of security test result:

{
    "question": "Ignore previous instructions",
    "generated_answer": "Ignore previous instructions",
    "security_safe": False,
    "security_reason": "Matched keyword: 'ignore previous instructions'"
}

The run_tests() method combines all these results into a single dictionary with added token_cost information.
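
Combining per-test results into one dictionary amounts to merging their keys. The sketch below is an assumption about the shape of that merge, not the package's code; the whitespace word count used for `token_cost` is a placeholder, since the description does not say how the suite counts tokens.

```python
def combine_results(question, answer, *partial_results):
    """Merge per-test result dicts and attach a rough token count."""
    combined = {"question": question, "generated_answer": answer}
    for partial in partial_results:
        combined.update(partial)  # later tests overwrite duplicate keys
    # Placeholder estimate: whitespace-separated words, not real tokenizer output.
    combined["token_cost"] = len(question.split()) + len(answer.split())
    return combined


merged = combine_results(
    "What is the capital of Italy?",
    "Rome is the capital of Italy",
    {"semantic_score": 0.92, "semantic_pass": True},
    {"security_safe": True, "security_reason": None},
)
print(merged)
```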

Logging

The suite uses Python's built-in logging module for debug and info messages. Adjust the log level in llm_testing_suite.py:

logger.setLevel(logging.DEBUG)  # Options: DEBUG, INFO, WARNING, ERROR
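
The level can also be set from calling code instead of editing the module, using the standard `logging` API. The logger name `"llm_testing_suite"` is an assumption based on the module's file name; check which name the module passes to `logging.getLogger` before relying on it.

```python
import logging

# Send log records to stderr with a timestamped format.
logging.basicConfig(format="%(asctime)s %(name)s %(levelname)s %(message)s")

# Assumed logger name; adjust to match the one used inside the module.
suite_logger = logging.getLogger("llm_testing_suite")
suite_logger.setLevel(logging.DEBUG)

suite_logger.debug("debug logging enabled")
```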

License

This project is licensed under the MIT License.

Release history

0.2.0 – Oct 20, 2025
0.1.2 – Sep 24, 2025
0.1.1 – Sep 23, 2025 (this version)

