A Python library for knit_space operations.
KnitSpace LLM Ranker: Automated LLM Testing Harness
KnitSpace is an automated testing harness designed to evaluate and compare the capabilities of various Large Language Models (LLMs) across a diverse set of tasks. It provides a comprehensive framework for researchers and developers to assess LLM performance in areas such as problem-solving, knowledge retrieval, coding proficiency, and safety.
🔑 Key Features
- Multi-LLM Support: Integrates with OpenAI, Google, Cohere, Mistral, and more.
- Diverse Test Suite: Includes mathematical reasoning, coding tasks, knowledge tests (MMLU), long-context, instruction-following, and obfuscation-based tests.
- Elo Rating System: Scores models using task difficulty and a cognitive cost metric ("S-value") for nuanced benchmarking.
- Secure Code Execution: Uses Docker containers to safely execute LLM-generated Python/JS code.
- Text Obfuscation: Tests reasoning under character-mapped distortions.
- Interactive Review: Launch a web-based viewer for test results.
- Extensible: Easily add new LLM providers and new types of tests.
🧱 Core Components
📁 knit_space/models.py
- Unified interface for all LLM providers.
- Abstract `Model` class + subclasses like `OpenAIModel`, `GeminiModel`, etc.
- Manages API initialization, inference calls, and model metadata.
📁 knit_space/tests/
- Contains all test definitions.
- `base.py` defines:
  - `QAItem`: a test prompt, its answer, and scoring logic.
  - `AbstractQATest`: the base class for all test sets.
  - `TestRegistry`: auto-discovers test modules.
- Includes test types: math, coding, chess, long-context, MMLU, etc.
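As an illustration of how a prompt, answer, and scoring hook can travel together, here is a minimal sketch of a `QAItem`-style record; the actual fields and scoring interface in `knit_space/tests/base.py` may differ.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class QAItem:
    """Hypothetical sketch of a test item: prompt, answer, scoring logic."""
    prompt: str                       # text sent to the model
    answer: str                       # expected answer
    # Scoring hook: returns True if the model's response counts as correct.
    # Default is exact match after stripping surrounding whitespace.
    scorer: Callable[[str, str], bool] = field(
        default=lambda response, answer: response.strip() == answer.strip()
    )

    def score(self, response: str) -> bool:
        return self.scorer(response, self.answer)

item = QAItem(prompt="What is 2 + 3?", answer="5")
print(item.score(" 5 "))   # whitespace is stripped before comparing
```

Bundling the scorer with the item lets each test type decide what "correct" means (exact match, numeric tolerance, substring, etc.) without changing the harness.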
📁 knit_space/marker.py
- Evaluates model responses.
- Uses `QAItem` scoring logic and tracks correctness.
- Implements Elo scoring using both test difficulty and S-value.
- Launches a Flask server to review test results interactively.
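For intuition, the standard Elo update looks like the sketch below, treating each test as an "opponent" whose rating reflects its difficulty. How `marker.py` actually combines difficulty with the S-value cost metric is internal to the project; this is only the textbook formula.

```python
def elo_update(model_rating: float, test_rating: float,
               passed: bool, k: float = 32.0) -> float:
    """Standard Elo: move the rating toward the observed result."""
    # Expected score of the model against a test of this difficulty
    expected = 1.0 / (1.0 + 10 ** ((test_rating - model_rating) / 400.0))
    actual = 1.0 if passed else 0.0
    return model_rating + k * (actual - expected)

rating = 1500.0
# Passing a harder-than-average test (1600) raises the rating noticeably.
rating = elo_update(rating, test_rating=1600.0, passed=True)
print(round(rating, 1))
```

The key property is asymmetry: beating a hard test moves the rating more than beating an easy one, which is what makes difficulty-aware benchmarking possible.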
📁 knit_space/utils/code_executor.py
- Runs Python and JS code from models inside Docker safely.
- Accepts test cases (input/output pairs) for correctness validation.
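A minimal sketch of sandboxed execution in this spirit is shown below: untrusted code runs in a throwaway container with no network access. The image name, resource limits, and the real `code_executor.py` API are assumptions here.

```python
import subprocess

def run_in_docker(code: str, stdin: str = "", timeout: int = 10) -> str:
    """Run untrusted Python inside a disposable, network-less container."""
    result = subprocess.run(
        ["docker", "run", "--rm", "-i",
         "--network", "none",          # no network access
         "--memory", "256m",           # cap memory usage
         "python:3.11-slim",
         "python", "-c", code],
        input=stdin, capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout

# Correctness check against an input/output pair (requires Docker):
# out = run_in_docker("print(int(input()) * 2)", stdin="21")
# assert out.strip() == "42"
```

Passing test cases as stdin/stdout pairs keeps the container generic: the same image can validate any generated program.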
📁 knit_space/obscurers/
- Tools for generating challenging input variants.
- `CharObfuscator`: replaces characters using a bijective map to test reasoning under noise.
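The idea behind a bijective character map can be sketched as follows; the real `CharObfuscator` interface is likely different, but the invertibility property is the point.

```python
import random
import string

def make_char_map(seed=0):
    """Build a bijective (permutation) map over lowercase letters."""
    rng = random.Random(seed)
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    rng.shuffle(shuffled)
    return dict(zip(letters, shuffled))

def obfuscate(text, mapping):
    """Apply the map; characters outside it pass through unchanged."""
    return "".join(mapping.get(ch, ch) for ch in text)

mapping = make_char_map(seed=42)
inverse = {v: k for k, v in mapping.items()}
scrambled = obfuscate("hello world", mapping)
# Because the map is a bijection, obfuscation is exactly invertible:
assert obfuscate(scrambled, inverse) == "hello world"
```

Because the mapping is a permutation, no information is lost: a model that truly reasons over structure can in principle recover the original text, which is what the obfuscation tests probe.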
🐍 verify-auto.py
- Main script to run tests.
- Configures model, loads test classes, and executes tests.
- Starts web server for results review.
⚙️ Setup
1. Prerequisites
- Python 3.8+
- Docker (for coding tasks)
- Git
2. Installation
git clone https://github.com/C-you-know/Action-Based-LLM-Testing-Harness
cd KnitSpace-LLM-Ranker
python -m venv venv
source venv/bin/activate # (Windows: venv\Scripts\activate)
pip install -r requirements.txt # Or manually install dependencies
3. API Key Setup
Set the following environment variables based on the providers you wish to use:
export OPENAI_API_KEY="..."
export GEMINI_API_KEY="..."
export MISTRAL_API_KEY="..."
export COHERE_API_KEY="..."
# Cloudflare-specific
export CLOUDFLARE_API_KEY="..."
export CLOUDFLARE_ACCOUNT_ID="..."
🚀 Running Tests
Run via verify-auto.py
- Configure:
  - Choose model/provider in `verify-auto.py`
  - Select tests in the `test_cases` list
- Run:
  - `python verify-auto.py`
- View:
  - Console logs test stats
  - Web UI opens at http://localhost:8000
Debug Test Inputs (optional)
Use QA-test.py to inspect generated test data without invoking an LLM:
python QA-test.py
🔌 Extending the Harness
➕ Adding New LLM Providers
- Subclass `Model` in `knit_space/models.py`
- Implement:
  - `_initialize_client()`
  - `inference(...)`
- Update:
  - `PROVIDER_CLASS_MAP`
  - `_get_api_key_for_provider()` (and optionally `_list_api_models()`)
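The steps above can be sketched as a hypothetical provider subclass. The stand-in base class, the `MYPROVIDER_API_KEY` variable, and the method signatures are assumptions for illustration; the real definitions live in `knit_space/models.py`.

```python
import os

class Model:
    """Stand-in for the real abstract base class in knit_space/models.py."""
    def _initialize_client(self): ...
    def inference(self, prompt: str) -> str: ...

class MyProviderModel(Model):
    def __init__(self, model_name: str):
        self.model_name = model_name
        self._initialize_client()

    def _initialize_client(self):
        # e.g. read the provider's API key and build its SDK client
        self.api_key = os.environ.get("MYPROVIDER_API_KEY", "")
        self.client = None   # replace with the provider's real SDK client

    def inference(self, prompt: str) -> str:
        # A real implementation would call the provider's API here;
        # echoing the prompt keeps this sketch runnable without credentials.
        return f"[{self.model_name}] {prompt}"

model = MyProviderModel("my-model-v1")
print(model.inference("Hello"))
```

After defining the subclass, you would register it in `PROVIDER_CLASS_MAP` so the harness can instantiate it by provider name.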
🧪 Adding New Test Types
- Create a new file in `knit_space/tests/`
- Subclass `AbstractQATest`
- Implement `generate()` to yield `QAItem`s
- Optionally register using `@register_test()`
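A new test type following these steps might look like the sketch below. The stand-in `QAItem` and `AbstractQATest` classes only approximate the real ones in `knit_space/tests/base.py`, and the `@register_test()` decorator is omitted since its signature is not shown here.

```python
class QAItem:
    """Stand-in for the real QAItem in knit_space/tests/base.py."""
    def __init__(self, prompt, answer):
        self.prompt, self.answer = prompt, answer

class AbstractQATest:
    """Stand-in for the real base class; subclasses implement generate()."""
    pass

class ArithmeticTest(AbstractQATest):
    def generate(self, n: int = 3):
        # Yield one QAItem per generated problem.
        for i in range(n):
            yield QAItem(prompt=f"What is {i} + {i}?", answer=str(i + i))

items = list(ArithmeticTest().generate())
print([item.answer for item in items])   # -> ['0', '2', '4']
```

Making `generate()` a generator keeps test sets lazy, so large or procedurally generated suites don't have to materialize every item up front.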
📦 Install as a Package
You can also install this project as a pip package (once published):
pip install ks-llm-ranker
File details
Details for the file ks_llm_ranker-0.1.6.tar.gz.
- Size: 83.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.6

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 5fea26408b1dd5ca55c695cdadb096861d5236841650025fdd1c9ed512d4b9ef |
| MD5 | 4b5fda746b3d9001b04b3b4545345d53 |
| BLAKE2b-256 | 0814d5f72e684502bec83bac3848dc6224e71379f85a2f8f39188531ce9eacb6 |
File details
Details for the file ks_llm_ranker-0.1.6-py3-none-any.whl.
- Size: 104.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.6

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 51392c9851743eedb3621869fdead52732320aed3c2edbc39f58c4dd29357c21 |
| MD5 | f22f47600c912d39a980783fd71a8d7c |
| BLAKE2b-256 | b02ea726fc4052385374838f8e24c4ed71c117c5eaab5cb3536a52a6a6472732 |