A Python library for knit_space operations.

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

KnitSpace LLM Ranker: Automated LLM Testing Harness

KnitSpace is an automated testing harness designed to evaluate and compare the capabilities of various Large Language Models (LLMs) across a diverse set of tasks. It provides a comprehensive framework for researchers and developers to assess LLM performance in areas such as problem-solving, knowledge retrieval, coding proficiency, and safety.

🔑 Key Features

Multi-LLM Support: Integrates with OpenAI, Google, Cohere, Mistral, and more.
Diverse Test Suite: Includes mathematical reasoning, coding tasks, knowledge tests (MMLU), long-context, instruction-following, and obfuscation-based tests.
Elo Rating System: Scores models using task difficulty and a cognitive cost metric ("S-value") for nuanced benchmarking.
Secure Code Execution: Uses Docker containers to safely execute LLM-generated Python/JS code.
Text Obfuscation: Tests reasoning under character-mapped distortions.
Interactive Review: Launch a web-based viewer for test results.
Extensible: Easily add new LLM providers and new types of tests.

🧱 Core Components

📁 `knit_space/models.py`

Unified interface for all LLM providers.
Defines an abstract Model class plus provider subclasses such as OpenAIModel and GeminiModel.
Manages API initialization, inference calls, and model metadata.

📁 `knit_space/tests/`

Contains all test definitions.
base.py defines:
- QAItem: A test prompt, answer, and scoring logic.
- AbstractQATest: Base class for all test sets.
- TestRegistry: Auto-discovers test modules.
Includes test types: math, coding, chess, long-context, MMLU, etc.
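As a rough illustration of the QAItem idea (a prompt, an answer, and scoring logic bundled together), here is a minimal sketch. The field names `prompt`, `answer`, and `scorer`, and the `exact_match` helper, are assumptions made for this example; the actual definitions in base.py may differ.

```python
from dataclasses import dataclass
from typing import Callable


def exact_match(expected: str, response: str) -> float:
    """Hypothetical scorer: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if expected.strip().lower() == response.strip().lower() else 0.0


@dataclass
class QAItem:
    """Illustrative stand-in, not the library's real QAItem."""
    prompt: str
    answer: str
    scorer: Callable[[str, str], float] = exact_match

    def score(self, response: str) -> float:
        # Delegate to whichever scoring function this item carries.
        return self.scorer(self.answer, response)


item = QAItem(prompt="What is 2 + 2?", answer="4")
print(item.score(" 4 "))   # 1.0
print(item.score("five"))  # 0.0
```

Keeping the scorer on the item lets each test type decide what counts as correct, for example exact match for arithmetic or test-case execution for coding tasks.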

📁 `knit_space/marker.py`

Evaluates model responses.
Uses QAItem scoring logic and tracks correctness.
Implements Elo scoring using both test difficulty and S-value.
Launches a Flask server to review test results interactively.
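The exact rating formula is not given here, so the sketch below shows only the standard Elo update, treating each test item as an "opponent" whose rating stands in for its difficulty. The `weight` parameter is a hypothetical placeholder for a factor such as the S-value; how KnitSpace actually combines the two is an open assumption.

```python
def expected_score(model_rating: float, item_rating: float) -> float:
    """Standard Elo expectation that the model answers the item correctly."""
    return 1.0 / (1.0 + 10 ** ((item_rating - model_rating) / 400.0))


def elo_update(model_rating: float, item_rating: float, outcome: float,
               k: float = 32.0, weight: float = 1.0) -> float:
    """Return the model's new rating after one item.

    outcome is 1.0 for a correct answer and 0.0 for an incorrect one.
    weight is a hypothetical multiplier (for example, derived from an S-value).
    """
    delta = outcome - expected_score(model_rating, item_rating)
    return model_rating + k * weight * delta


rating = 1500.0
rating = elo_update(rating, item_rating=1500.0, outcome=1.0)
print(rating)  # 1516.0
```

Under this scheme a correct answer on a harder item (higher `item_rating`) raises the rating more than a correct answer on an easy one.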

📁 `knit_space/utils/code_executor.py`

Safely runs model-generated Python and JS code inside Docker containers.
Accepts test cases (input/output pairs) for correctness validation.
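One way such an executor could work is sketched below: build a `docker run` command for a throwaway container, feed each test input on stdin, and compare stdout with the expected output. The function names, the `python:3.11-slim` image, and the resource limits are assumptions for illustration, not the contents of code_executor.py.

```python
import subprocess
from typing import List, Tuple


def build_docker_command(code: str, image: str = "python:3.11-slim") -> List[str]:
    """Build a `docker run` argv that executes `code` in a throwaway container."""
    return [
        "docker", "run", "--rm", "-i",
        "--network", "none",   # no network access for untrusted code
        "--memory", "256m",    # cap memory use
        image, "python", "-c", code,
    ]


def outputs_match(expected: str, actual: str) -> bool:
    """Compare outputs while ignoring surrounding whitespace."""
    return expected.strip() == actual.strip()


def run_test_cases(code: str, cases: List[Tuple[str, str]],
                   timeout: float = 10.0) -> List[bool]:
    """Run `code` once per (stdin, expected_stdout) pair.

    Requires a local Docker daemon; a timeout counts as a failed case.
    """
    results = []
    for stdin_text, expected in cases:
        try:
            proc = subprocess.run(
                build_docker_command(code),
                input=stdin_text, capture_output=True, text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            results.append(False)
            continue
        results.append(proc.returncode == 0 and outputs_match(expected, proc.stdout))
    return results


# Example call (needs Docker, so it is not executed here):
# run_test_cases("print(int(input()) * 2)", [("2\n", "4"), ("5\n", "10")])
```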

📁 `knit_space/obscurers/`

Tools for generating challenging input variants.
CharObfuscator: Replaces characters using a bijective map to test reasoning under noise.
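To make the bijective-map idea concrete, here is a small sketch that shuffles the lowercase alphabet and substitutes characters one for one. The helper names and the choice of lowercase ASCII as the alphabet are assumptions; the real CharObfuscator may cover a different character set or expose a different interface.

```python
import random
import string
from typing import Dict


def make_char_map(seed: int = 0) -> Dict[str, str]:
    """Build a random one-to-one mapping over lowercase ASCII letters."""
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(letters, shuffled))


def apply_map(text: str, mapping: Dict[str, str]) -> str:
    """Substitute mapped characters; leave everything else unchanged."""
    return "".join(mapping.get(ch, ch) for ch in text)


mapping = make_char_map(seed=42)
# The inverse exists because the map is a bijection (a permutation of the alphabet).
inverse = {v: k for k, v in mapping.items()}

original = "what is two plus two?"
obfuscated = apply_map(original, mapping)
restored = apply_map(obfuscated, inverse)
print(obfuscated)
print(restored == original)  # True
```

Because the mapping is invertible, no information is lost: a model that is told the map can in principle recover the original prompt, which is what makes this a test of reasoning under noise.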

🐍 `verify-auto.py`

Main script to run tests.
Configures model, loads test classes, and executes tests.
Starts web server for results review.

⚙️ Setup

1. Prerequisites

Python 3.8+
Docker (for coding tasks)
Git

2. Installation

git clone https://github.com/C-you-know/Action-Based-LLM-Testing-Harness
cd Action-Based-LLM-Testing-Harness

python -m venv venv
source venv/bin/activate  # (Windows: venv\Scripts\activate)

pip install -r requirements.txt  # Or manually install dependencies

3. API Key Setup

Set the following environment variables based on the providers you wish to use:

export OPENAI_API_KEY="..."
export GEMINI_API_KEY="..."
export MISTRAL_API_KEY="..."
export COHERE_API_KEY="..."
# Cloudflare-specific
export CLOUDFLARE_API_KEY="..."
export CLOUDFLARE_ACCOUNT_ID="..."
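
Before launching a run, it can help to confirm which providers are actually configured. The following is a small sketch that checks the variables listed above; the provider labels used as dictionary keys are assumptions chosen for this example.

```python
import os
from typing import Dict, List, Mapping

# Variable names come from the export lines above; the provider labels are illustrative.
PROVIDER_ENV_VARS: Dict[str, List[str]] = {
    "openai": ["OPENAI_API_KEY"],
    "gemini": ["GEMINI_API_KEY"],
    "mistral": ["MISTRAL_API_KEY"],
    "cohere": ["COHERE_API_KEY"],
    "cloudflare": ["CLOUDFLARE_API_KEY", "CLOUDFLARE_ACCOUNT_ID"],
}


def missing_vars(provider: str, env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the variables a provider needs that are unset or empty."""
    return [name for name in PROVIDER_ENV_VARS[provider] if not env.get(name)]


for provider in PROVIDER_ENV_VARS:
    missing = missing_vars(provider)
    status = "ready" if not missing else "missing " + ", ".join(missing)
    print(f"{provider}: {status}")
```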

🚀 Running Tests

Run via `verify-auto.py`

Configure:
- Choose model/provider in verify-auto.py
- Select tests in test_cases list
Run:
```
python verify-auto.py
```
View:
- Console logs test stats
- Web UI opens at http://localhost:8000

Debug Test Inputs (optional)

Use QA-test.py to inspect generated test data without invoking an LLM:

python QA-test.py

🔌 Extending the Harness

➕ Adding New LLM Providers

Subclass Model in knit_space/models.py
Implement:
- _initialize_client()
- inference(...)
Update:
- PROVIDER_CLASS_MAP
- _get_api_key_for_provider() and optionally _list_api_models()
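
The steps above might look roughly like the sketch below. The `Model` base class here is a stand-in written for this example, since the real constructor and method signatures in knit_space/models.py are not shown; `EchoModel` is a hypothetical provider that needs no API key.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Type


class Model(ABC):
    """Stand-in for knit_space.models.Model; the real signatures may differ."""

    def __init__(self, model_name: str, api_key: Optional[str] = None):
        self.model_name = model_name
        self.api_key = api_key
        self.client = self._initialize_client()

    @abstractmethod
    def _initialize_client(self):
        """Create and return the provider SDK client."""

    @abstractmethod
    def inference(self, prompt: str, **kwargs) -> str:
        """Send a prompt and return the model's text response."""


class EchoModel(Model):
    """Hypothetical provider that echoes the prompt, handy for dry runs."""

    def _initialize_client(self):
        return object()  # a real provider would build its SDK client here

    def inference(self, prompt: str, **kwargs) -> str:
        return f"[{self.model_name}] {prompt}"


# Stand-in for the provider registry that new classes are added to.
PROVIDER_CLASS_MAP: Dict[str, Type[Model]] = {"echo": EchoModel}

model = PROVIDER_CLASS_MAP["echo"](model_name="echo-1")
print(model.inference("ping"))  # [echo-1] ping
```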

🧪 Adding New Test Types

Create a new file in knit_space/tests/
Subclass AbstractQATest
Implement generate() to yield QAItems
Optionally register using @register_test()
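
A new test type following these steps could be sketched as below. `QAItem`, `AbstractQATest`, and `register_test` are redefined here as simplified stand-ins so the example runs on its own; their real fields and signatures (including whether `register_test` takes a name) are assumptions.

```python
import random
from dataclasses import dataclass
from typing import Dict, Iterator, Type

TEST_REGISTRY: Dict[str, Type] = {}  # stand-in for TestRegistry


def register_test(name: str):
    """Hypothetical decorator that records a test class under `name`."""
    def decorator(cls):
        TEST_REGISTRY[name] = cls
        return cls
    return decorator


@dataclass
class QAItem:
    """Stand-in with assumed fields."""
    prompt: str
    answer: str


class AbstractQATest:
    """Stand-in base class."""

    def generate(self, count: int = 5) -> Iterator[QAItem]:
        raise NotImplementedError


@register_test("addition")
class AdditionQATest(AbstractQATest):
    """Yields simple addition questions from a seeded random generator."""

    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)

    def generate(self, count: int = 5) -> Iterator[QAItem]:
        for _ in range(count):
            a, b = self.rng.randint(1, 99), self.rng.randint(1, 99)
            yield QAItem(prompt=f"What is {a} + {b}?", answer=str(a + b))


for item in TEST_REGISTRY["addition"]().generate(count=3):
    print(item.prompt, "->", item.answer)
```

Writing generate() as a generator keeps large or procedurally built test sets lazy, and seeding the random source makes a run reproducible.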

📦 Install as a Package

You can also install this project from PyPI as a pip package:

pip install ks-llm-ranker

Release history

- 0.1.6 (Jun 12, 2025)
- 0.1.5 (Jun 12, 2025)
- 0.1.4 (Jun 12, 2025)
- 0.1.3 (Jun 12, 2025)
- 0.1.2 (Jun 12, 2025)
- 0.1.1 (Jun 12, 2025), this version
- 0.1.0 (Jun 12, 2025)

Download files

Source Distribution

ks_llm_ranker-0.1.1.tar.gz (28.6 kB)

Uploaded Jun 12, 2025 Source

Built Distribution

ks_llm_ranker-0.1.1-py3-none-any.whl (27.8 kB)

Uploaded Jun 12, 2025 Python 3

File details

Details for the file ks_llm_ranker-0.1.1.tar.gz.

File metadata

Download URL: ks_llm_ranker-0.1.1.tar.gz
Upload date: Jun 12, 2025
Size: 28.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.6

File hashes

Hashes for ks_llm_ranker-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`6e12c8d093729485ce5390411d01c2707334e5caa5f6e6b15f6ad29a51134b0d`
MD5	`fcced1a2063dd5233dd7f856b25ab64b`
BLAKE2b-256	`b1a3512c775cea52a15e947ee522a7f029935dece98a634a0a19456bdc565ac7`

File details

Details for the file ks_llm_ranker-0.1.1-py3-none-any.whl.

File metadata

Download URL: ks_llm_ranker-0.1.1-py3-none-any.whl
Upload date: Jun 12, 2025
Size: 27.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.6

File hashes

Hashes for ks_llm_ranker-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ab06025b590a76887dec81b5a68dbc378b99d2ae13ac0c5a678ac42585d092a0`
MD5	`c967af84d9974a7a119d1e86e245037b`
BLAKE2b-256	`9e5af85d870a5a021fe7de3dd2243bfb5fbc394f6276451369a680d759e7be20`
