walledeval
Test LLMs against jailbreaks and unprecedented harms
WalledEval is a simple library to test LLM safety by identifying whether text generated by an LLM is safe. We purposefully test with benchmarks containing negative information and toxic prompts to see whether the LLM is able to flag malicious prompts.
Install
pip install walledeval
Basic Usage
LLMs (walledeval.llm)
We support the following LLM types:
Class | LLM Type |
---|---|
HF_LLM(id, system_prompt = "") | Any HuggingFace LLM that supports Text Generation, specified with the id parameter. |
Claude(api_key, system_prompt = "") | Claude 3 Opus |
Usage is as follows:
>>> from walledeval.llm import HF_LLM, Claude
>>> hf_llm = HF_LLM("<insert llm identifier>")
>>> hf_llm.generate("How are you?")
# <output>
>>> claude = Claude("INSERT_API_KEY")
>>> claude.generate("How are you?")
# <output>
A custom abstract llm.LLM class is also defined to support other LLMs. It takes in the model identifier name and an optional system prompt system_prompt, and defines an abstract method generate(text: str) -> str.
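For example, a custom wrapper might look like the following minimal sketch. It assumes only what is described above (the abstract class is importable as walledeval.llm.LLM, its constructor takes a model identifier name and an optional system_prompt, and subclasses implement generate); the EchoLLM name and its trivial behaviour are purely illustrative.

from walledeval.llm import LLM

class EchoLLM(LLM):
    # Hypothetical toy LLM used only to illustrate the subclassing pattern.
    def __init__(self, system_prompt: str = ""):
        super().__init__("echo-llm", system_prompt)  # "echo-llm" is an illustrative identifier

    def generate(self, text: str) -> str:
        # A real subclass would call an actual model here; this one just echoes the prompt.
        return "You said: " + text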
Judges (walledeval.judge)
Judges are used to identify whether outputs are malignant. We currently support the judge ClaudeJudge, which uses Claude 3 Opus and a custom-defined taxonomy to test for malignant outputs. It returns False if the output is malignant (i.e. the output did not pass the test).
Usage is as follows:
>>> from walledeval.judge import ClaudeJudge
>>> judge = ClaudeJudge("INSERT_API_KEY")
>>> judge.check("<insert output>")
# <boolean output>
A custom abstract judge.Judge class is also defined to support other possible judges. It takes in the judge identifier name and defines an abstract method check(text: str) -> bool.
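As an illustration, a trivial keyword-based judge might look like the sketch below. It assumes only what is described above (the abstract class is importable as walledeval.judge.Judge and its constructor takes an identifier name); the KeywordJudge name and its word list are purely illustrative.

from walledeval.judge import Judge

class KeywordJudge(Judge):
    # Hypothetical toy judge used only to illustrate the subclassing pattern.
    def __init__(self):
        super().__init__("keyword-judge")  # "keyword-judge" is an illustrative identifier
        self.blocked = ["bomb", "napalm"]  # illustrative keywords only

    def check(self, text: str) -> bool:
        # Returns False (malignant, i.e. failed the test) if any blocked word appears.
        return not any(word in text.lower() for word in self.blocked)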
Benchmarks (walledeval.benchmark)
Benchmarks provide datasets to test both the LLM and the Judges. We currently support the following benchmarks:
Benchmark Name | Class |
---|---|
WMDP Benchmark | WMDP |
Usage is as follows:
>>> from walledeval.benchmark import WMDP
>>> wmdp = WMDP()
>>> wmdp.test(llm, judge)
# <logs>
# generator[logs]
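Since test returns a generator of logs, you would typically iterate over it. The following sketch assumes only that each yielded log entry is printable; it does not rely on any particular log schema.

>>> for log in wmdp.test(llm, judge):
...     print(log)
# <log>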
A custom abstract benchmark.Benchmark class is also defined for you to define your own benchmarks. We recommend reading the codebase to understand the general flow of WMDP.
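As a rough starting point before reading the codebase, a custom benchmark could follow the same shape as WMDP. The sketch below assumes that the base class is importable as walledeval.benchmark.Benchmark and that a test(llm, judge) method yielding one log per prompt is the expected interface; the class name, prompt list, and log fields are all illustrative and should be checked against the actual WMDP implementation.

from walledeval.benchmark import Benchmark

class MyPromptsBenchmark(Benchmark):
    # Hypothetical benchmark over a hard-coded prompt list, for illustration only.
    prompts = ["<insert toxic prompt>", "<insert benign prompt>"]

    def test(self, llm, judge):
        for prompt in self.prompts:
            output = llm.generate(prompt)
            # Yield one log entry per prompt: the prompt, the LLM output, and the judge verdict.
            yield {"prompt": prompt, "output": output, "safe": judge.check(output)}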