Simple, Pythonic building blocks to evaluate LLM-based applications
Install • Examples • Quickstart • Docs • Japanese (日本語)
Install
pip install langcheck
Examples
Evaluate Text
Use LangCheck's suite of metrics to evaluate LLM-generated text.
import langcheck
# Generate text with any LLM library
generated_outputs = [
    'Black cat the',
    'The black cat is sitting',
    'The big black cat is sitting on the fence'
]
# Check text quality and get results as a DataFrame (threshold is optional)
langcheck.metrics.fluency(generated_outputs) > 0.5
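The comparison above turns each metric score into a per-output pass/fail result. As a rough pure-Python illustration of the thresholding idea (the scores below are made up for illustration; this is not LangCheck's internal implementation):

```python
# Hypothetical fluency scores for the three outputs above (illustrative values)
scores = [0.2, 0.9, 0.95]
threshold = 0.5

# Thresholding turns each score into a pass/fail flag
results = [score > threshold for score in scores]
print(results)  # [False, True, True]
```

An ungrammatical output like 'Black cat the' would be expected to score low and fail such a check.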
It's easy to turn LangCheck metrics into unit tests; just use assert:
assert langcheck.metrics.fluency(generated_outputs) > 0.5
LangCheck includes several types of metrics to evaluate LLM applications. Some examples:
| Type of Metric | Examples | Languages |
|---|---|---|
| Reference-Free Text Quality Metrics | toxicity(generated_outputs), sentiment(generated_outputs), ai_disclaimer_similarity(generated_outputs) | EN, JA |
| Reference-Based Text Quality Metrics | semantic_similarity(generated_outputs, reference_outputs), rouge2(generated_outputs, reference_outputs) | EN, JA |
| Source-Based Text Quality Metrics | factual_consistency(generated_outputs, sources) | EN, JA |
| Text Structure Metrics | is_float(generated_outputs, min=0, max=None), is_json_object(generated_outputs) | All Languages |
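Text structure metrics are deterministic format checks, which is why they work for all languages. A minimal sketch of an is_json_object-style check in plain Python (an illustrative re-implementation, not LangCheck's actual code):

```python
import json

def is_json_object(outputs):
    """Return True/False per output depending on whether it parses
    as a JSON object (dict). Illustrative sketch only."""
    results = []
    for output in outputs:
        try:
            results.append(isinstance(json.loads(output), dict))
        except json.JSONDecodeError:
            results.append(False)
    return results

print(is_json_object(['{"a": 1}', '[1, 2]', 'not json']))  # [True, False, False]
```

Note that a JSON array parses successfully but is not an object, so it fails the check.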
Visualize Metrics
LangCheck comes with built-in, interactive visualizations of metrics.
# Choose some metrics
fluency_values = langcheck.metrics.fluency(generated_outputs)
sentiment_values = langcheck.metrics.sentiment(generated_outputs)
# Interactive scatter plot of one metric
fluency_values.scatter()
# Interactive scatter plot of two metrics
langcheck.plot.scatter(fluency_values, sentiment_values)
# Interactive histogram of a single metric
fluency_values.histogram()
Augment Data (coming soon)
more_prompts = []
more_prompts += langcheck.augment.keyboard_typo(prompts)
more_prompts += langcheck.augment.ocr_typo(prompts)
more_prompts += langcheck.augment.synonym(prompts)
more_prompts += langcheck.augment.gender(prompts, to_gender='male')
more_prompts += langcheck.augment.gpt35_rewrite(prompts)
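Since the augmentation API is marked as coming soon, here is a toy sketch of what a keyboard-typo augmenter could do (a hypothetical helper, not langcheck.augment's actual logic):

```python
import random

def keyboard_typo(prompts, seed=0):
    """Toy augmenter: swap one random pair of adjacent characters
    in each prompt. Illustrative only."""
    rng = random.Random(seed)  # seeded for reproducible augmentation
    augmented = []
    for p in prompts:
        if len(p) < 2:
            augmented.append(p)  # too short to swap anything
            continue
        i = rng.randrange(len(p) - 1)
        augmented.append(p[:i] + p[i + 1] + p[i] + p[i + 2:])
    return augmented

print(keyboard_typo(['tell me a story']))
```

Running augmented prompts back through your app checks whether small input perturbations change the output quality.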
Unit Testing
You can write test cases for your LLM application using LangCheck metrics.
For example, if you only have a list of prompts to test against:
import json

from langcheck.utils import load_json

# Run the LLM application once to generate text
prompts = load_json('test_prompts.json')
generated_outputs = [my_llm_app(prompt) for prompt in prompts]

# Unit tests
def test_toxicity(generated_outputs):
    assert langcheck.metrics.toxicity(generated_outputs) < 0.1

def test_fluency(generated_outputs):
    assert langcheck.metrics.fluency(generated_outputs) > 0.9

def test_json_structure(generated_outputs):
    assert langcheck.metrics.validation_fn(
        generated_outputs, lambda x: 'myKey' in json.loads(x)).all()
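The validation_fn check above amounts to applying a predicate to each output. A plain-Python sketch of the same idea, using hypothetical outputs (not LangCheck's internals):

```python
import json

# Hypothetical outputs; check whether each parses to JSON containing 'myKey'
outputs = ['{"myKey": 1}', '{"otherKey": 2}', 'not json']

def has_my_key(x):
    try:
        return 'myKey' in json.loads(x)
    except json.JSONDecodeError:
        return False  # tolerate outputs that aren't valid JSON at all
    
results = [has_my_key(o) for o in outputs]
print(results)  # [True, False, False]
```

Wrapping the predicate in a try/except keeps the check from crashing on malformed outputs, which real LLM logs routinely contain.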
Monitoring
You can monitor the quality of your LLM outputs in production with LangCheck metrics.
Just save the outputs and pass them into LangCheck.
production_outputs = load_json('llm_logs_2023_10_02.json')['outputs']
# Evaluate and display toxic outputs in production logs
langcheck.metrics.toxicity(production_outputs) > 0.75
# Or if your app outputs structured text
langcheck.metrics.is_json_array(production_outputs)
Guardrails
You can provide guardrails on LLM outputs with LangCheck metrics.
Just filter candidate outputs through LangCheck.
# Get a candidate output from the LLM app
raw_output = my_llm_app(random_user_prompt)
# Filter the output before it reaches the user
while langcheck.metrics.contains_any_strings(raw_output, blacklist_words).any():
    raw_output = my_llm_app(random_user_prompt)
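Note that a bare while loop can retry forever if the model keeps producing blacklisted words. A sketch of the same guardrail with a retry cap, using stand-in my_llm_app and contains_blacklisted helpers (both hypothetical):

```python
blacklist_words = ['badword']

def my_llm_app(prompt):
    # Stand-in for the real LLM call (hypothetical)
    return 'a safe response'

def contains_blacklisted(text, blacklist):
    return any(word in text for word in blacklist)

raw_output = my_llm_app('user prompt')
for _ in range(3):  # cap retries so a stubborn model can't loop forever
    if not contains_blacklisted(raw_output, blacklist_words):
        break
    raw_output = my_llm_app('user prompt')

print(raw_output)  # 'a safe response'
```

After the retry budget is exhausted, a real application would typically fall back to a canned refusal message rather than show the last candidate.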