Simple, Pythonic building blocks to evaluate LLM applications.
Install
# Install English metrics only
pip install langcheck
# Install English and Japanese metrics
pip install langcheck[ja]
# Install metrics for all languages (requires pip 21.2+)
pip install --upgrade pip
pip install langcheck[all]
Having installation issues? See the FAQ.
Examples
Evaluate Text
Use LangCheck's suite of metrics to evaluate LLM-generated text.
import langcheck
# Generate text with any LLM library
generated_outputs = [
    'Black cat the',
    'The black cat is sitting',
    'The big black cat is sitting on the fence'
]
# Check text quality and get results as a DataFrame (threshold is optional)
langcheck.metrics.fluency(generated_outputs) > 0.5
It's easy to turn LangCheck metrics into unit tests; just use assert:
assert langcheck.metrics.fluency(generated_outputs) > 0.5
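The thresholded result is also just a Python object, so it can gate a step in your pipeline directly. A minimal sketch (the variable name and message are illustrative; the truthiness behavior follows from the assert example above):
# Keep the thresholded result around instead of asserting immediately
fluency_check = langcheck.metrics.fluency(generated_outputs) > 0.5
# The check is truthy only when every output clears the threshold
if not fluency_check:
    print('Some outputs fell below the fluency threshold')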
LangCheck includes several types of metrics to evaluate LLM applications. Some examples:
Type of Metric | Examples | Languages |
---|---|---|
Reference-Free Text Quality Metrics | toxicity(generated_outputs), sentiment(generated_outputs), ai_disclaimer_similarity(generated_outputs) | EN, JA, ZH, DE |
Reference-Based Text Quality Metrics | semantic_similarity(generated_outputs, reference_outputs), rouge2(generated_outputs, reference_outputs) | EN, JA, ZH, DE |
Source-Based Text Quality Metrics | factual_consistency(generated_outputs, sources) | EN, JA, ZH, DE |
Query-Based Text Quality Metrics | answer_relevance(generated_outputs, prompts) | EN, JA |
Text Structure Metrics | is_float(generated_outputs, min=0, max=None), is_json_object(generated_outputs) | All Languages |
Pairwise Text Quality Metrics | pairwise_comparison(generated_outputs_a, generated_outputs_b, prompts) | EN, JA |
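For example, a reference-based metric from the table takes expected outputs alongside the generated ones. A minimal sketch (the reference strings and the 0.8 threshold are illustrative):
# One reference output per generated output
reference_outputs = [
    'The black cat is sitting',
    'The black cat is sitting',
    'The big black cat is sitting on the fence'
]
# Higher scores mean the generated text is closer in meaning to the reference
langcheck.metrics.semantic_similarity(generated_outputs, reference_outputs) > 0.8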
Visualize Metrics
LangCheck comes with built-in, interactive visualizations of metrics.
# Choose some metrics
fluency_values = langcheck.metrics.fluency(generated_outputs)
sentiment_values = langcheck.metrics.sentiment(generated_outputs)
# Interactive scatter plot of one metric
fluency_values.scatter()
# Interactive scatter plot of two metrics
langcheck.plot.scatter(fluency_values, sentiment_values)
# Interactive histogram of a single metric
fluency_values.histogram()
Augment Data
Text augmentations can automatically generate reworded prompts, typos, gender changes, and more to evaluate model robustness.
For example, to measure how the model responds to different genders:
male_prompts = langcheck.augment.gender(prompts, to_gender='male')
female_prompts = langcheck.augment.gender(prompts, to_gender='female')
male_generated_outputs = [my_llm_app(prompt) for prompt in male_prompts]
female_generated_outputs = [my_llm_app(prompt) for prompt in female_prompts]
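# Compare the sentiment of the outputs for each group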
langcheck.metrics.sentiment(male_generated_outputs)
langcheck.metrics.sentiment(female_generated_outputs)
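To turn this into a concrete robustness check, you can require both groups of outputs to clear the same threshold. A minimal sketch (the 0.5 threshold is illustrative):
# Both augmented groups should clear the same sentiment floor
assert langcheck.metrics.sentiment(male_generated_outputs) > 0.5
assert langcheck.metrics.sentiment(female_generated_outputs) > 0.5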
Unit Testing
You can write test cases for your LLM application using LangCheck metrics.
For example, if you only have a list of prompts to test against:
import json

import langcheck
from langcheck.utils import load_json

# Run the LLM application once to generate text
prompts = load_json('test_prompts.json')
generated_outputs = [my_llm_app(prompt) for prompt in prompts]

# Unit tests
def test_toxicity():
    assert langcheck.metrics.toxicity(generated_outputs) < 0.1

def test_fluency():
    assert langcheck.metrics.fluency(generated_outputs) > 0.9

def test_json_structure():
    assert langcheck.metrics.validation_fn(
        generated_outputs, lambda x: 'myKey' in json.loads(x)).all()
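These tests run under a standard test runner. For example, assuming the snippet above is saved as test_llm_app.py (a hypothetical filename):
pytest test_llm_app.py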
Monitoring
You can monitor the quality of your LLM outputs in production with LangCheck metrics.
Just save the outputs and pass them into LangCheck.
production_outputs = load_json('llm_logs_2023_10_02.json')['outputs']
# Evaluate and display toxic outputs in production logs
langcheck.metrics.toxicity(production_outputs) > 0.75
# Or if your app outputs structured text
langcheck.metrics.is_json_array(production_outputs)
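If the check runs unattended, the same calls can back a small monitoring script that fails loudly on a bad batch of logs. A minimal sketch (the threshold and the error message are illustrative):
# Fail the monitoring job if any logged output crosses the toxicity threshold
toxicity_check = langcheck.metrics.toxicity(production_outputs) < 0.75
if not toxicity_check:
    raise ValueError('Found high-toxicity outputs in production logs')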
Guardrails
You can provide guardrails on LLM outputs with LangCheck metrics.
Just filter candidate outputs through LangCheck.
# Get a candidate output from the LLM app
raw_output = my_llm_app(random_user_prompt)
# Filter the output before it reaches the user
while langcheck.metrics.contains_any_strings(raw_output, blacklist_words).any():
    raw_output = my_llm_app(random_user_prompt)
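In practice you may want to cap the number of regeneration attempts instead of looping indefinitely. A minimal sketch of that variant (max_retries and the fallback message are illustrative):
# Retry a bounded number of times, then fall back to a safe canned response
max_retries = 3
raw_output = my_llm_app(random_user_prompt)
for _ in range(max_retries):
    if not langcheck.metrics.contains_any_strings(raw_output, blacklist_words).any():
        break
    raw_output = my_llm_app(random_user_prompt)
else:
    raw_output = "Sorry, I can't help with that."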