
Simple, Pythonic building blocks to evaluate LLM-based applications

Project description


Install | Examples | Docs | 日本語 (Japanese)

Install

pip install langcheck

Examples

Evaluate Text

Use LangCheck's suite of metrics to evaluate LLM-generated text.

import langcheck

# Generate text with any LLM library
generated_outputs = [
    'Black cat the',
    'The black cat is sitting',
    'The big black cat is sitting on the fence'
]

# Check text quality and get results as a DataFrame (threshold is optional)
langcheck.eval.fluency(generated_outputs) > 0.5

[Screenshot: EvalValueWithThreshold results shown as a DataFrame]

It's easy to turn LangCheck metrics into unit tests: just use assert.

assert langcheck.eval.fluency(generated_outputs) > 0.5
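For example, a minimal pytest-style test might look like this (the outputs and threshold here are illustrative; the assert passes when the thresholded result is truthy, as above):

import langcheck

def test_fluency():
    # In a real test, these outputs would come from your LLM app
    generated_outputs = [
        'The black cat is sitting',
        'The big black cat is sitting on the fence'
    ]
    assert langcheck.eval.fluency(generated_outputs) > 0.5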

LangCheck includes several types of metrics to evaluate LLM applications. Some examples:

# 1. Reference-Free Text Quality Metrics
langcheck.eval.toxicity(generated_outputs)
langcheck.eval.fluency(generated_outputs)
langcheck.eval.sentiment(generated_outputs)
langcheck.eval.flesch_kincaid_grade(generated_outputs)

# 2. Reference-Based Text Quality Metrics
langcheck.eval.factual_consistency(generated_outputs, reference_outputs)
langcheck.eval.semantic_sim(generated_outputs, reference_outputs)
langcheck.eval.rouge2(generated_outputs, reference_outputs)
langcheck.eval.exact_match(generated_outputs, reference_outputs)

# 3. Text Structure Metrics
langcheck.eval.is_int(generated_outputs, domain=range(1, 6))
langcheck.eval.is_float(generated_outputs, min=0, max=None)
langcheck.eval.is_json_array(generated_outputs)
langcheck.eval.is_json_object(generated_outputs)
langcheck.eval.contains_regex(generated_outputs, r"\d{5,}")
langcheck.eval.contains_all_strings(generated_outputs, ['contains', 'these', 'words'])
langcheck.eval.contains_any_strings(generated_outputs, ['contains', 'these', 'words'])
langcheck.eval.validation_fn(generated_outputs, lambda x: 'myKey' in json.loads(x))
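As a concrete example, the text structure metrics can validate that outputs parse as the JSON you expect. A minimal sketch (the outputs and key name are illustrative, and it assumes these binary metrics score each output 0 or 1, compared against a threshold just like the fluency example above):

import json

import langcheck

generated_outputs = ['{"myKey": 1}', '{"myKey": 2}']

# A > 0 threshold passes only when every output scores 1
assert langcheck.eval.is_json_object(generated_outputs) > 0

# validation_fn applies an arbitrary predicate to each output
assert langcheck.eval.validation_fn(
    generated_outputs, lambda x: 'myKey' in json.loads(x)) > 0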

Some LangCheck metrics support using the OpenAI API. To use the OpenAI option, make sure to set the API key:

import openai
from langcheck.eval.en import semantic_sim

# https://platform.openai.com/account/api-keys
openai.api_key = YOUR_OPENAI_API_KEY

generated_outputs = ["The cat is sitting on the mat."]
reference_outputs = ["The cat sat on the mat."]
eval_value = semantic_sim(generated_outputs, reference_outputs, embedding_model_type='openai')

Or, if you're using the Azure API type, make sure to set all of the necessary variables:

import openai
from langcheck.eval.en import semantic_sim

openai.api_type = 'azure'
openai.api_base = YOUR_AZURE_OPENAI_ENDPOINT
openai.api_version = YOUR_API_VERSION
openai.api_key = YOUR_OPENAI_API_KEY

generated_outputs = ["The cat is sitting on the mat."]
reference_outputs = ["The cat sat on the mat."]

# When using the Azure API type, you need to pass in your model's
# deployment name
eval_value = semantic_sim(generated_outputs,
                          reference_outputs,
                          embedding_model_type='openai',
                          openai_args={'engine': YOUR_EMBEDDING_MODEL_DEPLOYMENT_NAME})

Visualize Metrics

LangCheck comes with built-in, interactive visualizations of metrics.

# Choose some metrics
fluency_values = langcheck.eval.fluency(generated_outputs)
sentiment_values = langcheck.eval.sentiment(generated_outputs)

# Interactive scatter plot of one metric
fluency_values.scatter()

[Screenshot: scatter plot for one metric]

# Interactive scatter plot of two metrics
langcheck.plot.scatter(fluency_values, sentiment_values)

[Screenshot: scatter plot for two metrics]

# Interactive histogram of a single metric
fluency_values.histogram()

[Screenshot: histogram for one metric]

Augment Data (coming soon)

# Assume `prompts` is a list of input prompts to your LLM app
more_prompts = []
more_prompts += langcheck.augment.keyboard_typo(prompts)
more_prompts += langcheck.augment.ocr_typo(prompts)
more_prompts += langcheck.augment.synonym(prompts)
more_prompts += langcheck.augment.gender(prompts, to_gender='male')
more_prompts += langcheck.augment.gpt35_rewrite(prompts)

Building Blocks for Monitoring

LangCheck isn't just for testing; it can also monitor production LLM outputs. Just save the outputs and pass them into LangCheck.

from langcheck.utils import load_json

recorded_outputs = load_json('llm_logs_2023_10_02.json')['outputs']
langcheck.eval.toxicity(recorded_outputs) < 0.25
langcheck.eval.is_json_array(recorded_outputs)
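In a scheduled monitoring job you would typically act on the result rather than just compute it. A minimal sketch, assuming the thresholded result exposes .all()/.any() as in the guardrails example below (send_alert is a hypothetical hook in your own stack):

import langcheck
from langcheck.utils import load_json

recorded_outputs = load_json('llm_logs_2023_10_02.json')['outputs']

# True per output when toxicity is below the threshold
toxicity_ok = langcheck.eval.toxicity(recorded_outputs) < 0.25
if not toxicity_ok.all():
    send_alert('Toxic outputs detected in production logs')  # hypothetical alert hook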

Building Blocks for Guardrails

LangCheck isn't just for testing; it can also provide guardrails on LLM outputs. Just filter candidate outputs through LangCheck.

raw_output = my_llm_app(random_user_prompt)
while langcheck.eval.contains_any_strings([raw_output], blacklist_words).any():
    raw_output = my_llm_app(random_user_prompt)
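Since an unlucky model could fail the check repeatedly, you may want to bound the retries rather than loop forever. A sketch under the same assumptions as above (my_llm_app, blacklist_words, and the fallback message are illustrative):

for _ in range(3):  # try up to 3 generations
    raw_output = my_llm_app(random_user_prompt)
    if not langcheck.eval.contains_any_strings([raw_output], blacklist_words).any():
        break  # output passed the guardrail
else:
    # every attempt failed the check; fall back to a safe canned response
    raw_output = "Sorry, I can't help with that."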

Docs

Full documentation is available on ReadTheDocs.


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

langcheck-0.0.6.tar.gz (30.3 kB)

Built Distribution

langcheck-0.0.6-py3-none-any.whl (37.4 kB, Python 3)
