

Project description



Simple, Pythonic building blocks to evaluate LLM applications.

Install | Examples | Quickstart | Docs | 日本語 | 中文 | Deutsch

Install

# Install English metrics only
pip install langcheck

# Install English and Japanese metrics
pip install langcheck[ja]

# Install metrics for all languages (requires pip 21.2+)
pip install --upgrade pip
pip install langcheck[all]

Having installation issues? See the FAQ.

Examples

Evaluate Text

Use LangCheck's suite of metrics to evaluate LLM-generated text.

import langcheck

# Generate text with any LLM library
generated_outputs = [
    'Black cat the',
    'The black cat is sitting',
    'The big black cat is sitting on the fence'
]

# Check text quality and get results as a DataFrame (threshold is optional)
langcheck.metrics.fluency(generated_outputs) > 0.5

[Screenshot: MetricValueWithThreshold results displayed as a DataFrame]

To turn LangCheck metrics into unit tests, just use assert:

assert langcheck.metrics.fluency(generated_outputs) > 0.5

LangCheck includes several types of metrics to evaluate LLM applications. Some examples:

Type of Metric | Examples | Languages
--- | --- | ---
Reference-Free Text Quality Metrics | toxicity(generated_outputs), sentiment(generated_outputs), ai_disclaimer_similarity(generated_outputs) | EN, JA, ZH, DE
Reference-Based Text Quality Metrics | semantic_similarity(generated_outputs, reference_outputs), rouge2(generated_outputs, reference_outputs) | EN, JA, ZH, DE
Source-Based Text Quality Metrics | factual_consistency(generated_outputs, sources) | EN, JA, ZH, DE
Query-Based Text Quality Metrics | answer_relevance(generated_outputs, prompts) | EN, JA
Text Structure Metrics | is_float(generated_outputs, min=0, max=None), is_json_object(generated_outputs) | All Languages
Pairwise Text Quality Metrics | pairwise_comparison(generated_outputs_a, generated_outputs_b, prompts) | EN, JA
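
For example, the reference-based and source-based metrics from the table take the same generated outputs plus the extra inputs named in their signatures. A minimal sketch, reusing generated_outputs from above (the reference texts, sources, and thresholds here are made-up for illustration):

# Reference-based: compare each output against a gold-standard answer
reference_outputs = [
    'The black cat is sitting',
    'The black cat is sitting',
    'The big black cat is sitting on the fence'
]
langcheck.metrics.semantic_similarity(generated_outputs, reference_outputs) > 0.8

# Source-based: check each output against the document it was generated from
sources = ['A big black cat is sitting on a wooden fence.'] * 3
langcheck.metrics.factual_consistency(generated_outputs, sources) > 0.5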

Visualize Metrics

LangCheck comes with built-in, interactive visualizations of metrics.

# Choose some metrics
fluency_values = langcheck.metrics.fluency(generated_outputs)
sentiment_values = langcheck.metrics.sentiment(generated_outputs)

# Interactive scatter plot of one metric
fluency_values.scatter()

[Screenshot: scatter plot for one metric]

# Interactive scatter plot of two metrics
langcheck.plot.scatter(fluency_values, sentiment_values)

[Screenshot: scatter plot for two metrics]

# Interactive histogram of a single metric
fluency_values.histogram()

[Screenshot: histogram for one metric]

Augment Data

Text augmentations can automatically generate reworded prompts, typos, gender changes, and more to evaluate model robustness.

For example, to measure how the model responds to different genders:

male_prompts = langcheck.augment.gender(prompts, to_gender='male')
female_prompts = langcheck.augment.gender(prompts, to_gender='female')

male_generated_outputs = [my_llm_app(prompt) for prompt in male_prompts]
female_generated_outputs = [my_llm_app(prompt) for prompt in female_prompts]

langcheck.metrics.sentiment(male_generated_outputs)
langcheck.metrics.sentiment(female_generated_outputs)
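
To compare the two sets of scores directly, you can read the per-output values off the returned metric objects. A rough sketch, assuming the result exposes its scores as a metric_values list (the 0.1 tolerance is an arbitrary example):

from statistics import mean

male_sentiment = langcheck.metrics.sentiment(male_generated_outputs)
female_sentiment = langcheck.metrics.sentiment(female_generated_outputs)

# Flag a large gap in average sentiment between the two augmented prompt sets
gap = abs(mean(male_sentiment.metric_values) - mean(female_sentiment.metric_values))
assert gap < 0.1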

Unit Testing

You can write test cases for your LLM application using LangCheck metrics.

For example, if you only have a list of prompts to test against:

import json

import langcheck
from langcheck.utils import load_json

# Run the LLM application once to generate text
prompts = load_json('test_prompts.json')
generated_outputs = [my_llm_app(prompt) for prompt in prompts]

# Unit tests
def test_toxicity(generated_outputs):
    assert langcheck.metrics.toxicity(generated_outputs) < 0.1

def test_fluency(generated_outputs):
    assert langcheck.metrics.fluency(generated_outputs) > 0.9

def test_json_structure(generated_outputs):
    assert langcheck.metrics.validation_fn(
        generated_outputs, lambda x: 'myKey' in json.loads(x)).all()
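
If you run these with pytest, the generated_outputs argument would normally come from a fixture rather than a module-level variable. A minimal sketch using plain pytest (not a LangCheck API), reusing load_json and my_llm_app from above:

import pytest

@pytest.fixture(scope='session')
def generated_outputs():
    # Generate outputs once and share them across all test cases
    prompts = load_json('test_prompts.json')
    return [my_llm_app(prompt) for prompt in prompts]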

Monitoring

You can monitor the quality of your LLM outputs in production with LangCheck metrics.

Just save the outputs and pass them into LangCheck.

from langcheck.utils import load_json

production_outputs = load_json('llm_logs_2023_10_02.json')['outputs']

# Evaluate and display toxic outputs in production logs
langcheck.metrics.toxicity(production_outputs) > 0.75

# Or if your app outputs structured text
langcheck.metrics.is_json_array(production_outputs)
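
The thresholded toxicity check above can also drive a simple alert. A minimal sketch, assuming the thresholded result supports any() like the metric values in the other snippets (notify_on_call_team is a hypothetical hook for your own monitoring setup):

# Alert if any logged output is flagged as toxic
toxicity_check = langcheck.metrics.toxicity(production_outputs) > 0.75
if toxicity_check.any():
    notify_on_call_team(toxicity_check)  # hypothetical alerting hook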

Guardrails

You can provide guardrails on LLM outputs with LangCheck metrics.

Just filter candidate outputs through LangCheck.

# Get a candidate output from the LLM app
raw_output = my_llm_app(random_user_prompt)

# Filter the output before it reaches the user
while langcheck.metrics.contains_any_strings(raw_output, blacklist_words).any():
    raw_output = my_llm_app(random_user_prompt)
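
In practice you may want to cap the retries so the loop can't spin forever if the model keeps producing blocked output. A minimal sketch using only the calls shown above (the retry count and fallback message are arbitrary examples):

# Retry a bounded number of times, then fall back to a canned response
raw_output = my_llm_app(random_user_prompt)
for _ in range(3):
    if not langcheck.metrics.contains_any_strings(raw_output, blacklist_words).any():
        break
    raw_output = my_llm_app(random_user_prompt)
else:
    raw_output = "Sorry, I can't help with that."  # fallback if every retry was blocked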

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

langcheck-0.8.0.tar.gz (101.9 kB)

Uploaded Source

Built Distribution

langcheck-0.8.0-py3-none-any.whl (167.9 kB)

Uploaded Python 3

File details

Details for the file langcheck-0.8.0.tar.gz.

File metadata

  • Download URL: langcheck-0.8.0.tar.gz
  • Upload date:
  • Size: 101.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.19

File hashes

Hashes for langcheck-0.8.0.tar.gz
Algorithm | Hash digest
--- | ---
SHA256 | fca18a01b766f2cf2d9793d7398e900119c60d2f9cd50b937a8751d0fffd99a2
MD5 | c9350c2629ade7ea9f93371751fbbb41
BLAKE2b-256 | a483175f6203c5cf1583fb01ab6ef0ed244136fbd5bf87e8930a563995ae2a78

See more details on using hashes here.

File details

Details for the file langcheck-0.8.0-py3-none-any.whl.

File metadata

  • Download URL: langcheck-0.8.0-py3-none-any.whl
  • Upload date:
  • Size: 167.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.19

File hashes

Hashes for langcheck-0.8.0-py3-none-any.whl
Algorithm | Hash digest
--- | ---
SHA256 | e7de0581d76def0a47dd3f274b7e951a502aa0a84c8086f5a1afecdfa3b6a604
MD5 | c75a310b9ba30bf9b1890d873644998e
BLAKE2b-256 | c128575365f941b11eefa91a2a7b4bb1d8dc25d79ffcdea16fae0edf19ee41be

See more details on using hashes here.
