
Simple, Pythonic building blocks to evaluate LLM-based applications

Project description


Install | Examples | Docs | 日本語 (Japanese)

Install

pip install langcheck

Examples

Evaluate Text

Use LangCheck's suite of metrics to evaluate LLM-generated text.

import langcheck

# Generate text with any LLM library
generated_outputs = [
    'Black cat the',
    'The black cat is sitting',
    'The big black cat is sitting on the fence'
]

# Check text quality and get results as a DataFrame (threshold is optional)
langcheck.eval.fluency(generated_outputs) > 0.5

[Screenshot: EvalValueWithThreshold results shown as a DataFrame]

It's easy to turn LangCheck metrics into unit tests: just use assert.

assert langcheck.eval.fluency(generated_outputs) > 0.5
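For example, a minimal pytest-style test might look like this (the outputs and threshold here are illustrative; the assert passes when the thresholded result is truthy, as above):

import langcheck

def test_fluency():
    # In a real test, these outputs would come from your LLM app
    generated_outputs = [
        'The black cat is sitting',
        'The big black cat is sitting on the fence'
    ]
    assert langcheck.eval.fluency(generated_outputs) > 0.5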

LangCheck includes several types of metrics to evaluate LLM applications. Some examples:

# 1. Reference-Free Text Quality Metrics
langcheck.eval.toxicity(generated_outputs)
langcheck.eval.fluency(generated_outputs)
langcheck.eval.sentiment(generated_outputs)
langcheck.eval.flesch_kincaid_grade(generated_outputs)

# 2. Reference-Based Text Quality Metrics
langcheck.eval.factual_consistency(generated_outputs, reference_outputs)
langcheck.eval.semantic_sim(generated_outputs, reference_outputs)
langcheck.eval.rouge2(generated_outputs, reference_outputs)
langcheck.eval.exact_match(generated_outputs, reference_outputs)

# 3. Text Structure Metrics
langcheck.eval.is_int(generated_outputs, domain=range(1, 6))
langcheck.eval.is_float(generated_outputs, min=0, max=None)
langcheck.eval.is_json_array(generated_outputs)
langcheck.eval.is_json_object(generated_outputs)
langcheck.eval.contains_regex(generated_outputs, r"\d{5,}")
langcheck.eval.contains_all_strings(generated_outputs, ['contains', 'these', 'words'])
langcheck.eval.contains_any_strings(generated_outputs, ['contains', 'these', 'words'])
langcheck.eval.validation_fn(generated_outputs, lambda x: 'myKey' in json.loads(x))
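As a concrete example, the text structure metrics can validate that outputs parse as the JSON you expect. A minimal sketch (the outputs and key name are illustrative, and it assumes these binary metrics score each output 0 or 1, compared against a threshold just like the fluency example above):

import json

import langcheck

generated_outputs = ['{"myKey": 1}', '{"myKey": 2}']

# A > 0 threshold passes only when every output scores 1
assert langcheck.eval.is_json_object(generated_outputs) > 0

# validation_fn applies an arbitrary predicate to each output
assert langcheck.eval.validation_fn(
    generated_outputs, lambda x: 'myKey' in json.loads(x)) > 0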

Some LangCheck metrics support using the OpenAI API. To use the OpenAI option, make sure to set the API key:

import openai
from langcheck.eval.en import semantic_sim

# https://platform.openai.com/account/api-keys
openai.api_key = YOUR_OPENAI_API_KEY

generated_outputs = ["The cat is sitting on the mat."]
reference_outputs = ["The cat sat on the mat."]
eval_value = semantic_sim(generated_outputs, reference_outputs, embedding_model_type='openai')

Or, if you're using the Azure API type, make sure to set all of the necessary variables:

import openai
from langcheck.eval.en import semantic_sim

openai.api_type = 'azure'
openai.api_base = YOUR_AZURE_OPENAI_ENDPOINT
openai.api_version = YOUR_API_VERSION
openai.api_key = YOUR_OPENAI_API_KEY

generated_outputs = ["The cat is sitting on the mat."]
reference_outputs = ["The cat sat on the mat."]

# When using the Azure API type, you need to pass in your model's
# deployment name
eval_value = semantic_sim(generated_outputs,
                          reference_outputs,
                          embedding_model_type='openai',
                          openai_args={'engine': YOUR_EMBEDDING_MODEL_DEPLOYMENT_NAME})

Visualize Metrics

LangCheck comes with built-in, interactive visualizations of metrics.

# Choose some metrics
fluency_values = langcheck.eval.fluency(generated_outputs)
sentiment_values = langcheck.eval.sentiment(generated_outputs)

# Interactive scatter plot of one metric
fluency_values.scatter()

[Screenshot: scatter plot for one metric]

# Interactive scatter plot of two metrics
langcheck.plot.scatter(fluency_values, sentiment_values)

[Screenshot: scatter plot for two metrics]

# Interactive histogram of a single metric
fluency_values.histogram()

[Screenshot: histogram for one metric]

Augment Data (coming soon)

# Assume `prompts` is a list of input prompts to your LLM app
more_prompts = []
more_prompts += langcheck.augment.keyboard_typo(prompts)
more_prompts += langcheck.augment.ocr_typo(prompts)
more_prompts += langcheck.augment.synonym(prompts)
more_prompts += langcheck.augment.gender(prompts, to_gender='male')
more_prompts += langcheck.augment.gpt35_rewrite(prompts)

Building Blocks for Monitoring

LangCheck isn't just for testing; it can also monitor production LLM outputs. Just save the outputs and pass them into LangCheck.

from langcheck.utils import load_json

recorded_outputs = load_json('llm_logs_2023_10_02.json')['outputs']
langcheck.eval.toxicity(recorded_outputs) < 0.25
langcheck.eval.is_json_array(recorded_outputs)
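In a scheduled monitoring job you would typically act on the result rather than just compute it. A minimal sketch, assuming the thresholded result exposes .all()/.any() as in the guardrails example below (send_alert is a hypothetical hook in your own stack):

import langcheck
from langcheck.utils import load_json

recorded_outputs = load_json('llm_logs_2023_10_02.json')['outputs']

# True per output when toxicity is below the threshold
toxicity_ok = langcheck.eval.toxicity(recorded_outputs) < 0.25
if not toxicity_ok.all():
    send_alert('Toxic outputs detected in production logs')  # hypothetical alert hook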

Building Blocks for Guardrails

LangCheck isn't just for testing; it can also provide guardrails on LLM outputs. Just filter candidate outputs through LangCheck.

raw_output = my_llm_app(random_user_prompt)
while langcheck.eval.contains_any_strings([raw_output], blacklist_words).any():
    raw_output = my_llm_app(random_user_prompt)
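Since an unlucky model could fail the check repeatedly, you may want to bound the retries rather than loop forever. A sketch under the same assumptions as above (my_llm_app, blacklist_words, and the fallback message are illustrative):

for _ in range(3):  # try up to 3 generations
    raw_output = my_llm_app(random_user_prompt)
    if not langcheck.eval.contains_any_strings([raw_output], blacklist_words).any():
        break  # output passed the guardrail
else:
    # every attempt failed the check; fall back to a safe canned response
    raw_output = "Sorry, I can't help with that."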

Docs

Full documentation is available on ReadTheDocs.


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

langcheck-0.0.6.tar.gz (30.3 kB)

Built Distribution

langcheck-0.0.6-py3-none-any.whl (37.4 kB, Python 3)
