Qualitatively evaluate LLMs

Gauge - LLM Evaluation

Gauge is a Python library for evaluating and comparing large language models (LLMs). Compare models on complex, custom tasks, alongside numeric measurements like latency and cost.

How does it work?

Gauge uses a model-on-model approach to evaluate LLMs qualitatively. An advanced arbiter model (GPT-4) evaluates the performance of smaller LLMs on specific tasks, providing a numeric score based on their output. This allows users to create custom benchmarks for their tasks and obtain qualitative evaluations of different LLMs. Gauge is useful for evaluating and ranking LLMs on a wide range of complex and subjective tasks, such as creative writing, staying in character, formatting outputs, extracting information, and translating text.
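The core idea can be sketched as follows. This is a minimal illustration of model-on-model scoring, not Gauge's actual implementation; the function name, prompt wording, and response format are assumptions:

import openai

def arbiter_score(task, candidate_output):
    # Ask the arbiter model (GPT-4) to grade another model's output on the task.
    # The prompt and the expected reply format below are illustrative only.
    prompt = (
        f"Task: {task}\n"
        f"Candidate response: {candidate_output}\n"
        "Rate the response from 0 to 10 and explain briefly, replying as:\n"
        "SCORE: <number>\nEXPLANATION: <one sentence>"
    )
    completion = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return completion["choices"][0]["message"]["content"]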

Features

  • Evaluate and compare multiple LLMs using custom benchmarks
  • Straightforward API for running and evaluating LLMs
  • Extensible architecture for including additional models

Installation

To install Gauge, run the following command:

pip install gauge-llm

Before using Gauge, set the HUGGINGFACE_TOKEN and REPLICATE_API_TOKEN environment variables, then import the openai library and set openai.api_key:

import os
import openai

os.environ["HUGGINGFACE_TOKEN"] = "your_huggingface_token"
os.environ["REPLICATE_API_TOKEN"] = "your_replicate_api_token"
openai.api_key = "your_openai_api_key"

Examples

Information Extraction: Historical Event

import gauge

query = "Extract the main points from the following paragraph: On July 20, 1969, American astronauts Neil Armstrong and Buzz Aldrin became the first humans to land on the Moon. Armstrong stepped onto the lunar surface and described the event as 'one small step for man, one giant leap for mankind.'"
gauge.evaluate(query)

Staying in Character: Detective's Monologue

import gauge

query = "Write a monologue for a detective character in a film noir setting."
gauge.evaluate(query)

Translation: English to Spanish

import gauge

query = "Translate the following English text to Spanish: 'The quick brown fox jumps over the lazy dog.'"
gauge.evaluate(query)

Formatting Output: Recipe Conversion

import gauge

query = "Convert the following recipe into a shopping list: 2 cups flour, 1 cup sugar, 3 eggs, 1/2 cup milk, 1/4 cup butter."
gauge.evaluate(query)

Each example displays a table of results for every model, including its name, response, score, explanation, latency, and cost.

API

gauge.run(model, query)

Runs the specified model with the given query and returns the output, latency, and cost.

Parameters:

  • model: A dictionary containing the model's information (type, name, id, and price_per_second).
  • query: The input query for the model.

Returns:

  • output: The generated output from the model.
  • latency: The time taken to run the model.
  • cost: The cost of running the model.
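A call might look like the sketch below. The model entry uses the keys documented above, but the specific values are placeholders, and unpacking the return as a tuple is an assumption based on the signature:

import gauge

# Hypothetical model entry; keys follow the API above, values are placeholders.
model = {
    "type": "replicate",
    "name": "vicuna-13b",
    "id": "replicate/vicuna-13b",
    "price_per_second": 0.0023,
}

output, latency, cost = gauge.run(model, "Summarize the plot of Hamlet in two sentences.")
print(output, latency, cost)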

gauge.evaluate(query)

Evaluates multiple LLMs using the given query and displays a table with the results, including the model's name, response, score, explanation, latency, and cost.

Parameters:

  • query: The input query for the models.

Contributing

Contributions to Gauge are welcome! If you'd like to add a new model or improve the existing code, please submit a pull request. If you encounter issues or have suggestions, open an issue on GitHub.

License

Gauge is released under the MIT License.

Acknowledgements

This project was created by Killian Lucas and Roger Hu during the AI Tinkerers Summer Hackathon, which took place on June 10th, 2023 in Seattle at Create 33. The event was sponsored by AWS Startups, Cohere, Madrona Venture Group, and supported by Pinecone, Weaviate, and Blueprint AI. Gauge made it to the semi-finals.
