
A library for evaluating LLM-based applications.


GLLM Evaluator SDK

A comprehensive evaluation framework for Generative AI applications including LLM outputs, AI Agent responses, and RAG (Retrieval-Augmented Generation) systems.

Overview

The GLLM Evaluator SDK provides an extensible framework for evaluating AI systems across the GDP Labs ecosystem. Built with an integration-first philosophy, it lets teams assess the quality of generated content from any AI system and connect the results to experiment tracking and observability platforms.

Philosophy

Easy Evaluation Everywhere: Standardize evaluation practices across all GDP Labs AI applications with minimal setup and maximum flexibility.

Integration-First Design: Built to work seamlessly with your existing experiment tracking, observability, and MLOps infrastructure.

Extensible by Design: Add new evaluators, metrics, and integrations without breaking existing workflows.

Key Features

  • 🌐 GDP Labs Ecosystem Ready: Standardized evaluation framework across all internal AI applications
  • 🔌 Seamless Integration: Easy integration with experiment tracking and observability platforms
  • 🚀 Async-First Design: High-performance async evaluation with parallel processing
  • 🔧 Extensible Architecture: Easy to add new evaluators and metrics for any use case
  • 🤖 LLM as a Judge: Advanced language models for nuanced, contextual evaluation
  • 📐 Traditional Metrics: Support for classical evaluation metrics and custom scoring functions
  • 🔗 Popular Evaluator Integration: Works with popular evaluators such as RAGAS, DeepEval, and LangChain
  • Zero-Config Start: Get started with sensible defaults, customize as needed

Installation

Prerequisites

Mandatory:

  1. Python 3.11+
  2. pip
  3. uv
  4. gcloud CLI (for authentication). After installing it, log in using:
    gcloud auth login

Install from Artifact

Because gllm-evals is a private library hosted in a secure Google Cloud repository, you must provide an access token to install it. The command below handles this authorization inline by using an access token from the gcloud CLI.

uv pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ gllm-evals
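
To make the inline authorization above easier to follow, here is a minimal Python sketch of how the index URL is assembled from a token. The host and the `oauth2accesstoken` username come from the command above. The helper name `build_index_url` is hypothetical, and percent-encoding the token is a defensive assumption rather than something the install command does.

```python
from urllib.parse import quote

# Host and path taken from the install command above.
INDEX_HOST = "glsdk.gdplabs.id/gen-ai-internal/simple/"


def build_index_url(access_token: str) -> str:
    """Return the authenticated extra-index URL for the given token (hypothetical helper)."""
    token = access_token.strip()  # command output usually ends with a newline
    if not token:
        raise ValueError("access token is empty; run `gcloud auth login` first")
    # Percent-encode the token so characters such as '/' or '@' cannot break the URL.
    return f"https://oauth2accesstoken:{quote(token, safe='')}@{INDEX_HOST}"
```

The resulting string is what the shell substitution in the command produces, so it can be passed to `--extra-index-url` by any tool that launches uv.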

Local Development Setup

Prerequisites

  1. Python 3.11+
  2. pip
  3. uv
  4. gcloud CLI. After installing it, log in using:
    gcloud auth login
  5. Git
  6. Access to the GDP Labs SDK GitHub repository


1. Clone Repository

git clone git@github.com:GDP-ADMIN/gl-sdk.git
cd gl-sdk/libs/gllm-evals

2. Setup Authentication

Because gllm-evals is a private library, uv must authenticate with the internal Google Cloud package index before it can resolve dependencies. Set the following environment variables in your shell:

export UV_INDEX_GEN_AI_INTERNAL_USERNAME=oauth2accesstoken
export UV_INDEX_GEN_AI_INTERNAL_PASSWORD="$(gcloud auth print-access-token)"
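
If setup fails with an authentication error, it can help to confirm that both variables are visible to the process that runs uv. The following is a small sketch of such a check. The variable names are the ones exported above; the helper `missing_auth_vars` is hypothetical and not part of the SDK or of uv.

```python
import os
from collections.abc import Mapping

# Variable names taken from the export commands above.
REQUIRED_VARS = (
    "UV_INDEX_GEN_AI_INTERNAL_USERNAME",
    "UV_INDEX_GEN_AI_INTERNAL_PASSWORD",
)


def missing_auth_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Calling `missing_auth_vars()` with no arguments inspects the current environment; an empty list means both credentials are set. The check cannot tell whether the token is still valid.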

3. Quick Setup

Run:

make setup

4. Activate Virtual Environment

source .venv/bin/activate

Local Development Utilities

The following Makefile commands are available for quick operations:

Install uv

make install-uv

Install Pre-Commit

make install-pre-commit

Install Dependencies

make install

Update Dependencies

make update

Run Tests

make test

Adding the Package

Once authorization is configured, you can add gllm-evals to your project:

uv add gllm-evals

Dependencies

The SDK requires:

  • gllm-core and gllm-inference for LLM interactions
  • pydantic for data validation

Quick Start

Basic Usage

import asyncio
import os
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator

async def main():
    # Initialize the evaluator
    evaluator = GEvalGenerationEvaluator(
        model_credentials=os.getenv("GOOGLE_API_KEY")
    )

    # Prepare evaluation data
    data = {
        "query": "What is the capital of France?",
        "expected_response": "Paris is the capital of France.",
        "generated_response": "The capital of France is Paris.",
        "retrieved_context": "Paris is the capital and largest city of France."
    }

    # Evaluate
    result = await evaluator.evaluate(data)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())
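
Before sending a record to an LLM-backed evaluator, a cheap local check can catch malformed input. The sketch below validates the four keys used in the example above. It assumes that all four are needed and that each holds a non-empty string; the SDK may accept other shapes, and `find_record_problems` is a hypothetical helper, not an SDK function.

```python
# Keys taken from the Basic Usage example; treating all four as required is an assumption.
REQUIRED_FIELDS = ("query", "expected_response", "generated_response", "retrieved_context")


def find_record_problems(record: dict) -> list[str]:
    """Return human-readable problems; an empty list means the record looks usable."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None:
            problems.append(f"missing field: {field}")
        elif not isinstance(value, str) or not value.strip():
            problems.append(f"empty or non-string field: {field}")
    return problems
```

A caller could skip or log any record for which the returned list is not empty, which avoids spending a model call on input that cannot be scored.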

Batch Evaluation

import asyncio
import os
from gllm_evals.dataset.dict_dataset import DictDataset
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.runner import Runner
from gllm_evals.experiment_tracker.csv_experiment_tracker import CSVExperimentTracker

async def batch_evaluation():
    # Initialize evaluator
    evaluator = GEvalGenerationEvaluator(
        model_credentials=os.getenv("GOOGLE_API_KEY"),
        run_parallel=True  # Enable parallel processing
    )

    # Create dataset
    dataset = DictDataset([
        {
            "query": "What is the capital of France?",
            "expected_response": "Paris",
            "generated_response": "Paris is the capital of France.",
            "retrieved_context": "Paris is the capital of France."
        },
        {
            "query": "What is 1 + 1?",
            "expected_response": "2",
            "generated_response": "The answer is 2.",
            "retrieved_context": "1 + 1 equals 2."
        }
    ])

    # Run evaluation
    runner = Runner(evaluator, batch_size=10)
    results = await runner.evaluate(dataset)

    # Track results
    tracker = CSVExperimentTracker(score_key="generation/score")
    tracker.log_batch(results)

    print(f"Evaluation Results: {tracker.get_results()}")

if __name__ == "__main__":
    asyncio.run(batch_evaluation())
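
To illustrate what batched, parallel evaluation can look like, here is a self-contained sketch built on `asyncio.gather`. It is not the `Runner` implementation, whose internals are not described here. `fake_evaluate` is a hypothetical stand-in for a real evaluator call, and `evaluate_in_batches` is a hypothetical helper.

```python
import asyncio


async def fake_evaluate(record: dict) -> dict:
    """Stand-in for an evaluator call; scores 1.0 on a case-insensitive substring match."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    hit = record["expected_response"].lower() in record["generated_response"].lower()
    return {"query": record["query"], "score": 1.0 if hit else 0.0}


async def evaluate_in_batches(records: list[dict], batch_size: int = 10) -> list[dict]:
    """Evaluate records batch by batch, running each batch concurrently."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    results: list[dict] = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        # gather preserves input order, so results line up with records.
        results.extend(await asyncio.gather(*(fake_evaluate(r) for r in batch)))
    return results
```

In this sketch the batch size caps how many evaluations are in flight at once, which is one common way to stay under a model provider's rate limits.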

Custom Metrics

Create domain-specific metrics easily:

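from gllm_evals.metrics.metric import BaseMetric
from gllm_evals.types import MetricInput, MetricOutput

class DomainSpecificMetric(BaseMetric):
    """Custom metric for domain-specific evaluation."""

    name = "domain_accuracy"

    def calculate_domain_score(self, data: MetricInput) -> float:
        # Hypothetical helper, not part of the SDK: replace the body with
        # your own scoring logic.
        raise NotImplementedError("Implement domain-specific scoring here")

    async def _evaluate(self, data: MetricInput) -> MetricOutput:
        # Delegate to the domain-specific scoring helper defined above.
        score = self.calculate_domain_score(data)
        return {"score": score, "explanation": "Domain-specific reasoning"}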
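
As a concrete illustration of the scoring logic such a metric might contain, the sketch below computes token overlap between the expected and generated responses. It mirrors the shape of the class above (a `name` attribute and an async `_evaluate` that returns a score and an explanation) but deliberately does not import the SDK, so it runs on its own. `TokenOverlapMetric` is a hypothetical stand-in, and treating the input as a plain dict is an assumption.

```python
import asyncio


class TokenOverlapMetric:
    """Stand-in with the same shape as the metric above; not an SDK class."""

    name = "token_overlap"

    async def _evaluate(self, data: dict) -> dict:
        # Compare lowercase, whitespace-separated tokens as sets.
        expected = set(data["expected_response"].lower().split())
        generated = set(data["generated_response"].lower().split())
        if not expected:
            return {"score": 0.0, "explanation": "Expected response is empty."}
        matched = expected & generated
        # Fraction of expected tokens that also appear in the generated response.
        score = len(matched) / len(expected)
        return {
            "score": score,
            "explanation": f"{len(matched)} of {len(expected)} expected tokens found.",
        }
```

Because the tokens are split on whitespace only, punctuation stays attached to words; a real metric would likely normalize text more carefully.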

Architecture

Core Components

1. Evaluators

  • BaseEvaluator: Abstract base class for all evaluators - extend for any evaluation scenario
  • GEvalGenerationEvaluator: Production-ready GEval-backed evaluator for text generation quality with rule-based scoring

2. Metrics

  • BaseMetric: Abstract base class for metrics - create custom metrics for any domain
  • LMBasedMetric: Generic LM-powered metric evaluation with customizable prompts

3. Datasets

  • BaseDataset: Abstract base class for datasets - support any data format
  • DictDataset: Simple dictionary-based dataset implementation

4. Runner

  • Runner: Runner class for batch evaluation

Metrics

Below is a list of metrics that are currently supported by the SDK.

| Metric | Description | Type | Score Range |
| --- | --- | --- | --- |
| LMBasedMetric | A general-purpose metric for any criterion that can be expressed as an LM prompt | LM-based | - |
| DeepEvalGEvalMetric | A configurable metric framework for building custom metrics from criteria, evaluation steps, and rubrics | LM-based | - |
| GEvalCompletenessMetric | Evaluates the completeness of the generated output | DeepEval GEval | 1-3 |
| GEvalRedundancyMetric | Evaluates the redundancy of the generated output | DeepEval GEval | 1-3 |
| GEvalGroundednessMetric | Evaluates the groundedness of the generated output | DeepEval GEval | 1-3 |
| GEvalLanguageConsistencyMetric | Evaluates language consistency between the query and the generated response | DeepEval GEval | 0-1 |
| GEvalRefusalMetric | Evaluates refusal behavior from the query and the expected response | DeepEval GEval | 0-1 |
| GEvalRefusalAlignmentMetric | Evaluates refusal alignment between the expected and generated responses | DeepEval GEval | 0-1 |
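
Because the metrics above report scores on different ranges (1-3 and 0-1), comparing or averaging them usually requires putting them on a common scale first. The sketch below shows one way to do that with linear rescaling. The ranges are copied from the list above; `normalize_score` and `SCORE_RANGES` are hypothetical names, and the sketch does not decide whether a higher score is better for a given metric.

```python
# Score ranges copied from the metrics list above.
SCORE_RANGES = {
    "GEvalCompletenessMetric": (1, 3),
    "GEvalRedundancyMetric": (1, 3),
    "GEvalGroundednessMetric": (1, 3),
    "GEvalLanguageConsistencyMetric": (0, 1),
    "GEvalRefusalMetric": (0, 1),
    "GEvalRefusalAlignmentMetric": (0, 1),
}


def normalize_score(score: float, low: float, high: float) -> float:
    """Linearly rescale a score from [low, high] to [0, 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside [{low}, {high}]")
    return (score - low) / (high - low)
```

For example, a completeness score of 2 on the 1-3 range maps to 0.5, while scores already on the 0-1 range are returned unchanged.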

Evaluators

Below is a list of evaluators that are currently supported by the SDK.

| Evaluator | Description | Type |
| --- | --- | --- |
| GEvalGenerationEvaluator | An evaluator that can be used to evaluate the quality of the generated output | LLM-based |

Datasets

Below is a list of datasets that are currently supported by the SDK.

| Dataset | Description |
| --- | --- |
| DictDataset | A dataset that loads data from a dictionary |
| HuggingFaceDataset | A dataset that loads data from a HuggingFace dataset |
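
Evaluation records are often stored as JSON Lines rather than written inline. The sketch below reads such a file into a list of dictionaries, which is the shape passed to `DictDataset` in the Batch Evaluation example. `read_jsonl_records` is a hypothetical helper, and the JSON Lines format is an assumption about how your data is stored, not an SDK requirement.

```python
import io
import json
from typing import TextIO


def read_jsonl_records(stream: TextIO) -> list[dict]:
    """Parse one JSON object per non-blank line into a list of dicts."""
    records = []
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError(f"line {line_number}: expected a JSON object")
        records.append(record)
    return records
```

The function accepts any text stream, so it works with an open file as well as with `io.StringIO` in tests.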


Built Distributions

Release 0.1.12.post1 is published as the following wheels:

  • gllm_evals_binary-0.1.12.post1-cp313-cp313-win_amd64.whl (2.0 MB): CPython 3.13, Windows x86-64
  • gllm_evals_binary-0.1.12.post1-cp313-cp313-manylinux_2_31_x86_64.whl (3.0 MB): CPython 3.13, manylinux (glibc 2.31+) x86-64
  • gllm_evals_binary-0.1.12.post1-cp313-cp313-macosx_13_0_arm64.whl (2.3 MB): CPython 3.13, macOS 13.0+ ARM64
  • gllm_evals_binary-0.1.12.post1-cp312-cp312-win_amd64.whl (2.0 MB): CPython 3.12, Windows x86-64
  • gllm_evals_binary-0.1.12.post1-cp312-cp312-manylinux_2_31_x86_64.whl (3.0 MB): CPython 3.12, manylinux (glibc 2.31+) x86-64
  • gllm_evals_binary-0.1.12.post1-cp312-cp312-macosx_13_0_arm64.whl (2.3 MB): CPython 3.12, macOS 13.0+ ARM64
  • gllm_evals_binary-0.1.12.post1-cp311-cp311-win_amd64.whl (2.1 MB): CPython 3.11, Windows x86-64
  • gllm_evals_binary-0.1.12.post1-cp311-cp311-manylinux_2_31_x86_64.whl (2.7 MB): CPython 3.11, manylinux (glibc 2.31+) x86-64
  • gllm_evals_binary-0.1.12.post1-cp311-cp311-macosx_13_0_arm64.whl (2.3 MB): CPython 3.11, macOS 13.0+ ARM64
