A library for evaluating LLM-based applications.

GLLM Evaluator SDK

A comprehensive evaluation framework for Generative AI applications including LLM outputs, AI Agent responses, and RAG (Retrieval-Augmented Generation) systems.

Overview

The GLLM Evaluator SDK provides an extensible framework for evaluating AI systems across the GDP Labs ecosystem. Built with an integration-first philosophy, it lets teams assess the quality of generated content from any AI system and connect the results to experiment tracking and observability platforms.

Philosophy

Easy Evaluation Everywhere: Standardize evaluation practices across all GDP Labs AI applications with minimal setup and maximum flexibility.

Integration-First Design: Built to work seamlessly with your existing experiment tracking, observability, and MLOps infrastructure.

Extensible by Design: Add new evaluators, metrics, and integrations without breaking existing workflows.

Key Features

  • 🌐 GDP Labs Ecosystem Ready: Standardized evaluation framework across all internal AI applications
  • 🔌 Seamless Integration: Easy integration with experiment tracking and observability platforms
  • 🚀 Async-First Design: High-performance async evaluation with parallel processing
  • 🔧 Extensible Architecture: Easy to add new evaluators and metrics for any use case
  • 🤖 LLM as a Judge: Advanced language models for nuanced, contextual evaluation
  • 📐 Traditional Metrics: Support for classical evaluation metrics and custom scoring functions
  • 🔗 Popular Evaluator Integration: Integration with popular evaluators such as RAGAS, DeepEval, and LangChain
  • Zero-Config Start: Get started with sensible defaults, then customize as needed

Installation

Prerequisites

Mandatory:

  1. Python 3.11+
  2. pip
  3. uv
  4. gcloud CLI (for authentication). After installing it, log in using:

    gcloud auth login


Install from Artifact

Because gllm-evals is a private library hosted in a secure Google Cloud repository, you must provide an access token to install it. The command below handles this authorization inline by using an access token from the gcloud CLI.

uv pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ gllm-evals
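
When the install has to be scripted (for example in CI), the same index URL can be assembled in Python. The following is a minimal sketch and not part of gllm-evals: `build_index_url` is a hypothetical helper, and it assumes the token string was obtained from `gcloud auth print-access-token`, as in the command above.

```python
from urllib.parse import quote

# Host and path copied from the install command above.
INDEX_HOST_PATH = "glsdk.gdplabs.id/gen-ai-internal/simple/"


def build_index_url(token: str) -> str:
    """Return the authenticated extra-index URL (hypothetical helper)."""
    token = token.strip()  # command output usually ends with a newline
    if not token:
        raise ValueError("empty access token; run `gcloud auth login` first")
    # Percent-encode the token so characters such as '/' or '@' cannot break the URL.
    return f"https://oauth2accesstoken:{quote(token, safe='')}@{INDEX_HOST_PATH}"
```

The returned string can be passed as the value of `--extra-index-url`. Because access tokens expire, it is safer to build the URL right before each install than to store it.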

Local Development Setup

Prerequisites

  1. Python 3.11+

  2. pip

  3. uv

  4. gcloud CLI. After installing it, log in using:

    gcloud auth login

  5. Git

  6. Access to the GDP Labs SDK GitHub repository


1. Clone Repository

git clone git@github.com:GDP-ADMIN/gl-sdk.git
cd gl-sdk/libs/gllm-evals

2. Setup Authentication

Because gllm-evals is a private library, uv must authenticate with our internal Google Cloud package indexes before it can install dependencies. Set the following environment variables in the shell where you will run the setup:

export UV_INDEX_GEN_AI_INTERNAL_USERNAME=oauth2accesstoken
export UV_INDEX_GEN_AI_INTERNAL_PASSWORD="$(gcloud auth print-access-token)"
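
A forgotten export typically surfaces later as an authentication failure during setup, so a small preflight check can save time. The sketch below is an illustration only: `missing_uv_credentials` is a hypothetical helper, and the only names taken from this document are the two variable names.

```python
import os
from collections.abc import Mapping

# Variable names taken from the export commands above.
REQUIRED_VARS = (
    "UV_INDEX_GEN_AI_INTERNAL_USERNAME",
    "UV_INDEX_GEN_AI_INTERNAL_PASSWORD",
)


def missing_uv_credentials(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Calling `missing_uv_credentials()` with no arguments inspects the current process environment; an empty list means both variables are present.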

3. Quick Setup

Run:

make setup

4. Activate Virtual Environment

source .venv/bin/activate

Local Development Utilities

The following Makefile commands are available for quick operations:

Install uv

make install-uv

Install Pre-Commit

make install-pre-commit

Install Dependencies

make install

Update Dependencies

make update

Run Tests

make test

Adding the Package

Once authorization is configured, you can add gllm-evals to your project:

uv add gllm-evals

Dependencies

The SDK requires:

  • gllm-core and gllm-inference for LLM interactions
  • pydantic for data validation

Quick Start

Basic Usage

import asyncio
import os
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator

async def main():
    # Initialize the evaluator
    evaluator = GEvalGenerationEvaluator(
        model_credentials=os.getenv("GOOGLE_API_KEY")
    )

    # Prepare evaluation data
    data = {
        "query": "What is the capital of France?",
        "expected_response": "Paris is the capital of France.",
        "generated_response": "The capital of France is Paris.",
        "retrieved_context": "Paris is the capital and largest city of France."
    }

    # Evaluate
    result = await evaluator.evaluate(data)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())
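
The record passed to the evaluator above has four text fields. Checking records before sending them to an LLM-backed evaluator avoids spending API calls on malformed input. This is a minimal sketch: `find_record_problems` is a hypothetical helper, and whether the SDK itself requires all four fields for every evaluator is an assumption.

```python
# Field names taken from the example above; treating all four as mandatory
# is an assumption of this sketch, not documented SDK behavior.
REQUIRED_FIELDS = (
    "query",
    "expected_response",
    "generated_response",
    "retrieved_context",
)


def find_record_problems(record: dict) -> list[str]:
    """Return human-readable problems; an empty list means the record looks usable."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None:
            problems.append(f"missing field: {field}")
        elif not isinstance(value, str) or not value.strip():
            problems.append(f"empty or non-text field: {field}")
    return problems
```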

Batch Evaluation

import asyncio
import os
from gllm_evals.dataset.dict_dataset import DictDataset
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.runner import Runner
from gllm_evals.experiment_tracker.csv_experiment_tracker import CSVExperimentTracker

async def batch_evaluation():
    # Initialize evaluator
    evaluator = GEvalGenerationEvaluator(
        model_credentials=os.getenv("GOOGLE_API_KEY"),
        run_parallel=True  # Enable parallel processing
    )

    # Create dataset
    dataset = DictDataset([
        {
            "query": "What is the capital of France?",
            "expected_response": "Paris",
            "generated_response": "Paris is the capital of France.",
            "retrieved_context": "Paris is the capital of France."
        },
        {
            "query": "What is 1 + 1?",
            "expected_response": "2",
            "generated_response": "The answer is 2.",
            "retrieved_context": "1 + 1 equals 2."
        }
    ])

    # Run evaluation
    runner = Runner(evaluator, batch_size=10)
    results = await runner.evaluate(dataset)

    # Track results
    tracker = CSVExperimentTracker(score_key="generation/score")
    tracker.log_batch(results)

    print(f"Evaluation Results: {tracker.get_results()}")

if __name__ == "__main__":
    asyncio.run(batch_evaluation())
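
To make the idea of `batch_size` concrete, the sketch below shows one common way to batch async work using only the standard library. It is not the SDK's Runner implementation: `evaluate_in_batches` and the stand-in evaluator `exact_match` are hypothetical names, and the real Runner may schedule work differently.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def evaluate_in_batches(
    rows: list[dict],
    evaluate: Callable[[dict], Awaitable[dict]],
    batch_size: int = 10,
) -> list[dict]:
    """Evaluate rows batch by batch; rows within a batch run concurrently."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    results: list[dict] = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # gather() preserves input order, so results line up with rows.
        results.extend(await asyncio.gather(*(evaluate(row) for row in batch)))
    return results


async def exact_match(row: dict) -> dict:
    """Stand-in evaluator (hypothetical): 1.0 if the expected text appears in the response."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    hit = row["expected_response"].lower() in row["generated_response"].lower()
    return {"score": 1.0 if hit else 0.0}
```

Bounding concurrency per batch keeps the number of simultaneous model requests predictable, which matters when the evaluator is rate-limited.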

Custom Metrics

Create a domain-specific metric by subclassing BaseMetric and implementing _evaluate:

from gllm_evals.metrics.metric import BaseMetric
from gllm_evals.types import MetricInput, MetricOutput

class DomainSpecificMetric(BaseMetric):
    """Custom metric for domain-specific evaluation."""

    name = "domain_accuracy"

    def calculate_domain_score(self, data: MetricInput) -> float:
        # Placeholder helper (not provided by BaseMetric): replace with your own scoring logic.
        raise NotImplementedError

    async def _evaluate(self, data: MetricInput) -> MetricOutput:
        # Your domain-specific evaluation logic
        score = self.calculate_domain_score(data)
        return {"score": score, "explanation": "Domain-specific reasoning"}
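
As an illustration of what the scoring logic inside such a metric might look like, here is a self-contained keyword-coverage function. It is a sketch under stated assumptions: `keyword_coverage` and the idea of a list of required domain terms are hypothetical and do not come from the SDK.

```python
def keyword_coverage(generated_response: str, required_terms: list[str]) -> float:
    """Fraction of required domain terms found in the response (case-insensitive)."""
    if not required_terms:
        raise ValueError("required_terms must not be empty")
    text = generated_response.lower()
    # Simple substring matching; a real metric may need tokenization or stemming.
    found = sum(1 for term in required_terms if term.lower() in text)
    return found / len(required_terms)
```

A function like this could be called from a custom metric's scoring method, with the result returned under the "score" key.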

Architecture

Core Components

1. Evaluators

  • BaseEvaluator: Abstract base class for all evaluators - extend for any evaluation scenario
  • GEvalGenerationEvaluator: Production-ready GEval-backed evaluator for text generation quality with rule-based scoring

2. Metrics

  • BaseMetric: Abstract base class for metrics - create custom metrics for any domain
  • LMBasedMetric: Generic LM-powered metric evaluation with customizable prompts

3. Datasets

  • BaseDataset: Abstract base class for datasets - support any data format
  • DictDataset: Simple dictionary-based dataset implementation

4. Runner

  • Runner: Runs an evaluator over a dataset in batches
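
The relationship between metrics and evaluators can be pictured with a small standard-library sketch. Everything below is hypothetical: `SketchMetric`, `LengthRatioMetric`, and `SketchEvaluator` are stand-ins that only mirror the abstract-base-class pattern described above, not the SDK's actual BaseMetric or BaseEvaluator interfaces.

```python
import asyncio
from abc import ABC, abstractmethod


class SketchMetric(ABC):
    """Stand-in for a metric base class (hypothetical, not the SDK's BaseMetric)."""

    name: str

    @abstractmethod
    async def evaluate(self, data: dict) -> dict:
        """Return a dict that contains at least a 'score' key."""


class LengthRatioMetric(SketchMetric):
    """Toy metric: response length relative to the expected answer, capped at 1."""

    name = "length_ratio"

    async def evaluate(self, data: dict) -> dict:
        expected = max(len(data["expected_response"]), 1)
        ratio = len(data["generated_response"]) / expected
        return {"score": min(ratio, 1.0)}


class SketchEvaluator:
    """Runs every metric on one record and collects scores by metric name."""

    def __init__(self, metrics: list[SketchMetric]) -> None:
        self.metrics = metrics

    async def evaluate(self, data: dict) -> dict:
        # Metrics are independent of each other, so they can run concurrently.
        outputs = await asyncio.gather(*(m.evaluate(data) for m in self.metrics))
        return {m.name: out["score"] for m, out in zip(self.metrics, outputs)}
```

Keeping metrics behind one abstract interface is what allows new ones to be added without changing the evaluator that composes them.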

Metrics

Below is a list of metrics that are currently supported by the SDK.

| Metric | Description | Type | Score Range |
|---|---|---|---|
| LMBasedMetric | An all-purpose metric for any criterion that can be expressed as an LM prompt | LM-based | - |
| DeepEvalGEvalMetric | A versatile framework for creating custom evaluation metrics with configurable criteria, evaluation steps, and rubrics | LM-based | - |
| GEvalCompletenessMetric | Evaluates the completeness of the generated output | DeepEval GEval | 1-3 |
| GEvalRedundancyMetric | Evaluates the redundancy of the generated output | DeepEval GEval | 1-3 |
| GEvalGroundednessMetric | Evaluates the groundedness of the generated output | DeepEval GEval | 1-3 |
| GEvalLanguageConsistencyMetric | Evaluates language consistency between the query and the generated response | DeepEval GEval | 0-1 |
| GEvalRefusalMetric | Evaluates refusal behavior from the query and the expected response | DeepEval GEval | 0-1 |
| GEvalRefusalAlignmentMetric | Evaluates refusal alignment between the expected and generated responses | DeepEval GEval | 0-1 |
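
Because the listed metrics use different score ranges (1-3 and 0-1), comparing or averaging them requires a common scale. The sketch below uses plain min-max scaling. This is an assumption about how you might post-process scores, not something the SDK is documented to do, and `normalize_score` is a hypothetical helper.

```python
def normalize_score(score: float, low: float, high: float) -> float:
    """Map a score from [low, high] onto [0, 1] with min-max scaling."""
    if high <= low:
        raise ValueError("high must be greater than low")
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside [{low}, {high}]")
    return (score - low) / (high - low)
```

For example, a completeness score of 2 on the 1-3 scale maps to 0.5. Whether a higher score is better for every metric (redundancy, for instance) is not stated here, so check each metric's rubric before averaging normalized values.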

Evaluators

Below is a list of evaluators that are currently supported by the SDK.

| Evaluator | Description | Type |
|---|---|---|
| GEvalGenerationEvaluator | An evaluator that can be used to evaluate the quality of the generated output | LLM-based |

Datasets

Below is a list of datasets that are currently supported by the SDK.

| Dataset | Description |
|---|---|
| DictDataset | A dataset that loads data from a dictionary |
| HuggingFaceDataset | A dataset that loads data from a HuggingFace dataset |

Built Distributions

Version 0.1.13 is published as prebuilt gllm_evals_binary wheels for the following targets:

  • CPython 3.13: Windows x86-64 (2.2 MB), manylinux glibc 2.31+ x86-64 (3.2 MB), macOS 13.0+ ARM64 (2.5 MB)
  • CPython 3.12: Windows x86-64 (2.2 MB), manylinux glibc 2.31+ x86-64 (3.2 MB), macOS 13.0+ ARM64 (2.5 MB)
  • CPython 3.11: Windows x86-64 (2.3 MB), manylinux glibc 2.31+ x86-64 (2.9 MB), macOS 13.0+ ARM64 (2.4 MB)
