Skip to main content

A library for evaluating LLM-based applications.

Project description

GLLM Evaluator SDK

A comprehensive evaluation framework for Generative AI applications including LLM outputs, AI Agent responses, and RAG (Retrieval-Augmented Generation) systems.

Overview

The GLLM Evaluator SDK provides a robust, extensible framework designed to make AI evaluation as simple and seamless as possible across the GDP Labs ecosystem. Built with integration-first philosophy, it enables teams to easily assess the quality of generated content from any AI system while seamlessly connecting with experiment tracking and observability platforms.

Philosophy

Easy Evaluation Everywhere: Standardize evaluation practices across all GDP Labs AI applications with minimal setup and maximum flexibility.

Integration-First Design: Built to work seamlessly with your existing experiment tracking, observability, and MLOps infrastructure.

Extensible by Design: Add new evaluators, metrics, and integrations without breaking existing workflows.

Key Features

  • 🌐 GDP Labs Ecosystem Ready: Standardized evaluation framework across all internal AI applications
  • 🔌 Seamless Integration: Easy integration with experiment tracking and observability platforms
  • 🚀 Async-First Design: High-performance async evaluation with parallel processing
  • 🔧 Extensible Architecture: Easy to add new evaluators and metrics for any use case
  • 🤖 LLM as a Judge: Advanced language models for nuanced, contextual evaluation
  • 📐 Traditional Metrics: Support for classical evaluation metrics and custom scoring functions
  • 🔗 Popular Evaluator Integration: Integration with popular evaluators such as RAGAS, DeepEval, and LangChain
  • Zero-Config Start: Get started with sensible defaults, customize as needed

Installation

Prerequisites

Mandatory:

  1. Python 3.11+ — Install here
  2. pip — Install here
  3. uv — Install here
  4. gcloud CLI (for authentication) — Install here, then log in using:
    gcloud auth login
    

Install from Artifact

Because gllm-evals is a private library hosted in a secure Google Cloud repository, you must provide an access token to install it. The command below handles this authorization inline by using an access token from the gcloud CLI.

uv pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ gllm-evals

Local Development Setup

Prerequisites

  1. Python 3.11+ — Install here

  2. pip — Install here

  3. uv — Install here

  4. gcloud CLI — Install here, then log in using:

    gcloud auth login
    
  5. Git — Install here

  6. Access to the GDP Labs SDK GitHub repository


1. Clone Repository

git clone git@github.com:GDP-ADMIN/gl-sdk.git
cd gl-sdk/libs/gllm-evals

2. Setup Authentication

Because gllm-evals is a private library, you first need to configure uv to authenticate with our secure Google Cloud repositories. Set the following environment variables to authenticate with internal package indexes:

export UV_INDEX_GEN_AI_INTERNAL_USERNAME=oauth2accesstoken
export UV_INDEX_GEN_AI_INTERNAL_PASSWORD="$(gcloud auth print-access-token)"

3. Quick Setup

Run:

make setup

4. Activate Virtual Environment

source .venv/bin/activate

Local Development Utilities

The following Makefile commands are available for quick operations:

Install uv

make install-uv

Install Pre-Commit

make install-pre-commit

Install Dependencies

make install

Update Dependencies

make update

Run Tests

make test

Adding the Package

Once authorization is configured, you can add gllm-evals to your project:

uv add gllm-evals

Dependencies

The SDK requires:

  • gllm-core and gllm-inference for LLM interactions
  • pydantic for data validation

Quick Start

Basic Usage

import asyncio
import os
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator

async def main():
    # Initialize the evaluator
    evaluator = GEvalGenerationEvaluator(
        model_credentials=os.getenv("GOOGLE_API_KEY")
    )

    # Prepare evaluation data
    data = {
        "query": "What is the capital of France?",
        "expected_response": "Paris is the capital of France.",
        "generated_response": "The capital of France is Paris.",
        "retrieved_context": "Paris is the capital and largest city of France."
    }

    # Evaluate
    result = await evaluator.evaluate(data)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())

Batch Evaluation

import asyncio
import os
from gllm_evals.dataset.dict_dataset import DictDataset
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.runner import Runner
from gllm_evals.experiment_tracker.csv_experiment_tracker import CSVExperimentTracker

async def batch_evaluation():
    # Initialize evaluator
    evaluator = GEvalGenerationEvaluator(
        model_credentials=os.getenv("GOOGLE_API_KEY"),
        run_parallel=True  # Enable parallel processing
    )

    # Create dataset
    dataset = DictDataset([
        {
            "query": "What is the capital of France?",
            "expected_response": "Paris",
            "generated_response": "Paris is the capital of France.",
            "retrieved_context": "Paris is the capital of France."
        },
        {
            "query": "What is 1 + 1?",
            "expected_response": "2",
            "generated_response": "The answer is 2.",
            "retrieved_context": "1 + 1 equals 2."
        }
    ])

    # Run evaluation
    runner = Runner(evaluator, batch_size=10)
    results = await runner.evaluate(dataset)

    # Track results
    tracker = CSVExperimentTracker(score_key="generation/score")
    tracker.log_batch(results)

    print(f"Evaluation Results: {tracker.get_results()}")

if __name__ == "__main__":
    asyncio.run(batch_evaluation())

Custom Metrics

Create domain-specific metrics easily:

from gllm_evals.metrics.metric import BaseMetric
from gllm_evals.types import MetricInput, MetricOutput

class DomainSpecificMetric(BaseMetric):
    """Custom metric for domain-specific evaluation."""

    name = "domain_accuracy"

    async def _evaluate(self, data: MetricInput) -> MetricOutput:
        # Your domain-specific evaluation logic
        score = self.calculate_domain_score(data)
        return {"score": score, "explanation": "Domain-specific reasoning"}

Architecture

Core Components

1. Evaluators

  • BaseEvaluator: Abstract base class for all evaluators - extend for any evaluation scenario
  • GEvalGenerationEvaluator: Production-ready GEval-backed evaluator for text generation quality with rule-based scoring

2. Metrics

  • BaseMetric: Abstract base class for metrics - create custom metrics for any domain
  • LMBasedMetric: Generic LM-powered metric evaluation with customizable prompts

3. Datasets

  • BaseDataset: Abstract base class for datasets - support any data format
  • DictDataset: Simple dictionary-based dataset implementation

4. Runner

  • Runner: Runner class for batch evaluation

Metrics

Below is a list of metrics that are currently supported by the SDK.

Metric Description Type Score Range
LMBasedMetric An all purpose metric that can be used to evaluate any metric that can be expressed as a LM prompt LM-based -
DeepEvalGEvalMetric A versatile evaluation metric framework that can be used to create custom evaluation metrics with configurable criteria, evaluation steps, and rubrics LM-based -
GEvalCompletenessMetric A metric that can be used to evaluate the completeness of the generated output DeepEval GEval 1-3
GEvalRedundancyMetric A metric that can be used to evaluate the redundancy of the generated output DeepEval GEval 1-3
GEvalGroundednessMetric A metric that can be used to evaluate the groundedness of the generated output DeepEval GEval 1-3
GEvalLanguageConsistencyMetric A metric that can be used to evaluate language consistency between query and generated response DeepEval GEval 0-1
GEvalRefusalMetric A metric that can be used to evaluate refusal behavior from query and expected response DeepEval GEval 0-1
GEvalRefusalAlignmentMetric A metric that can be used to evaluate refusal alignment between expected and generated responses DeepEval GEval 0-1

Evaluators

Below is a list of evaluators that are currently supported by the SDK.

Evaluator Description Type
GEvalGenerationEvaluator An evaluator that can be used to evaluate the quality of the generated output LLM-based

Datasets

Below is a list of datasets that are currently supported by the SDK.

Dataset Description
DictDataset A dataset that loads data from a dictionary
HuggingFaceDataset A dataset that loads data from a HuggingFace dataset

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

gllm_evals_binary-0.1.14-cp313-cp313-win_amd64.whl (2.2 MB view details)

Uploaded CPython 3.13Windows x86-64

gllm_evals_binary-0.1.14-cp313-cp313-manylinux_2_31_x86_64.whl (3.2 MB view details)

Uploaded CPython 3.13manylinux: glibc 2.31+ x86-64

gllm_evals_binary-0.1.14-cp313-cp313-macosx_13_0_arm64.whl (2.5 MB view details)

Uploaded CPython 3.13macOS 13.0+ ARM64

gllm_evals_binary-0.1.14-cp312-cp312-win_amd64.whl (2.2 MB view details)

Uploaded CPython 3.12Windows x86-64

gllm_evals_binary-0.1.14-cp312-cp312-manylinux_2_31_x86_64.whl (3.2 MB view details)

Uploaded CPython 3.12manylinux: glibc 2.31+ x86-64

gllm_evals_binary-0.1.14-cp312-cp312-macosx_13_0_arm64.whl (2.5 MB view details)

Uploaded CPython 3.12macOS 13.0+ ARM64

gllm_evals_binary-0.1.14-cp311-cp311-win_amd64.whl (2.3 MB view details)

Uploaded CPython 3.11Windows x86-64

gllm_evals_binary-0.1.14-cp311-cp311-manylinux_2_31_x86_64.whl (3.0 MB view details)

Uploaded CPython 3.11manylinux: glibc 2.31+ x86-64

gllm_evals_binary-0.1.14-cp311-cp311-macosx_13_0_arm64.whl (2.5 MB view details)

Uploaded CPython 3.11macOS 13.0+ ARM64

File details

Details for the file gllm_evals_binary-0.1.14-cp313-cp313-win_amd64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.14-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 5d44d2c1f2f674ceb121c38d0b13b526c91ac856777f63b62cbc832721f9b7cc
MD5 3f9079e33ea7e5ca1ebddc9d1103f9be
BLAKE2b-256 c1093b34ac7d9db2b938e01d61e4cf2b4d2d52e3f003ccab17466cfe6a915db7

See more details on using hashes here.

Provenance

The following attestation bundles were made for gllm_evals_binary-0.1.14-cp313-cp313-win_amd64.whl:

Publisher: build-binary.yml on GDP-ADMIN/gl-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file gllm_evals_binary-0.1.14-cp313-cp313-manylinux_2_31_x86_64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.14-cp313-cp313-manylinux_2_31_x86_64.whl
Algorithm Hash digest
SHA256 e23e3516a3717079850ed61f5aa056481217b3af583bbca62af66ee41e69e384
MD5 8e2912bb59743cbd67632002661c0dfc
BLAKE2b-256 25da4e0f469a7ef70474ab105f5676e2f2e4ff226f62447be6c21ab25a107965

See more details on using hashes here.

File details

Details for the file gllm_evals_binary-0.1.14-cp313-cp313-macosx_13_0_arm64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.14-cp313-cp313-macosx_13_0_arm64.whl
Algorithm Hash digest
SHA256 f997e1d50fb1fb0ef44b35700ad6fe7f6060001990fce4c893a55b233030e06b
MD5 1d0ef5af96c2021c7811567c13613dc7
BLAKE2b-256 088325fcbd9b4c340076718f9d91f0bc72dd2d7762f0bc18bdf151e04ff3b6dc

See more details on using hashes here.

Provenance

The following attestation bundles were made for gllm_evals_binary-0.1.14-cp313-cp313-macosx_13_0_arm64.whl:

Publisher: build-binary.yml on GDP-ADMIN/gl-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file gllm_evals_binary-0.1.14-cp312-cp312-win_amd64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.14-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 04b140ddd12035c1962ec39f2333ef1b158a1eca60b403da797d3f5a2eaff8c7
MD5 d53fe3a0098136aae8a618644ed6ece6
BLAKE2b-256 e962af3e44b171dbc4659ac359cde670775962eea9f595a909783cd08d09350b

See more details on using hashes here.

Provenance

The following attestation bundles were made for gllm_evals_binary-0.1.14-cp312-cp312-win_amd64.whl:

Publisher: build-binary.yml on GDP-ADMIN/gl-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file gllm_evals_binary-0.1.14-cp312-cp312-manylinux_2_31_x86_64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.14-cp312-cp312-manylinux_2_31_x86_64.whl
Algorithm Hash digest
SHA256 3d315a901e053aa9d095a06980a51c9d7755d2fd3d705f7ecbb01d823237eb66
MD5 c5432ea4ea77f4526a3cb5fc9173df63
BLAKE2b-256 eb341706b701fb0a36d4aaf354dc0c2531118c0813f83663b473bb9e568a4466

See more details on using hashes here.

File details

Details for the file gllm_evals_binary-0.1.14-cp312-cp312-macosx_13_0_arm64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.14-cp312-cp312-macosx_13_0_arm64.whl
Algorithm Hash digest
SHA256 9937fc764d161fac74349b2043b1674a44bccd853555d19744a061e2e5d44267
MD5 959c065b23a0fc9dbcda965050d1cf91
BLAKE2b-256 d066dae68f7e9a9172b8928f6bb29a946f5e7e592594d9f2e56be4be929320bc

See more details on using hashes here.

Provenance

The following attestation bundles were made for gllm_evals_binary-0.1.14-cp312-cp312-macosx_13_0_arm64.whl:

Publisher: build-binary.yml on GDP-ADMIN/gl-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file gllm_evals_binary-0.1.14-cp311-cp311-win_amd64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.14-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 35c4449c7cef62fe1b3fb627995553b68c9be83178fab7dd8dce0410d1a554e2
MD5 94cbbbe886ad7e2cc0adfb468a989908
BLAKE2b-256 f91d6d98514072674a32d62e113f2a46f9b4add3302270ee7b4cfca6b8f8330c

See more details on using hashes here.

Provenance

The following attestation bundles were made for gllm_evals_binary-0.1.14-cp311-cp311-win_amd64.whl:

Publisher: build-binary.yml on GDP-ADMIN/gl-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file gllm_evals_binary-0.1.14-cp311-cp311-manylinux_2_31_x86_64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.14-cp311-cp311-manylinux_2_31_x86_64.whl
Algorithm Hash digest
SHA256 60539f2e90209ff20d3d8f1ca9593ed0764d59e0c6ec1955ac7d2609d8c72a83
MD5 0e2d713a795f6a1b12a92c847f3f6eb0
BLAKE2b-256 8a1aad5e3154fb5e9dca9f8edb8751d3418c8fd5a435475fa8f4e46f338d1e24

See more details on using hashes here.

File details

Details for the file gllm_evals_binary-0.1.14-cp311-cp311-macosx_13_0_arm64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.14-cp311-cp311-macosx_13_0_arm64.whl
Algorithm Hash digest
SHA256 db5b95dd64cd27ecc48de4b2cc364b32f75ec5e666e282eb1d107ee1d6e96603
MD5 6b700a1ee450081664d361066eeb2067
BLAKE2b-256 ea5187822c9d4a3541a67a65ac56e11a533e8a96971ea0030543f78ea03fb299

See more details on using hashes here.

Provenance

The following attestation bundles were made for gllm_evals_binary-0.1.14-cp311-cp311-macosx_13_0_arm64.whl:

Publisher: build-binary.yml on GDP-ADMIN/gl-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page