Skip to main content

A library for evaluating LLM-based applications.

Project description

GLLM Evaluator SDK

A comprehensive evaluation framework for Generative AI applications including LLM outputs, AI Agent responses, and RAG (Retrieval-Augmented Generation) systems.

Overview

The GLLM Evaluator SDK provides a robust, extensible framework designed to make AI evaluation as simple and seamless as possible across the GDP Labs ecosystem. Built with integration-first philosophy, it enables teams to easily assess the quality of generated content from any AI system while seamlessly connecting with experiment tracking and observability platforms.

Philosophy

Easy Evaluation Everywhere: Standardize evaluation practices across all GDP Labs AI applications with minimal setup and maximum flexibility.

Integration-First Design: Built to work seamlessly with your existing experiment tracking, observability, and MLOps infrastructure.

Extensible by Design: Add new evaluators, metrics, and integrations without breaking existing workflows.

Key Features

  • 🌐 GDP Labs Ecosystem Ready: Standardized evaluation framework across all internal AI applications
  • 🔌 Seamless Integration: Easy integration with experiment tracking and observability platforms
  • 🚀 Async-First Design: High-performance async evaluation with parallel processing
  • 🔧 Extensible Architecture: Easy to add new evaluators and metrics for any use case
  • 🤖 LLM as a Judge: Advanced language models for nuanced, contextual evaluation
  • 📐 Traditional Metrics: Support for classical evaluation metrics and custom scoring functions
  • 🔗 Popular Evaluator Integration: Integration with popular evaluators such as RAGAS, DeepEval, and LangChain
  • Zero-Config Start: Get started with sensible defaults, customize as needed

Installation

Prerequisites

Mandatory:

  1. Python 3.11+ — Install here
  2. pip — Install here
  3. uv — Install here
  4. gcloud CLI (for authentication) — Install here, then log in using:
    gcloud auth login
    

Install from Artifact

Because gllm-evals is a private library hosted in a secure Google Cloud repository, you must provide an access token to install it. The command below handles this authorization inline by using an access token from the gcloud CLI.

uv pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ gllm-evals

Local Development Setup

Prerequisites

  1. Python 3.11+ — Install here

  2. pip — Install here

  3. uv — Install here

  4. gcloud CLI — Install here, then log in using:

    gcloud auth login
    
  5. Git — Install here

  6. Access to the GDP Labs SDK GitHub repository


1. Clone Repository

git clone git@github.com:GDP-ADMIN/gl-sdk.git
cd gl-sdk/libs/gllm-evals

2. Setup Authentication

Because gllm-evals is a private library, you first need to configure uv to authenticate with our secure Google Cloud repositories. Set the following environment variables to authenticate with internal package indexes:

export UV_INDEX_GEN_AI_INTERNAL_USERNAME=oauth2accesstoken
export UV_INDEX_GEN_AI_INTERNAL_PASSWORD="$(gcloud auth print-access-token)"

3. Quick Setup

Run:

make setup

4. Activate Virtual Environment

source .venv/bin/activate

Local Development Utilities

The following Makefile commands are available for quick operations:

Install uv

make install-uv

Install Pre-Commit

make install-pre-commit

Install Dependencies

make install

Update Dependencies

make update

Run Tests

make test

Adding the Package

Once authorization is configured, you can add gllm-evals to your project:

uv add gllm-evals

Dependencies

The SDK requires:

  • gllm-core and gllm-inference for LLM interactions
  • pydantic for data validation

Quick Start

Basic Usage

import asyncio
import os
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator

async def main():
    # Initialize the evaluator
    evaluator = GEvalGenerationEvaluator(
        model_credentials=os.getenv("GOOGLE_API_KEY")
    )

    # Prepare evaluation data
    data = {
        "query": "What is the capital of France?",
        "expected_response": "Paris is the capital of France.",
        "generated_response": "The capital of France is Paris.",
        "retrieved_context": "Paris is the capital and largest city of France."
    }

    # Evaluate
    result = await evaluator.evaluate(data)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())

Batch Evaluation

import asyncio
import os
from gllm_evals.dataset.dict_dataset import DictDataset
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.runner import Runner
from gllm_evals.experiment_tracker.csv_experiment_tracker import CSVExperimentTracker

async def batch_evaluation():
    # Initialize evaluator
    evaluator = GEvalGenerationEvaluator(
        model_credentials=os.getenv("GOOGLE_API_KEY"),
        run_parallel=True  # Enable parallel processing
    )

    # Create dataset
    dataset = DictDataset([
        {
            "query": "What is the capital of France?",
            "expected_response": "Paris",
            "generated_response": "Paris is the capital of France.",
            "retrieved_context": "Paris is the capital of France."
        },
        {
            "query": "What is 1 + 1?",
            "expected_response": "2",
            "generated_response": "The answer is 2.",
            "retrieved_context": "1 + 1 equals 2."
        }
    ])

    # Run evaluation
    runner = Runner(evaluator, batch_size=10)
    results = await runner.evaluate(dataset)

    # Track results
    tracker = CSVExperimentTracker(score_key="generation/score")
    tracker.log_batch(results)

    print(f"Evaluation Results: {tracker.get_results()}")

if __name__ == "__main__":
    asyncio.run(batch_evaluation())

Custom Metrics

Create domain-specific metrics easily:

from gllm_evals.metrics.metric import BaseMetric
from gllm_evals.types import MetricInput, MetricOutput

class DomainSpecificMetric(BaseMetric):
    """Custom metric for domain-specific evaluation."""

    name = "domain_accuracy"

    async def _evaluate(self, data: MetricInput) -> MetricOutput:
        # Your domain-specific evaluation logic
        score = self.calculate_domain_score(data)
        return {"score": score, "explanation": "Domain-specific reasoning"}

Architecture

Core Components

1. Evaluators

  • BaseEvaluator: Abstract base class for all evaluators - extend for any evaluation scenario
  • GEvalGenerationEvaluator: Production-ready GEval-backed evaluator for text generation quality with rule-based scoring

2. Metrics

  • BaseMetric: Abstract base class for metrics - create custom metrics for any domain
  • LMBasedMetric: Generic LM-powered metric evaluation with customizable prompts

3. Datasets

  • BaseDataset: Abstract base class for datasets - support any data format
  • DictDataset: Simple dictionary-based dataset implementation

4. Runner

  • Runner: Runner class for batch evaluation

Metrics

Below is a list of metrics that are currently supported by the SDK.

Metric Description Type Score Range
LMBasedMetric An all purpose metric that can be used to evaluate any metric that can be expressed as a LM prompt LM-based -
DeepEvalGEvalMetric A versatile evaluation metric framework that can be used to create custom evaluation metrics with configurable criteria, evaluation steps, and rubrics LM-based -
GEvalCompletenessMetric A metric that can be used to evaluate the completeness of the generated output DeepEval GEval 1-3
GEvalRedundancyMetric A metric that can be used to evaluate the redundancy of the generated output DeepEval GEval 1-3
GEvalGroundednessMetric A metric that can be used to evaluate the groundedness of the generated output DeepEval GEval 1-3
GEvalLanguageConsistencyMetric A metric that can be used to evaluate language consistency between query and generated response DeepEval GEval 0-1
GEvalRefusalMetric A metric that can be used to evaluate refusal behavior from query and expected response DeepEval GEval 0-1
GEvalRefusalAlignmentMetric A metric that can be used to evaluate refusal alignment between expected and generated responses DeepEval GEval 0-1

Evaluators

Below is a list of evaluators that are currently supported by the SDK.

Evaluator Description Type
GEvalGenerationEvaluator An evaluator that can be used to evaluate the quality of the generated output LLM-based

Datasets

Below is a list of datasets that are currently supported by the SDK.

Dataset Description
DictDataset A dataset that loads data from a dictionary
HuggingFaceDataset A dataset that loads data from a HuggingFace dataset

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

gllm_evals_binary-0.1.11-cp313-cp313-win_amd64.whl (2.0 MB view details)

Uploaded CPython 3.13Windows x86-64

gllm_evals_binary-0.1.11-cp313-cp313-manylinux_2_31_x86_64.whl (2.9 MB view details)

Uploaded CPython 3.13manylinux: glibc 2.31+ x86-64

gllm_evals_binary-0.1.11-cp313-cp313-macosx_13_0_arm64.whl (2.3 MB view details)

Uploaded CPython 3.13macOS 13.0+ ARM64

gllm_evals_binary-0.1.11-cp312-cp312-win_amd64.whl (2.0 MB view details)

Uploaded CPython 3.12Windows x86-64

gllm_evals_binary-0.1.11-cp312-cp312-manylinux_2_31_x86_64.whl (2.9 MB view details)

Uploaded CPython 3.12manylinux: glibc 2.31+ x86-64

gllm_evals_binary-0.1.11-cp312-cp312-macosx_13_0_arm64.whl (2.3 MB view details)

Uploaded CPython 3.12macOS 13.0+ ARM64

gllm_evals_binary-0.1.11-cp311-cp311-win_amd64.whl (2.1 MB view details)

Uploaded CPython 3.11Windows x86-64

gllm_evals_binary-0.1.11-cp311-cp311-manylinux_2_31_x86_64.whl (2.7 MB view details)

Uploaded CPython 3.11manylinux: glibc 2.31+ x86-64

gllm_evals_binary-0.1.11-cp311-cp311-macosx_13_0_arm64.whl (2.3 MB view details)

Uploaded CPython 3.11macOS 13.0+ ARM64

File details

Details for the file gllm_evals_binary-0.1.11-cp313-cp313-win_amd64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.11-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 7469e53722dea974946373e20c2ce125726f8b8ca5d17b7b31e652f710d62cc7
MD5 0d3bbb1ff2629db128c743d03f8317c7
BLAKE2b-256 e25fa46ae142d8b1296d2d80c1706d0c63a4822f82238f20775275f9d30e0175

See more details on using hashes here.

Provenance

The following attestation bundles were made for gllm_evals_binary-0.1.11-cp313-cp313-win_amd64.whl:

Publisher: build-binary.yml on GDP-ADMIN/gl-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file gllm_evals_binary-0.1.11-cp313-cp313-manylinux_2_31_x86_64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.11-cp313-cp313-manylinux_2_31_x86_64.whl
Algorithm Hash digest
SHA256 62d37b511a0bdecd8f724ab9661ba66163992de625925b8b365da60c118959ab
MD5 f3360e52050cd80ffddbd629be4c4e1a
BLAKE2b-256 388cc264237108a21a7c44364940e9b64dc3ab63b8ee40f2e40319d2735166f2

See more details on using hashes here.

File details

Details for the file gllm_evals_binary-0.1.11-cp313-cp313-macosx_13_0_arm64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.11-cp313-cp313-macosx_13_0_arm64.whl
Algorithm Hash digest
SHA256 de5da7047467996e426643a367021de38fce2b019e8d6c3352e27bc94209ad2c
MD5 0172b72bd718cd6e6c210bc046e1ab1b
BLAKE2b-256 93eab268b4109e402af4849f86cc850c7bd8c7a77ad7b31f4c050595bdaa47fc

See more details on using hashes here.

Provenance

The following attestation bundles were made for gllm_evals_binary-0.1.11-cp313-cp313-macosx_13_0_arm64.whl:

Publisher: build-binary.yml on GDP-ADMIN/gl-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file gllm_evals_binary-0.1.11-cp312-cp312-win_amd64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.11-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 5b4bbbb464069b1654aeb760cd91bd44b0a8d751d3e769f38cca245ddac79e31
MD5 681ad8ea07acb8f3d8d4f07df0d5f2e3
BLAKE2b-256 a28fdd9cac5281114d21d03495980849077c3f2af21d4e0e30e85d4126a584b1

See more details on using hashes here.

Provenance

The following attestation bundles were made for gllm_evals_binary-0.1.11-cp312-cp312-win_amd64.whl:

Publisher: build-binary.yml on GDP-ADMIN/gl-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file gllm_evals_binary-0.1.11-cp312-cp312-manylinux_2_31_x86_64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.11-cp312-cp312-manylinux_2_31_x86_64.whl
Algorithm Hash digest
SHA256 acc231b561cff5a6ca5ca2fadd782e3d40f0e0d1333fc50e4a77f3fcbe3fb0e9
MD5 85438325199423e459872ed13fc045d9
BLAKE2b-256 4ea835806f94ffed3c360409171d68cbd39337f15a7dd2ff9def329bebf66fba

See more details on using hashes here.

File details

Details for the file gllm_evals_binary-0.1.11-cp312-cp312-macosx_13_0_arm64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.11-cp312-cp312-macosx_13_0_arm64.whl
Algorithm Hash digest
SHA256 34f440e4624293bfd7377a1801d1a02a7300bb696d121a71cef8929f486b016c
MD5 abe90642021d72b0b9f54d1a826e36ea
BLAKE2b-256 fd7c8c52b5a94ac72b6124ca21c954c7bb37e87c8bb7d70475c2972756360eb9

See more details on using hashes here.

Provenance

The following attestation bundles were made for gllm_evals_binary-0.1.11-cp312-cp312-macosx_13_0_arm64.whl:

Publisher: build-binary.yml on GDP-ADMIN/gl-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file gllm_evals_binary-0.1.11-cp311-cp311-win_amd64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.11-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 0f4efd1f191cb63a6577b2f56eff6d877491529756949cd40de40887fc2c0a92
MD5 a1cc6de5469aa6ddaedd62f06169b007
BLAKE2b-256 0c9a5f9e8fc64fa1690edd35b3958402326205d5cd14c5504bd5fe79faaaf29f

See more details on using hashes here.

Provenance

The following attestation bundles were made for gllm_evals_binary-0.1.11-cp311-cp311-win_amd64.whl:

Publisher: build-binary.yml on GDP-ADMIN/gl-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file gllm_evals_binary-0.1.11-cp311-cp311-manylinux_2_31_x86_64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.11-cp311-cp311-manylinux_2_31_x86_64.whl
Algorithm Hash digest
SHA256 f7d5e7df18dcdc89f9e4890451f95def187dd5420d9ef3f7267497d734201982
MD5 a5be286e8c732b7c1671f1a9a7c2dab1
BLAKE2b-256 e811b5fa83131e1dff3b8cad54641d2aeb503f16bfb933bd7204bf393258a365

See more details on using hashes here.

File details

Details for the file gllm_evals_binary-0.1.11-cp311-cp311-macosx_13_0_arm64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.11-cp311-cp311-macosx_13_0_arm64.whl
Algorithm Hash digest
SHA256 17e3106d76bfb042724443b1d003ab1a34bf2f1b611ebfde9503eef2b01e0343
MD5 3793f68ab64d2741dc47941546ba8cab
BLAKE2b-256 7eb04a0baa4779d783826ac1f62ed890be517c482b6edfb422fa1fa6d27a0eb3

See more details on using hashes here.

Provenance

The following attestation bundles were made for gllm_evals_binary-0.1.11-cp311-cp311-macosx_13_0_arm64.whl:

Publisher: build-binary.yml on GDP-ADMIN/gl-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page