Skip to main content

A library for evaluating LLM-based applications.

Project description

GLLM Evaluator SDK

A comprehensive evaluation framework for Generative AI applications including LLM outputs, AI Agent responses, and RAG (Retrieval-Augmented Generation) systems.

Overview

The GLLM Evaluator SDK provides a robust, extensible framework designed to make AI evaluation as simple and seamless as possible across the GDP Labs ecosystem. Built with integration-first philosophy, it enables teams to easily assess the quality of generated content from any AI system while seamlessly connecting with experiment tracking and observability platforms.

Philosophy

Easy Evaluation Everywhere: Standardize evaluation practices across all GDP Labs AI applications with minimal setup and maximum flexibility.

Integration-First Design: Built to work seamlessly with your existing experiment tracking, observability, and MLOps infrastructure.

Extensible by Design: Add new evaluators, metrics, and integrations without breaking existing workflows.

Key Features

  • 🌐 GDP Labs Ecosystem Ready: Standardized evaluation framework across all internal AI applications
  • 🔌 Seamless Integration: Easy integration with experiment tracking and observability platforms
  • 🚀 Async-First Design: High-performance async evaluation with parallel processing
  • 🔧 Extensible Architecture: Easy to add new evaluators and metrics for any use case
  • 🤖 LLM as a Judge: Advanced language models for nuanced, contextual evaluation
  • 📐 Traditional Metrics: Support for classical evaluation metrics and custom scoring functions
  • 🔗 Popular Evaluator Integration: Integration with popular evaluators such as RAGAS, DeepEval, and LangChain
  • Zero-Config Start: Get started with sensible defaults, customize as needed

Installation

Prerequisites

Mandatory:

  1. Python 3.11+ — Install here
  2. pip — Install here
  3. uv — Install here
  4. gcloud CLI (for authentication) — Install here, then log in using:
    gcloud auth login
    

Install from Artifact

Because gllm-evals is a private library hosted in a secure Google Cloud repository, you must provide an access token to install it. The command below handles this authorization inline by using an access token from the gcloud CLI.

uv pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ gllm-evals

Local Development Setup

Prerequisites

  1. Python 3.11+ — Install here

  2. pip — Install here

  3. uv — Install here

  4. gcloud CLI — Install here, then log in using:

    gcloud auth login
    
  5. Git — Install here

  6. Access to the GDP Labs SDK GitHub repository


1. Clone Repository

git clone git@github.com:GDP-ADMIN/gl-sdk.git
cd gl-sdk/libs/gllm-evals

2. Setup Authentication

Because gllm-evals is a private library, you first need to configure uv to authenticate with our secure Google Cloud repositories. Set the following environment variables to authenticate with internal package indexes:

export UV_INDEX_GEN_AI_INTERNAL_USERNAME=oauth2accesstoken
export UV_INDEX_GEN_AI_INTERNAL_PASSWORD="$(gcloud auth print-access-token)"

3. Quick Setup

Run:

make setup

4. Activate Virtual Environment

source .venv/bin/activate

Local Development Utilities

The following Makefile commands are available for quick operations:

Install uv

make install-uv

Install Pre-Commit

make install-pre-commit

Install Dependencies

make install

Update Dependencies

make update

Run Tests

make test

Adding the Package

Once authorization is configured, you can add gllm-evals to your project:

uv add gllm-evals

Dependencies

The SDK requires:

  • gllm-core and gllm-inference for LLM interactions
  • pydantic for data validation

Quick Start

Basic Usage

import asyncio
import os
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator

async def main():
    # Initialize the evaluator
    evaluator = GEvalGenerationEvaluator(
        model_credentials=os.getenv("GOOGLE_API_KEY")
    )

    # Prepare evaluation data
    data = {
        "query": "What is the capital of France?",
        "expected_response": "Paris is the capital of France.",
        "generated_response": "The capital of France is Paris.",
        "retrieved_context": "Paris is the capital and largest city of France."
    }

    # Evaluate
    result = await evaluator.evaluate(data)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())

Batch Evaluation

import asyncio
import os
from gllm_evals.dataset.dict_dataset import DictDataset
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.runner import Runner
from gllm_evals.experiment_tracker.csv_experiment_tracker import CSVExperimentTracker

async def batch_evaluation():
    # Initialize evaluator
    evaluator = GEvalGenerationEvaluator(
        model_credentials=os.getenv("GOOGLE_API_KEY"),
        run_parallel=True  # Enable parallel processing
    )

    # Create dataset
    dataset = DictDataset([
        {
            "query": "What is the capital of France?",
            "expected_response": "Paris",
            "generated_response": "Paris is the capital of France.",
            "retrieved_context": "Paris is the capital of France."
        },
        {
            "query": "What is 1 + 1?",
            "expected_response": "2",
            "generated_response": "The answer is 2.",
            "retrieved_context": "1 + 1 equals 2."
        }
    ])

    # Run evaluation
    runner = Runner(evaluator, batch_size=10)
    results = await runner.evaluate(dataset)

    # Track results
    tracker = CSVExperimentTracker(score_key="generation/score")
    tracker.log_batch(results)

    print(f"Evaluation Results: {tracker.get_results()}")

if __name__ == "__main__":
    asyncio.run(batch_evaluation())

Custom Metrics

Create domain-specific metrics easily:

from gllm_evals.metrics.metric import BaseMetric
from gllm_evals.types import MetricInput, MetricOutput

class DomainSpecificMetric(BaseMetric):
    """Custom metric for domain-specific evaluation."""

    name = "domain_accuracy"

    async def _evaluate(self, data: MetricInput) -> MetricOutput:
        # Your domain-specific evaluation logic
        score = self.calculate_domain_score(data)
        return {"score": score, "explanation": "Domain-specific reasoning"}

Architecture

Core Components

1. Evaluators

  • BaseEvaluator: Abstract base class for all evaluators - extend for any evaluation scenario
  • GEvalGenerationEvaluator: Production-ready GEval-backed evaluator for text generation quality with rule-based scoring

2. Metrics

  • BaseMetric: Abstract base class for metrics - create custom metrics for any domain
  • LMBasedMetric: Generic LM-powered metric evaluation with customizable prompts

3. Datasets

  • BaseDataset: Abstract base class for datasets - support any data format
  • DictDataset: Simple dictionary-based dataset implementation

4. Runner

  • Runner: Runner class for batch evaluation

Metrics

Below is a list of metrics that are currently supported by the SDK.

Metric Description Type Score Range
LMBasedMetric An all purpose metric that can be used to evaluate any metric that can be expressed as a LM prompt LM-based -
DeepEvalGEvalMetric A versatile evaluation metric framework that can be used to create custom evaluation metrics with configurable criteria, evaluation steps, and rubrics LM-based -
GEvalCompletenessMetric A metric that can be used to evaluate the completeness of the generated output DeepEval GEval 1-3
GEvalRedundancyMetric A metric that can be used to evaluate the redundancy of the generated output DeepEval GEval 1-3
GEvalGroundednessMetric A metric that can be used to evaluate the groundedness of the generated output DeepEval GEval 1-3
GEvalLanguageConsistencyMetric A metric that can be used to evaluate language consistency between query and generated response DeepEval GEval 0-1
GEvalRefusalMetric A metric that can be used to evaluate refusal behavior from query and expected response DeepEval GEval 0-1
GEvalRefusalAlignmentMetric A metric that can be used to evaluate refusal alignment between expected and generated responses DeepEval GEval 0-1

Evaluators

Below is a list of evaluators that are currently supported by the SDK.

Evaluator Description Type
GEvalGenerationEvaluator An evaluator that can be used to evaluate the quality of the generated output LLM-based

Datasets

Below is a list of datasets that are currently supported by the SDK.

Dataset Description
DictDataset A dataset that loads data from a dictionary
HuggingFaceDataset A dataset that loads data from a HuggingFace dataset

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

gllm_evals_binary-0.1.9.post1-cp313-cp313-win_amd64.whl (2.0 MB view details)

Uploaded CPython 3.13Windows x86-64

gllm_evals_binary-0.1.9.post1-cp313-cp313-manylinux_2_31_x86_64.whl (2.9 MB view details)

Uploaded CPython 3.13manylinux: glibc 2.31+ x86-64

gllm_evals_binary-0.1.9.post1-cp313-cp313-macosx_13_0_arm64.whl (2.3 MB view details)

Uploaded CPython 3.13macOS 13.0+ ARM64

gllm_evals_binary-0.1.9.post1-cp312-cp312-win_amd64.whl (2.0 MB view details)

Uploaded CPython 3.12Windows x86-64

gllm_evals_binary-0.1.9.post1-cp312-cp312-manylinux_2_31_x86_64.whl (2.9 MB view details)

Uploaded CPython 3.12manylinux: glibc 2.31+ x86-64

gllm_evals_binary-0.1.9.post1-cp312-cp312-macosx_13_0_arm64.whl (2.3 MB view details)

Uploaded CPython 3.12macOS 13.0+ ARM64

gllm_evals_binary-0.1.9.post1-cp311-cp311-win_amd64.whl (2.1 MB view details)

Uploaded CPython 3.11Windows x86-64

gllm_evals_binary-0.1.9.post1-cp311-cp311-manylinux_2_31_x86_64.whl (2.7 MB view details)

Uploaded CPython 3.11manylinux: glibc 2.31+ x86-64

gllm_evals_binary-0.1.9.post1-cp311-cp311-macosx_13_0_arm64.whl (2.3 MB view details)

Uploaded CPython 3.11macOS 13.0+ ARM64

File details

Details for the file gllm_evals_binary-0.1.9.post1-cp313-cp313-win_amd64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.9.post1-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 4cae68aa85e0951877d97fb9c2b961499ae0c762b6cd36237f9bcb799f31491e
MD5 acdf90b941cc0ac2148979cdecfe19b4
BLAKE2b-256 dc9594e0c1d9ab6a65058c626b7397a8442eedc5eec06a81b0417650f25d5392

See more details on using hashes here.

Provenance

The following attestation bundles were made for gllm_evals_binary-0.1.9.post1-cp313-cp313-win_amd64.whl:

Publisher: build-binary.yml on GDP-ADMIN/gl-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file gllm_evals_binary-0.1.9.post1-cp313-cp313-manylinux_2_31_x86_64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.9.post1-cp313-cp313-manylinux_2_31_x86_64.whl
Algorithm Hash digest
SHA256 ddced48986b2183d06ff4ce879e80724ac881b7b5dd43e1d92fe21d2d86ce7b9
MD5 8a2bef301ca963e6a76a8948b5744bf0
BLAKE2b-256 e4146f965dcb8c4769c65cd3dd11434afafb66ba613b049b7573d1bf7b2e91bf

See more details on using hashes here.

File details

Details for the file gllm_evals_binary-0.1.9.post1-cp313-cp313-macosx_13_0_arm64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.9.post1-cp313-cp313-macosx_13_0_arm64.whl
Algorithm Hash digest
SHA256 fc3edbe40cf41876205b48d30dc9fb831e16f565488d1790989341f9e5608679
MD5 88a3d77b30eb6e2254f15a40f1e81caf
BLAKE2b-256 ae8c00e5537d0eac2ace8a57008b9f84a4c8710314a9b76b544853ad799503eb

See more details on using hashes here.

Provenance

The following attestation bundles were made for gllm_evals_binary-0.1.9.post1-cp313-cp313-macosx_13_0_arm64.whl:

Publisher: build-binary.yml on GDP-ADMIN/gl-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file gllm_evals_binary-0.1.9.post1-cp312-cp312-win_amd64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.9.post1-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 4799595a10c9f1b5aad1899ef627663392dcdd7fc1c297e6bf601bea19387a72
MD5 41485882079cd3cefee518d2dcabb2dd
BLAKE2b-256 56239349507f59fef77f3cfb1329ffbe32c53acf1edb92bf53b2c929730bb2f7

See more details on using hashes here.

Provenance

The following attestation bundles were made for gllm_evals_binary-0.1.9.post1-cp312-cp312-win_amd64.whl:

Publisher: build-binary.yml on GDP-ADMIN/gl-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file gllm_evals_binary-0.1.9.post1-cp312-cp312-manylinux_2_31_x86_64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.9.post1-cp312-cp312-manylinux_2_31_x86_64.whl
Algorithm Hash digest
SHA256 a82a185a4debd8b6ad883af4838fbd9b32cedccf12dd43e4157d36892c76b317
MD5 7a33fb26f6cc83fc1ac6a2d97f3410da
BLAKE2b-256 ad18417f0e5c7a04141741a9922b12f803d7022a04625b4c2976c09ac87a161c

See more details on using hashes here.

File details

Details for the file gllm_evals_binary-0.1.9.post1-cp312-cp312-macosx_13_0_arm64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.9.post1-cp312-cp312-macosx_13_0_arm64.whl
Algorithm Hash digest
SHA256 b135c8f92d3b5a2cbc59432137b853876cb235d7809213ae12aba5df6aecbb2b
MD5 010d88ec26c815da274195ca82585386
BLAKE2b-256 8aab4ae6f2c0763918c3de6930ff120a4d45b654a2480f575c4068f7df1dc2ee

See more details on using hashes here.

Provenance

The following attestation bundles were made for gllm_evals_binary-0.1.9.post1-cp312-cp312-macosx_13_0_arm64.whl:

Publisher: build-binary.yml on GDP-ADMIN/gl-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file gllm_evals_binary-0.1.9.post1-cp311-cp311-win_amd64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.9.post1-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 acb690a5682bf4799b0a661c47b569bc2631af3688bb25447f2627ecf68fa15e
MD5 f8c8f33546a949dcb5a7fa226ea32459
BLAKE2b-256 237ab6b7878b1439fe00ae5522a153f3386f0bb0842a8e5c77bcc2f2ff1b1506

See more details on using hashes here.

Provenance

The following attestation bundles were made for gllm_evals_binary-0.1.9.post1-cp311-cp311-win_amd64.whl:

Publisher: build-binary.yml on GDP-ADMIN/gl-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file gllm_evals_binary-0.1.9.post1-cp311-cp311-manylinux_2_31_x86_64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.9.post1-cp311-cp311-manylinux_2_31_x86_64.whl
Algorithm Hash digest
SHA256 2e83fb1eebbee8f31f27c5234b241227d5b7ef730b8d91f6ec78b31b69548b15
MD5 1bc012e0e7a34aaa0b190e68dc403356
BLAKE2b-256 8cc47b5884ad951b981c8789ab04e92b9646afba3cce0de190905eff4076eb8f

See more details on using hashes here.

File details

Details for the file gllm_evals_binary-0.1.9.post1-cp311-cp311-macosx_13_0_arm64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.9.post1-cp311-cp311-macosx_13_0_arm64.whl
Algorithm Hash digest
SHA256 dd082f0bcfda194bb7fd770373eaf99b7872ff2faba87141bd50338c859a4b27
MD5 4a9d22661cca29e55f3211504de47a5c
BLAKE2b-256 2c9f6409c1047ad554fb1f271d0b55fb84ec0cac4865185a924159ce119c3530

See more details on using hashes here.

Provenance

The following attestation bundles were made for gllm_evals_binary-0.1.9.post1-cp311-cp311-macosx_13_0_arm64.whl:

Publisher: build-binary.yml on GDP-ADMIN/gl-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page