Skip to main content

A library for evaluating LLM-based applications.

Project description

GLLM Evaluator SDK

A comprehensive evaluation framework for Generative AI applications including LLM outputs, AI Agent responses, and RAG (Retrieval-Augmented Generation) systems.

Overview

The GLLM Evaluator SDK provides a robust, extensible framework designed to make AI evaluation as simple and seamless as possible across the GDP Labs ecosystem. Built with integration-first philosophy, it enables teams to easily assess the quality of generated content from any AI system while seamlessly connecting with experiment tracking and observability platforms.

Philosophy

Easy Evaluation Everywhere: Standardize evaluation practices across all GDP Labs AI applications with minimal setup and maximum flexibility.

Integration-First Design: Built to work seamlessly with your existing experiment tracking, observability, and MLOps infrastructure.

Extensible by Design: Add new evaluators, metrics, and integrations without breaking existing workflows.

Key Features

  • 🌐 GDP Labs Ecosystem Ready: Standardized evaluation framework across all internal AI applications
  • 🔌 Seamless Integration: Easy integration with experiment tracking and observability platforms
  • 🚀 Async-First Design: High-performance async evaluation with parallel processing
  • 🔧 Extensible Architecture: Easy to add new evaluators and metrics for any use case
  • 🤖 LLM as a Judge: Advanced language models for nuanced, contextual evaluation
  • 📐 Traditional Metrics: Support for classical evaluation metrics and custom scoring functions
  • 🔗 Popular Evaluator Integration: Integration with popular evaluators such as RAGAS, DeepEval, and LangChain
  • Zero-Config Start: Get started with sensible defaults, customize as needed

Installation

Prerequisites

Mandatory:

  1. Python 3.11+ — Install here
  2. pip — Install here
  3. uv — Install here
  4. gcloud CLI (for authentication) — Install here, then log in using:
    gcloud auth login
    

Install from Artifact

Because gllm-evals is a private library hosted in a secure Google Cloud repository, you must provide an access token to install it. The command below handles this authorization inline by using an access token from the gcloud CLI.

uv pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ gllm-evals

Local Development Setup

Prerequisites

  1. Python 3.11+ — Install here

  2. pip — Install here

  3. uv — Install here

  4. gcloud CLI — Install here, then log in using:

    gcloud auth login
    
  5. Git — Install here

  6. Access to the GDP Labs SDK GitHub repository


1. Clone Repository

git clone git@github.com:GDP-ADMIN/gl-sdk.git
cd gl-sdk/libs/gllm-evals

2. Setup Authentication

Because gllm-evals is a private library, you first need to configure uv to authenticate with our secure Google Cloud repositories. Set the following environment variables to authenticate with internal package indexes:

export UV_INDEX_GEN_AI_INTERNAL_USERNAME=oauth2accesstoken
export UV_INDEX_GEN_AI_INTERNAL_PASSWORD="$(gcloud auth print-access-token)"

3. Quick Setup

Run:

make setup

4. Activate Virtual Environment

source .venv/bin/activate

Local Development Utilities

The following Makefile commands are available for quick operations:

Install uv

make install-uv

Install Pre-Commit

make install-pre-commit

Install Dependencies

make install

Update Dependencies

make update

Run Tests

make test

Adding the Package

Once authorization is configured, you can add gllm-evals to your project:

uv add gllm-evals

Dependencies

The SDK requires:

  • gllm-core and gllm-inference for LLM interactions
  • pydantic for data validation

Quick Start

Basic Usage

import asyncio
import os
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator

async def main():
    # Initialize the evaluator
    evaluator = GEvalGenerationEvaluator(
        model_credentials=os.getenv("GOOGLE_API_KEY")
    )

    # Prepare evaluation data
    data = {
        "query": "What is the capital of France?",
        "expected_response": "Paris is the capital of France.",
        "generated_response": "The capital of France is Paris.",
        "retrieved_context": "Paris is the capital and largest city of France."
    }

    # Evaluate
    result = await evaluator.evaluate(data)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())

Batch Evaluation

import asyncio
import os
from gllm_evals.dataset.dict_dataset import DictDataset
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.runner import Runner
from gllm_evals.experiment_tracker.csv_experiment_tracker import CSVExperimentTracker

async def batch_evaluation():
    # Initialize evaluator
    evaluator = GEvalGenerationEvaluator(
        model_credentials=os.getenv("GOOGLE_API_KEY"),
        run_parallel=True  # Enable parallel processing
    )

    # Create dataset
    dataset = DictDataset([
        {
            "query": "What is the capital of France?",
            "expected_response": "Paris",
            "generated_response": "Paris is the capital of France.",
            "retrieved_context": "Paris is the capital of France."
        },
        {
            "query": "What is 1 + 1?",
            "expected_response": "2",
            "generated_response": "The answer is 2.",
            "retrieved_context": "1 + 1 equals 2."
        }
    ])

    # Run evaluation
    runner = Runner(evaluator, batch_size=10)
    results = await runner.evaluate(dataset)

    # Track results
    tracker = CSVExperimentTracker(score_key="generation/score")
    tracker.log_batch(results)

    print(f"Evaluation Results: {tracker.get_results()}")

if __name__ == "__main__":
    asyncio.run(batch_evaluation())

Custom Metrics

Create domain-specific metrics easily:

from gllm_evals.metrics.metric import BaseMetric
from gllm_evals.types import MetricInput, MetricOutput

class DomainSpecificMetric(BaseMetric):
    """Custom metric for domain-specific evaluation."""

    name = "domain_accuracy"

    async def _evaluate(self, data: MetricInput) -> MetricOutput:
        # Your domain-specific evaluation logic
        score = self.calculate_domain_score(data)
        return {"score": score, "explanation": "Domain-specific reasoning"}

Architecture

Core Components

1. Evaluators

  • BaseEvaluator: Abstract base class for all evaluators - extend for any evaluation scenario
  • GEvalGenerationEvaluator: Production-ready GEval-backed evaluator for text generation quality with rule-based scoring

2. Metrics

  • BaseMetric: Abstract base class for metrics - create custom metrics for any domain
  • LMBasedMetric: Generic LM-powered metric evaluation with customizable prompts

3. Datasets

  • BaseDataset: Abstract base class for datasets - support any data format
  • DictDataset: Simple dictionary-based dataset implementation

4. Runner

  • Runner: Runner class for batch evaluation

Metrics

Below is a list of metrics that are currently supported by the SDK.

Metric Description Type Score Range
LMBasedMetric An all purpose metric that can be used to evaluate any metric that can be expressed as a LM prompt LM-based -
DeepEvalGEvalMetric A versatile evaluation metric framework that can be used to create custom evaluation metrics with configurable criteria, evaluation steps, and rubrics LM-based -
GEvalCompletenessMetric A metric that can be used to evaluate the completeness of the generated output DeepEval GEval 1-3
GEvalRedundancyMetric A metric that can be used to evaluate the redundancy of the generated output DeepEval GEval 1-3
GEvalGroundednessMetric A metric that can be used to evaluate the groundedness of the generated output DeepEval GEval 1-3
GEvalLanguageConsistencyMetric A metric that can be used to evaluate language consistency between query and generated response DeepEval GEval 0-1
GEvalRefusalMetric A metric that can be used to evaluate refusal behavior from query and expected response DeepEval GEval 0-1
GEvalRefusalAlignmentMetric A metric that can be used to evaluate refusal alignment between expected and generated responses DeepEval GEval 0-1

Evaluators

Below is a list of evaluators that are currently supported by the SDK.

Evaluator Description Type
GEvalGenerationEvaluator An evaluator that can be used to evaluate the quality of the generated output LLM-based

Datasets

Below is a list of datasets that are currently supported by the SDK.

Dataset Description
DictDataset A dataset that loads data from a dictionary
HuggingFaceDataset A dataset that loads data from a HuggingFace dataset

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

gllm_evals_binary-0.1.12-cp313-cp313-win_amd64.whl (2.0 MB view details)

Uploaded CPython 3.13Windows x86-64

gllm_evals_binary-0.1.12-cp313-cp313-manylinux_2_31_x86_64.whl (3.0 MB view details)

Uploaded CPython 3.13manylinux: glibc 2.31+ x86-64

gllm_evals_binary-0.1.12-cp313-cp313-macosx_13_0_arm64.whl (2.3 MB view details)

Uploaded CPython 3.13macOS 13.0+ ARM64

gllm_evals_binary-0.1.12-cp312-cp312-win_amd64.whl (2.0 MB view details)

Uploaded CPython 3.12Windows x86-64

gllm_evals_binary-0.1.12-cp312-cp312-manylinux_2_31_x86_64.whl (3.0 MB view details)

Uploaded CPython 3.12manylinux: glibc 2.31+ x86-64

gllm_evals_binary-0.1.12-cp312-cp312-macosx_13_0_arm64.whl (2.3 MB view details)

Uploaded CPython 3.12macOS 13.0+ ARM64

gllm_evals_binary-0.1.12-cp311-cp311-win_amd64.whl (2.1 MB view details)

Uploaded CPython 3.11Windows x86-64

gllm_evals_binary-0.1.12-cp311-cp311-manylinux_2_31_x86_64.whl (2.7 MB view details)

Uploaded CPython 3.11manylinux: glibc 2.31+ x86-64

gllm_evals_binary-0.1.12-cp311-cp311-macosx_13_0_arm64.whl (2.3 MB view details)

Uploaded CPython 3.11macOS 13.0+ ARM64

File details

Details for the file gllm_evals_binary-0.1.12-cp313-cp313-win_amd64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.12-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 263fd245cb3b411c8d244bc006b952626d1ddebdfa251b9380a444b719917498
MD5 68adce8866526273b32166d9dedb752a
BLAKE2b-256 85dae1067e15b76f4730f5da6f045cf25d15bb5ef8d15d85c57b14f82a945f60

See more details on using hashes here.

Provenance

The following attestation bundles were made for gllm_evals_binary-0.1.12-cp313-cp313-win_amd64.whl:

Publisher: build-binary.yml on GDP-ADMIN/gl-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file gllm_evals_binary-0.1.12-cp313-cp313-manylinux_2_31_x86_64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.12-cp313-cp313-manylinux_2_31_x86_64.whl
Algorithm Hash digest
SHA256 ae12ae837b68dfa712588e92bd12a2e96aefa17b407c4a09ca1dbf145599faed
MD5 b64f5457505c8935e44ff859f79a2a2c
BLAKE2b-256 d3243bcf49936c912277cc468fc0653d3bb9018094a16d5cf24e62cbd19e1852

See more details on using hashes here.

File details

Details for the file gllm_evals_binary-0.1.12-cp313-cp313-macosx_13_0_arm64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.12-cp313-cp313-macosx_13_0_arm64.whl
Algorithm Hash digest
SHA256 d3f3e6864ca6751ad4d1c31a62da35dfff00b146e9a9ea52047cde5646b22267
MD5 a827e748e6716430102e504732b5643f
BLAKE2b-256 20e3d69b89306f8ef058cf7ecc3a6763794a868438e3326478ae524b369d3a4e

See more details on using hashes here.

Provenance

The following attestation bundles were made for gllm_evals_binary-0.1.12-cp313-cp313-macosx_13_0_arm64.whl:

Publisher: build-binary.yml on GDP-ADMIN/gl-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file gllm_evals_binary-0.1.12-cp312-cp312-win_amd64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.12-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 9ef23c88779450e41b892fe2f3e45210e398fc2048a37b07193c09f122a76fb6
MD5 024ebe21abe126c91d7e46f5bac83ef4
BLAKE2b-256 fb7baf305eab3520aa8cfec7a4c6e4ae5744b7d4c130a75289cfd0ae72c60002

See more details on using hashes here.

Provenance

The following attestation bundles were made for gllm_evals_binary-0.1.12-cp312-cp312-win_amd64.whl:

Publisher: build-binary.yml on GDP-ADMIN/gl-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file gllm_evals_binary-0.1.12-cp312-cp312-manylinux_2_31_x86_64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.12-cp312-cp312-manylinux_2_31_x86_64.whl
Algorithm Hash digest
SHA256 ebef629c920022635da58f45d428f212559576ca0bdfb31fb9ad01dd8e53f044
MD5 bc870b4e3fb716d07857c2e4f046c3b3
BLAKE2b-256 4cb40f1d4d69d29619bdccdf17c029d8a82d10c8a130f84063dc45efc15f5812

See more details on using hashes here.

File details

Details for the file gllm_evals_binary-0.1.12-cp312-cp312-macosx_13_0_arm64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.12-cp312-cp312-macosx_13_0_arm64.whl
Algorithm Hash digest
SHA256 b1b020286c2a9f938c5c2512a1cbc7c2d5cf4e34fc795a445614204b63b3d5f3
MD5 052b366be6b3898dfd17819595270042
BLAKE2b-256 702f4e808944b29255facbd89470684f3426da18777cb43f09eb0d9eb2b5dab6

See more details on using hashes here.

Provenance

The following attestation bundles were made for gllm_evals_binary-0.1.12-cp312-cp312-macosx_13_0_arm64.whl:

Publisher: build-binary.yml on GDP-ADMIN/gl-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file gllm_evals_binary-0.1.12-cp311-cp311-win_amd64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.12-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 90057968937d39738d5e55028998cc679180b5d046f5f51ae063d9646507200a
MD5 1967c1eed486e564e992b6cb430fab39
BLAKE2b-256 1d89359c4357357837c4de91cd7300b5a048cd2bfdc1c91b16729add4d925370

See more details on using hashes here.

Provenance

The following attestation bundles were made for gllm_evals_binary-0.1.12-cp311-cp311-win_amd64.whl:

Publisher: build-binary.yml on GDP-ADMIN/gl-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file gllm_evals_binary-0.1.12-cp311-cp311-manylinux_2_31_x86_64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.12-cp311-cp311-manylinux_2_31_x86_64.whl
Algorithm Hash digest
SHA256 a303bf1c72edb4e8a80ec7657bee6af1120986d220d98cbc72fdcfa34dab01b9
MD5 2953f35389f969f900733328b06be79a
BLAKE2b-256 a0b07de4c56e055cbc1d5c4a84133370e12362e129e60ae05f5a32b7f39b9ef0

See more details on using hashes here.

File details

Details for the file gllm_evals_binary-0.1.12-cp311-cp311-macosx_13_0_arm64.whl.

File metadata

File hashes

Hashes for gllm_evals_binary-0.1.12-cp311-cp311-macosx_13_0_arm64.whl
Algorithm Hash digest
SHA256 87af3000f971b92d2a4efc52559612370fdd72eaf7b117bed2145a003bc78332
MD5 a21f1de05c5bc3d715e876a3027fbaba
BLAKE2b-256 523fd810c36f326760d3fdb73f3cd23b776d8131055d5c355d5725b462906a7d

See more details on using hashes here.

Provenance

The following attestation bundles were made for gllm_evals_binary-0.1.12-cp311-cp311-macosx_13_0_arm64.whl:

Publisher: build-binary.yml on GDP-ADMIN/gl-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page