Galtea software development kit

Reason this release was yanked:

This SDK version is no longer supported by the Galtea API. Please update to version 2.4.0.

Galtea SDK

Comprehensive AI/LLM Testing & Evaluation Framework

Overview

Galtea SDK empowers AI engineers, ML engineers and data scientists to rigorously test and evaluate their AI products. With a focus on reliability and transparency, Galtea offers:

  1. Automated Test Dataset Generation - Create comprehensive test datasets tailored to your AI product
  2. Sophisticated Product Evaluation - Evaluate your AI products across multiple dimensions

Installation

pip install galtea
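
Because this release was yanked in favor of 2.4.0, it can help to confirm which version is actually installed. The following is a minimal sketch using only the standard library. The helper names `installed_version` and `is_at_least` are illustrative and not part of the SDK, and the comparison assumes plain numeric versions with the same number of parts, such as `2.4.0`:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def _as_tuple(v: str) -> Tuple[int, ...]:
    # Assumes purely numeric dotted versions (e.g. "2.4.0");
    # pre-release tags such as "rc1" are not handled in this sketch.
    return tuple(int(part) for part in v.split("."))


def is_at_least(current: str, minimum: str) -> bool:
    """Return True if `current` is the same as or newer than `minimum`."""
    return _as_tuple(current) >= _as_tuple(minimum)


if __name__ == "__main__":
    found = installed_version("galtea")
    if found is None:
        print("galtea is not installed")
    elif not is_at_least(found, "2.4.0"):
        print(f"galtea {found} is installed; upgrade with: pip install --upgrade galtea")
    else:
        print(f"galtea {found} is installed")
```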

Development

Building the Project

This project uses Poetry for dependency management and packaging. To build the project:

poetry build

This will create distribution packages (wheel and source distribution) in the dist/ directory.

Development Setup

# Install dependencies
poetry install

# Activate the virtual environment
poetry shell

Quick Start

from galtea import Galtea
import os

# Initialize with your API key
galtea = Galtea(api_key=os.getenv("GALTEA_API_KEY"))

# Create a test, which is a collection of test cases
test = galtea.tests.create(
    name="factual-accuracy-test",
    type="QUALITY",
    product_id="your-product-id",
    ground_truth_file_path="path/to/ground-truth.pdf"
)

# Get test cases to iterate over
test_cases = galtea.test_cases.list(test.id)

# Create a version, which is a specific iteration of your product
version = galtea.versions.create(
    name="gpt-4-self-hosted-v1",
    product_id="your-product-id",
    description="Self-hosted GPT-4 equivalent model",
    endpoint="http://your-model-endpoint.com/v1/chat"
)

for test_case in test_cases:
    # Simulate a call to your product to get its output for a given test case
    # In a real scenario, you would call your actual product endpoint
    model_answer = f"The answer to '{test_case.input}' is..."

    # Run an evaluation task
    # An Evaluation is implicitly created to group these tasks
    galtea.evaluation_tasks.create_single_turn(
        metrics=["factual-accuracy", "coherence", "relevance"],
        version_id=version.id,
        test_case_id=test_case.id,
        actual_output=model_answer
    )
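
The Quick Start reads the API key with `os.getenv`, which returns `None` when the variable is unset. How the client reacts to a missing key is not stated here, so one option is to fail early with a clear message. This is a sketch, and `require_env` is a hypothetical helper, not an SDK function:

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Intended usage (assumes the galtea package is installed):
# from galtea import Galtea
# galtea = Galtea(api_key=require_env("GALTEA_API_KEY"))
```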

Core Features

1. Test Creation

  • Quality Tests: Assess response quality, coherence, and factual accuracy
  • Adversarial Tests: Stress-test your models against edge cases and potential vulnerabilities
  • Ground Truth Integration: Upload ground truth documents to validate factual responses
  • Custom Test Types: Define tests tailored to your specific use cases and requirements

# Create a test from your own ground truth document
test = galtea.tests.create(
    name="medical-knowledge-test",
    type="QUALITY",
    product_id="your-product-id",
    ground_truth_file_path="medical_reference.pdf"
)

2. Comprehensive Product Evaluation

Evaluate your AI products with sophisticated metrics:

  • Multi-dimensional Analysis: Analyze outputs across various dimensions including accuracy, relevance, and coherence
  • Customizable Metrics: Define your own evaluation criteria and rubrics
  • Batch Processing: Run evaluations on large datasets efficiently
  • Detailed Reports: Get comprehensive insights into your model's performance

# Define custom evaluation metrics
custom_metric = galtea.metrics.create(
    name="medical-accuracy",
    criteria="Assess if the response is medically accurate based on the provided context.",
    evaluation_params=["actual output", "context"]
)

# Run batch evaluation
import pandas as pd

# Load your test data
test_data = pd.read_json("medical_queries.json")

# 1. Create a session for this batch evaluation
session = galtea.sessions.create(version_id=version.id, is_production=True)

# 2. Log each interaction as an inference result
for _, row in test_data.iterrows():
    # Get the response from your product (your_product_function is a placeholder for your own code)
    model_response = your_product_function(row["query"], row["medical_context"])
    
    # Log each turn to the session
    galtea.inference_results.create(
        session_id=session.id,
        input=row["query"],
        output=model_response,
        retrieval_context=row["medical_context"]
    )

# 3. Evaluate the entire session at once
galtea.evaluation_tasks.create(
    metrics=[custom_metric.name, "coherence", "toxicity"],
    session_id=session.id
)

Managing Your AI Products

Galtea provides a complete ecosystem for evaluating and monitoring your AI products:

Products

Represents a functionality or service that is evaluated by Galtea.

# List your products
products = galtea.products.list()

# Select a product to work with
product = products[0]

Versions

Represents a specific iteration of a product. This allows for tracking improvements and regressions over time.

# Create a new version of your product
version = galtea.versions.create(
    name="gpt-4-fine-tuned-v2",
    product_id=product.id,
    description="Fine-tuned GPT-4 for medical domain",
    model_id="gpt-4",
    system_prompt="You are a helpful medical assistant..."
)

# List versions of your product
versions = galtea.versions.list(product_id=product.id)

Tests

A collection of test cases designed to evaluate specific aspects of your product versions.

# Create a test
test = galtea.tests.create(
    name="medical-qa-test",
    type="QUALITY",
    product_id=product.id,
    ground_truth_file_path="medical_data.pdf"
)

# Download a test file
test_file = galtea.tests.download(test, output_dir="tests")

Test Cases

A single challenge for evaluating product performance. Each test case typically includes an input and may include an expected output and context.

Sessions

A group of inference results that represent a complete conversation between a user and an AI system.

Inference Results

A single turn in a conversation, including the user's input and the AI's output. These are the raw interactions that can be evaluated.
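
To make the relationship between sessions and inference results concrete, here is a purely local illustration using dataclasses. The class names `Turn` and `Conversation` and their fields are assumptions chosen for this sketch; they are not the SDK's own classes:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Turn:
    """Illustrative stand-in for an inference result: one input and one output."""
    input: str
    output: str
    retrieval_context: Optional[str] = None


@dataclass
class Conversation:
    """Illustrative stand-in for a session: an ordered group of turns for one version."""
    version_id: str
    turns: List[Turn] = field(default_factory=list)

    def add_turn(self, input: str, output: str, retrieval_context: Optional[str] = None) -> Turn:
        # Append a new turn and return it so callers can inspect it.
        turn = Turn(input=input, output=output, retrieval_context=retrieval_context)
        self.turns.append(turn)
        return turn


conversation = Conversation(version_id="your-version-id")
conversation.add_turn("What is aspirin used for?", "Aspirin is commonly used to relieve pain.")
conversation.add_turn("Any side effects?", "It can cause stomach irritation.")
```

Each `Turn` here corresponds to one call to `galtea.inference_results.create`, and the `Conversation` corresponds to the session those calls share.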

Evaluations

A group of inference results from a session that can be evaluated. It acts as a container for all the evaluation tasks that measure how effectively the product version performs.

# Evaluations are created implicitly when you log evaluation tasks.
# For example, when you run this, an evaluation is created behind the scenes:
galtea.evaluation_tasks.create_single_turn(
    metrics=["factual-accuracy"],
    version_id=version.id,
    test_case_id=test_cases[0].id,
    actual_output="Some output from your product."
)

# List evaluations for a product
evaluations = galtea.evaluations.list(product_id=product.id)

Advanced Usage

Custom Metrics

Define custom evaluation criteria specific to your needs:

# Create a custom metric
custom_metric_1 = galtea.metrics.create(
    name="patient-safety-score-v1",
    criteria="Evaluate responses for patient safety considerations",
    evaluation_params=["actual output"]
)

Batch Processing

Efficiently evaluate your model on large datasets:

import pandas as pd
import os

# Load your test queries from a JSON file
queries_file = os.path.join(os.path.dirname(__file__), 'test_data.json')
df = pd.read_json(queries_file)

# Create a session for this batch evaluation
session = galtea.sessions.create(version_id=version.id, is_production=True)

# Process each query
for idx, row in df.iterrows():
    # Get your model's response to the query (your_product_function is a placeholder for your own code)
    model_response = your_product_function(row['query'])

    # Log each turn to the session
    galtea.inference_results.create(
        session_id=session.id,
        input=row['query'],
        output=model_response
    )

# Evaluate the entire session
galtea.evaluation_tasks.create(
    metrics=["relevance", custom_metric_1.name],
    session_id=session.id
)
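
In the loop above, one failing product call aborts the whole batch. A possible variation, sketched below, records failures and keeps going. `log_batch` is a hypothetical helper, and `log_turn` is any callable you supply; it is kept generic so the sketch runs without the SDK:

```python
from typing import Any, Callable, Iterable, List, Mapping, Tuple


def log_batch(
    records: Iterable[Mapping[str, Any]],
    product_fn: Callable[[str], str],
    log_turn: Callable[..., Any],
) -> Tuple[int, List[Tuple[int, str]]]:
    """Call product_fn for each record's "query" and pass the result to log_turn.

    Returns the number of logged turns and a list of (index, error message)
    pairs for records whose product call raised an exception.
    """
    logged = 0
    failures: List[Tuple[int, str]] = []
    for index, record in enumerate(records):
        query = record["query"]
        try:
            output = product_fn(query)
        except Exception as exc:  # keep processing the remaining records
            failures.append((index, str(exc)))
            continue
        log_turn(input=query, output=output)
        logged += 1
    return logged, failures


# Intended usage with the objects from the example above (not executed here):
# logged, failures = log_batch(
#     df.to_dict("records"),
#     your_product_function,
#     lambda **turn: galtea.inference_results.create(session_id=session.id, **turn),
# )
```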

API Reference

Main Classes

  • Galtea: Main client for interacting with the Galtea platform

Product Management

  • galtea.products.list(offset=None, limit=None): List available products
  • galtea.products.get(product_id): Get a specific product by ID
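
Most `list` methods in this reference accept `offset` and `limit`. Assuming these behave as conventional skip and page-size parameters (the README does not spell this out), a generic helper can walk through all pages. `iter_all` is a hypothetical name, not part of the SDK:

```python
from typing import Any, Callable, Iterator, List


def iter_all(list_fn: Callable[..., List[Any]], page_size: int = 50, **kwargs: Any) -> Iterator[Any]:
    """Yield every item from a paginated list function.

    Calls list_fn(offset=..., limit=page_size, **kwargs) repeatedly and stops
    when a page comes back with fewer than page_size items.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    offset = 0
    while True:
        page = list_fn(offset=offset, limit=page_size, **kwargs)
        yield from page
        if len(page) < page_size:
            break
        offset += page_size


# Intended usage (assumes an initialized client named galtea):
# for product in iter_all(galtea.products.list):
#     print(product.id)
# for version in iter_all(galtea.versions.list, product_id=product.id):
#     print(version.id)
```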

Test Management

  • galtea.tests.create(name, type, product_id, ground_truth_file_path=None, test_file_path=None): Create a new test
  • galtea.tests.get(test_id): Retrieve a test by ID
  • galtea.tests.list(product_id, offset=None, limit=None): List tests for a product
  • galtea.tests.download(test, output_dir): Download the test file into the specified directory

Test Cases Management

  • galtea.test_cases.create(test_id, input, expected_output, context=None): Create a new test case
  • galtea.test_cases.get(test_case_id): Get a test case by ID
  • galtea.test_cases.list(test_id, offset=None, limit=None): List test cases for a test
  • galtea.test_cases.delete(test_case_id): Delete a test case by ID

Version Management

  • galtea.versions.create(product_id, name, description=None, ...): Create a new product version
  • galtea.versions.get(version_id): Get a version by ID
  • galtea.versions.list(product_id, offset=None, limit=None): List versions for a product

Metric Management

  • galtea.metrics.create(name, criteria=None, evaluation_steps=None, evaluation_params=None): Create a custom metric
  • galtea.metrics.get(metric_type_id): Get a metric by ID
  • galtea.metrics.list(offset=None, limit=None): List available metrics

Session Management

  • galtea.sessions.create(version_id, ...): Create a new session to log a conversation.
  • galtea.sessions.get(session_id): Get a session by ID.
  • galtea.sessions.list(version_id, ...): List sessions for a version.
  • galtea.sessions.delete(session_id): Delete a session by ID.

Inference Result Management

  • galtea.inference_results.create(session_id, input, output, ...): Log a single turn in a conversation.
  • galtea.inference_results.get(inference_result_id): Get an inference result by ID.
  • galtea.inference_results.list(session_id, ...): List inference results for a session.
  • galtea.inference_results.delete(inference_result_id): Delete an inference result by ID.

Evaluation Management

  • An Evaluation is created implicitly when you create evaluation tasks.
  • galtea.evaluations.get(evaluation_id): Get an evaluation by ID
  • galtea.evaluations.list(product_id, offset=None, limit=None): List evaluations for a product

Evaluation Tasks Management

  • galtea.evaluation_tasks.list(evaluation_id, offset=None, limit=None): List tasks performed for an evaluation
  • galtea.evaluation_tasks.get(evaluation_task_id): Get a specific task by ID
  • galtea.evaluation_tasks.create(metrics, session_id): Create evaluation tasks for all inference results within a given session.
  • galtea.evaluation_tasks.create_single_turn(metrics, version_id, ...): Create an evaluation task for a single-turn interaction, such as one based on a specific test case or a production query.

Getting Help

Authors

This software has been developed by the members of the product team of Galtea Solutions S.L.

License

Apache License 2.0

Download files

Source Distribution

galtea-2.0.0.tar.gz (25.7 kB)

Uploaded Source

Built Distribution

galtea-2.0.0-py3-none-any.whl (33.3 kB)

Uploaded Python 3

File details

Details for the file galtea-2.0.0.tar.gz.

File metadata

  • Download URL: galtea-2.0.0.tar.gz
  • Upload date:
  • Size: 25.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.4 CPython/3.10.12 Linux/6.11.0-1015-azure

File hashes

Hashes for galtea-2.0.0.tar.gz
Algorithm Hash digest
SHA256 baa521c85f66c9c0aa238fdffa6b2e434de03fa6fccc9480b36654bc33366a3a
MD5 a1dd10e1a9cbc443252e984e8634aa7f
BLAKE2b-256 b203e6b01f0a310c6875fe1854e420b1fa5356f9f1c0ddf0a4b28f7c432b0621

File details

Details for the file galtea-2.0.0-py3-none-any.whl.

File metadata

  • Download URL: galtea-2.0.0-py3-none-any.whl
  • Upload date:
  • Size: 33.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.4 CPython/3.10.12 Linux/6.11.0-1015-azure

File hashes

Hashes for galtea-2.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 22e46ad02160e073ccb65f77fba4d7c2972e58d33a720fa5819774e3c850d3a6
MD5 d21d672fd232939bef732047b48be4e9
BLAKE2b-256 b76b7137b495f4fc94706bca71634509b2bed872fa14977b832e19ab3112b308
