
evalmedia

Open-source framework for evaluating AI-generated media quality.

Think "DeepEval but for generative media." Structured, actionable quality assessments for AI-generated images — designed for AI agents, not dashboards.

Website | PyPI | GitHub

Install

pip install evalmedia

With judge backends:

pip install evalmedia[claude]    # Anthropic Claude
pip install evalmedia[openai]    # OpenAI GPT-4.1
pip install evalmedia[all]       # Everything

Quick Start

Single image evaluation

from evalmedia import ImageEval
from evalmedia.checks.image import FaceArtifacts, PromptAdherence, TextLegibility

result = ImageEval.run(
    image="output.png",
    prompt="a woman holding a coffee cup in a cafe",
    checks=[FaceArtifacts(), PromptAdherence(), TextLegibility()],
)

print(result.passed)        # False
print(result.summary())     # "FAIL — 2/3 checks passed (score: 0.65). Failed: face_artifacts."
print(result.to_dict())     # structured JSON for agents
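The exact schema of `result.to_dict()` isn't documented here, but the point is that it's machine-readable. As an illustration only (field names below are assumptions, not the real schema), an agent loop might consume the payload like this:

```python
# Hypothetical result payload; the actual to_dict() schema may differ.
result = {
    "passed": False,
    "score": 0.65,
    "checks": [
        {"name": "face_artifacts", "passed": False, "score": 0.30},
        {"name": "prompt_adherence", "passed": True, "score": 0.90},
        {"name": "text_legibility", "passed": True, "score": 0.80},
    ],
}

# Collect the names of failed checks and turn them into retry feedback.
failed = [c["name"] for c in result["checks"] if not c["passed"]]
if not result["passed"]:
    feedback = f"Regenerate the image; address: {', '.join(failed)}"
```

This is the pattern the "designed for AI agents" framing suggests: the structured output feeds directly back into the generation loop.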

Rubric-based evaluation

from evalmedia import ImageEval
from evalmedia.rubrics import Portrait

result = ImageEval.run(
    image="output.png",
    prompt="professional headshot of a young man",
    rubric=Portrait(),
)

Built-in rubrics: GeneralQuality, Portrait, MarketingAsset.

Async support

result = await ImageEval.arun(
    image=image_bytes,
    prompt=prompt,
    checks=[FaceArtifacts(), PromptAdherence()],
)

Compare multiple images

from evalmedia import compare
from evalmedia.rubrics import GeneralQuality

results = await compare(
    images=["modelA.png", "modelB.png", "modelC.png"],
    prompt="a sunset over mountains",
    rubric=GeneralQuality(),
)

best_label, best_result = results.best()
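Assuming `results.best()` simply returns the entry with the highest overall rubric score, its behavior can be sketched with plain Python (the scores below are illustrative, not real evalmedia output):

```python
# Stand-in for compare(...).best(): pick the highest-scoring image.
# Illustrative scores keyed by image label.
scores = {"modelA.png": 0.72, "modelB.png": 0.88, "modelC.png": 0.61}

best_label = max(scores, key=scores.get)
best_score = scores[best_label]
```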

Checks

Check               Type       What it evaluates
PromptAdherence     VLM        Does the image match what was asked for?
FaceArtifacts       VLM        Distorted faces, wrong eye count, melted features
HandArtifacts       VLM        Extra/missing fingers, distorted hands
TextLegibility      VLM        Is text in the image spelled correctly and readable?
AestheticQuality    VLM        Composition, lighting, color harmony
StyleConsistency    VLM        Does it match a style reference image?
CLIPSimilarity      Classical  CLIP cosine similarity between prompt and image
ResolutionAdequacy  Classical  Is the resolution sufficient?
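The "Classical" checks need no VLM judge. CLIPSimilarity, per the table, embeds the prompt and the image with CLIP and scores their cosine similarity; the metric itself reduces to a dot product over normalized vectors. A minimal sketch (the toy 3-d vectors stand in for real CLIP embeddings, which are 512+ dimensional):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for a prompt and an image.
text_emb = [0.20, 0.90, 0.10]
image_emb = [0.25, 0.85, 0.20]
sim = cosine_similarity(text_emb, image_emb)  # close to 1.0 = good match
```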

Configuration

import evalmedia

# Set global default judge
evalmedia.set_judge("claude", api_key="sk-...")

# Or via environment variables
# EVALMEDIA_DEFAULT_JUDGE=claude
# EVALMEDIA_ANTHROPIC_API_KEY=sk-...
# EVALMEDIA_OPENAI_API_KEY=sk-...

CLI

# Evaluate an image
evalmedia check output.png --prompt "a woman in a cafe" --checks face_artifacts,prompt_adherence

# Use a rubric
evalmedia check output.png --prompt "headshot" --rubric portrait --format json

# Compare images
evalmedia compare outputs/ --prompt "sunset" --rubric general_quality

# List available checks and rubrics
evalmedia list-checks
evalmedia list-rubrics

Agent Integration

Use evalmedia as a tool in AI agent workflows:

from evalmedia.integrations import openai_tool_schema, anthropic_tool_schema

# OpenAI function calling
tools = [openai_tool_schema()]

# Anthropic tool_use
tools = [anthropic_tool_schema()]

Custom Rubrics

from evalmedia.rubrics import Rubric, WeightedCheck
from evalmedia.checks.image import PromptAdherence, TextLegibility, AestheticQuality

rubric = Rubric(
    name="my_rubric",
    checks=[
        WeightedCheck(check=PromptAdherence(), weight=0.4),
        WeightedCheck(check=TextLegibility(), weight=0.3),
        WeightedCheck(check=AestheticQuality(), weight=0.3),
    ],
    pass_threshold=0.75,
)
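The `WeightedCheck` weights and `pass_threshold` imply a simple aggregation: the overall score is presumably the weight-weighted sum of per-check scores, compared against the threshold. A minimal sketch of that arithmetic, with illustrative per-check scores:

```python
# Sketch of weighted-rubric scoring implied by WeightedCheck/pass_threshold.
# (name, weight, illustrative per-check score)
weighted_checks = [
    ("prompt_adherence", 0.4, 0.9),
    ("text_legibility", 0.3, 0.6),
    ("aesthetic_quality", 0.3, 0.8),
]
pass_threshold = 0.75

overall = sum(weight * score for _, weight, score in weighted_checks)
passed = overall >= pass_threshold  # 0.78 >= 0.75
```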

Or via YAML:

name: my_rubric
pass_threshold: 0.75
checks:
  - check: prompt_adherence
    weight: 0.4
  - check: text_legibility
    weight: 0.3
  - check: aesthetic_quality
    weight: 0.3

License

Apache 2.0
