GrandJury SDK — submit LLM traces for human evaluation + analytics client

grandjury

Get human feedback on your AI in 3 lines of Python.

from grandjury import GrandJury

gj = GrandJury()  # reads GRANDJURY_API_KEY from env
gj.trace(name="chat", input=prompt, output=response, model="gpt-4o")

Then open your Jupyter notebook:

df = gj.results()  # traces with human votes — as a DataFrame
print(f"Pass rate: {df['pass_rate'].mean():.1%}")

Patent Pending.

What is GrandJury?

HumanJudge connects your AI to a community of human reviewers who evaluate your model's outputs. GrandJury is the Python SDK — it sends traces and retrieves human evaluation results.

  • Write path: Log AI calls from your app → traces appear in your developer dashboard.
  • Read path: Fetch evaluation results (votes, pass rates, reviewer feedback) into DataFrames for analysis.

Installation

pip install grandjury

Optional performance dependencies:

pip install "grandjury[performance]"  # msgspec, pyarrow, polars (quoted so shells like zsh do not expand the brackets)

Quick Start

1. Register your model

Go to humanjudge.com/projects/new, register your AI, and copy the secret key.

export GRANDJURY_API_KEY=gj_sk_live_...
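Because the client reads GRANDJURY_API_KEY from the environment, a missing variable is easiest to catch at startup. The following is a minimal sketch using only the standard library. The helper name require_api_key is hypothetical and not part of the SDK, and the gj_sk_ prefix check is an assumption inferred from the example key above.

```python
import os


def require_api_key(env=None):
    """Return the GrandJury key from the environment or raise a clear error.

    Hypothetical helper for illustration; not part of the grandjury package.
    """
    env = os.environ if env is None else env
    key = env.get("GRANDJURY_API_KEY", "").strip()
    if not key:
        raise RuntimeError("GRANDJURY_API_KEY is not set; export it before starting the app")
    # Assumption: secret keys start with "gj_sk_", as in the example above.
    if not key.startswith("gj_sk_"):
        raise RuntimeError("GRANDJURY_API_KEY does not look like a secret key")
    return key
```

Calling require_api_key() once during application startup surfaces a configuration problem before the first trace is sent.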

2. Log traces from your app

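from grandjury import GrandJury
from openai import OpenAI

gj = GrandJury()   # zero-config: reads GRANDJURY_API_KEY from env
client = OpenAI()  # reads OPENAI_API_KEY from env

# Option A: Direct call
gj.trace(name="chat", input="What is ML?", output="Machine learning is...", model="gpt-4o")

# Option B: Decorator, auto-captures input/output/latency
@gj.observe(name="chat", model="gpt-4o")
def call_llm(prompt: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content

# Option C: Context manager
prompt = "What is ML?"
with gj.span("chat", input=prompt) as s:
    response = call_llm(prompt)
    s.set_output(response)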
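To make the idea of a decorator that captures input, output, and latency concrete, here is a generic sketch of that pattern. It does not show how gj.observe is implemented. The observe function and the sink callable below are illustrative stand-ins, and fake_llm replaces a real model call.

```python
import functools
import time


def observe(sink, name, model=None):
    """Decorator factory: record input, output and latency of each call.

    `sink` is any callable that accepts a dict; it stands in for whatever
    a tracing SDK does with a captured record.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = func(*args, **kwargs)
            latency_ms = (time.perf_counter() - start) * 1000.0
            sink({
                "name": name,
                "model": model,
                "input": {"args": args, "kwargs": kwargs},
                "output": result,
                "latency_ms": latency_ms,
            })
            return result
        return wrapper
    return decorator


captured = []


@observe(captured.append, name="chat", model="gpt-4o")
def fake_llm(prompt: str) -> str:
    return prompt.upper()  # placeholder for a real model call


reply = fake_llm("what is ml?")
```

The wrapped function returns its result unchanged, so callers need no modification; only the side channel (here the captured list) sees the recorded call.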

3. Get human evaluation results

Once reviewers vote on your traces:

# Trace-level summary
df = gj.results()
# trace_id | input | output | model | pass_count | flag_count | total_votes | pass_rate

# Individual votes with reviewer identity
df_votes = gj.results(detail='votes')
# trace_id | voter_id | voter_name | verdict | flag_category | feedback | created_at

# Filter by benchmark
df_benchmark = gj.results(evaluation='marketing-benchmark')

# Export
df.to_parquet('evaluation_results.parquet')
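The relationship between the vote-level and trace-level layouts can be sketched in plain Python. This assumes that verdict takes the values "pass" or "flag" and that pass_rate equals pass_count divided by total_votes; neither is confirmed by this README, and summarize_votes is a hypothetical helper, not an SDK function. The sample rows are made up.

```python
from collections import defaultdict


def summarize_votes(votes):
    """Roll vote rows up into one summary dict per trace_id."""
    counts = defaultdict(lambda: {"pass_count": 0, "flag_count": 0})
    for vote in votes:
        bucket = counts[vote["trace_id"]]
        # Assumption: verdict is either "pass" or "flag".
        if vote["verdict"] == "pass":
            bucket["pass_count"] += 1
        elif vote["verdict"] == "flag":
            bucket["flag_count"] += 1
    summary = []
    for trace_id, c in counts.items():
        total = c["pass_count"] + c["flag_count"]
        summary.append({
            "trace_id": trace_id,
            "pass_count": c["pass_count"],
            "flag_count": c["flag_count"],
            "total_votes": total,
            # Assumption: pass_rate = pass_count / total_votes.
            "pass_rate": c["pass_count"] / total if total else None,
        })
    return summary


# Made-up sample rows using a subset of the vote-level columns.
votes = [
    {"trace_id": "t1", "voter_id": "v1", "verdict": "pass"},
    {"trace_id": "t1", "voter_id": "v2", "verdict": "flag"},
    {"trace_id": "t1", "voter_id": "v3", "verdict": "pass"},
    {"trace_id": "t2", "voter_id": "v1", "verdict": "flag"},
]
summary = summarize_votes(votes)
```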

4. Run analytics

Works on both live platform data and offline datasets:

# Auto-fetch from platform
gj.analytics.vote_histogram()
gj.analytics.population_confidence(voter_list=[...])

# Or pass your own data
import pandas as pd
df = pd.read_csv("my_votes.csv")
gj.analytics.vote_histogram(df)
gj.analytics.votes_distribution(df)
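For offline use, a votes file could mirror the vote-level column layout listed in step 3. The sketch below builds such a CSV in memory with the standard library and reads it back. That the analytics methods expect exactly these column names is an assumption, the rows are invented sample data, and the timestamp format is illustrative only.

```python
import csv
import io

# Column names copied from the detail='votes' layout in step 3.
VOTE_COLUMNS = ["trace_id", "voter_id", "voter_name", "verdict",
                "flag_category", "feedback", "created_at"]

# Invented sample rows for illustration.
rows = [
    {"trace_id": "t1", "voter_id": "v1", "voter_name": "Ada", "verdict": "pass",
     "flag_category": "", "feedback": "Clear answer",
     "created_at": "2025-01-01T10:00:00Z"},
    {"trace_id": "t1", "voter_id": "v2", "voter_name": "Lin", "verdict": "flag",
     "flag_category": "accuracy", "feedback": "Definition is too loose",
     "created_at": "2025-01-01T10:05:00Z"},
]

# Write the rows as CSV text (use open("my_votes.csv", "w", newline="") for a real file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=VOTE_COLUMNS)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read it back to confirm the layout round-trips.
loaded = list(csv.DictReader(io.StringIO(csv_text)))
```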

Enroll in Benchmarks

List and enroll your model in open benchmarks programmatically:

# Browse available benchmarks
benchmarks = gj.benchmarks.list()

# Enroll with endpoint config
gj.benchmarks.enroll(
    benchmark_id="...",
    model_id="...",
    endpoint_config={
        "endpoint": "https://api.myapp.com/v1/chat/completions",
        "apiKey": "sk-...",
        "request_template": '{"model":"gpt-4o","messages":[{"role":"user","content":"{{prompt}}"}]}',
        "response_path": "choices[0].message.content"
    }
)
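The endpoint_config above pairs a request template containing a {{prompt}} placeholder with a response path. How the platform resolves them is not documented here; the sketch below shows one plausible interpretation, with the hypothetical helpers render_request and extract, so you can check a template and path locally before enrolling.

```python
import json
import re


def render_request(template: str, prompt: str) -> dict:
    """Substitute {{prompt}} into a JSON template and parse the result."""
    # json.dumps escapes quotes and newlines; the surrounding quotes are
    # stripped because the placeholder already sits inside a JSON string.
    escaped = json.dumps(prompt)[1:-1]
    return json.loads(template.replace("{{prompt}}", escaped))


def extract(payload, path: str):
    """Follow a dotted path with [n] list indexes, e.g. choices[0].message.content."""
    current = payload
    for key, index in re.findall(r"([^.\[\]]+)|\[(\d+)\]", path):
        current = current[int(index)] if index else current[key]
    return current


template = '{"model":"gpt-4o","messages":[{"role":"user","content":"{{prompt}}"}]}'
body = render_request(template, 'Say "hi"\nplease')

# A hand-written response in the shape the response_path expects.
response = {"choices": [{"message": {"content": "hi"}}]}
answer = extract(response, "choices[0].message.content")
```

Escaping the prompt before substitution matters: a prompt containing a double quote or newline would otherwise produce invalid JSON.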

Analytics Methods

All analytics methods work on both platform data (gj.results(detail='votes')) and offline data (pandas/polars/CSV/parquet):

Method | Description
gj.analytics.evaluate_model() | Decay-adjusted scoring
gj.analytics.vote_histogram() | Vote time distribution
gj.analytics.vote_completeness() | Completeness per voter
gj.analytics.population_confidence() | Confidence metrics
gj.analytics.majority_good_votes() | Threshold analysis
gj.analytics.votes_distribution() | Votes per inference
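The exact definitions behind these methods are not given in this README. As a rough illustration of what "votes per inference" and "threshold analysis" could mean, the sketch below computes both from vote rows in plain Python. The function names, the "pass" verdict value, and the strict greater-than comparison are all assumptions, and the rows are invented.

```python
from collections import Counter


def votes_per_trace(votes):
    """Count how many votes each trace received."""
    return Counter(v["trace_id"] for v in votes)


def vote_count_distribution(votes):
    """Map 'number of votes' -> 'number of traces with that many votes'."""
    return Counter(votes_per_trace(votes).values())


def majority_good(votes, threshold=0.5):
    """Return trace_ids whose share of "pass" verdicts exceeds threshold."""
    totals = votes_per_trace(votes)
    passes = Counter(v["trace_id"] for v in votes if v["verdict"] == "pass")
    return sorted(t for t in totals if passes[t] / totals[t] > threshold)


# Invented sample rows.
votes = [
    {"trace_id": "t1", "verdict": "pass"},
    {"trace_id": "t1", "verdict": "pass"},
    {"trace_id": "t1", "verdict": "flag"},
    {"trace_id": "t2", "verdict": "flag"},
    {"trace_id": "t3", "verdict": "pass"},
]
per_trace = votes_per_trace(votes)
distribution = vote_count_distribution(votes)
good = majority_good(votes, threshold=0.5)
```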

Privacy

  • gj.results() only returns traces with at least 1 human vote (privacy gate)
  • Zero-vote traces are invisible to the SDK — only visible on the web dashboard
  • Reviewer identity is public (consistent with the platform's public profile/leaderboard model)
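The effect of the privacy gate can be pictured as a filter on vote counts. The sketch below is only an illustration of that rule on invented data; the gate itself is enforced by the platform, and apply_privacy_gate is a hypothetical name.

```python
def apply_privacy_gate(traces, min_votes=1):
    """Keep only traces that have received at least min_votes human votes."""
    return [t for t in traces if t.get("total_votes", 0) >= min_votes]


# Invented sample traces.
traces = [
    {"trace_id": "t1", "total_votes": 3},
    {"trace_id": "t2", "total_votes": 0},  # zero-vote trace: dashboard only
    {"trace_id": "t3", "total_votes": 1},
]
visible = apply_privacy_gate(traces)
```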

API Reference

gj = GrandJury(
    api_key=None,     # reads GRANDJURY_API_KEY from env if not provided
    base_url="https://grandjury-server.onrender.com",
    timeout=5.0,
)

# Write
gj.trace(name, input, output, model, latency_ms, metadata, gj_inference_id)
await gj.atrace(...)  # async version (requires httpx)
gj.observe(name, model, metadata)  # decorator
gj.span(name, input, model, metadata)  # context manager

# Read
gj.results(detail=None, evaluation=None)  # returns DataFrame or list[dict]

# Browse
gj.models.list()
gj.models.get(model_id)
gj.benchmarks.list()
gj.benchmarks.enroll(benchmark_id, model_id, endpoint_config)

# Analytics
gj.analytics.evaluate_model(...)
gj.analytics.vote_histogram(data=None, ...)
gj.analytics.vote_completeness(data=None, voter_list=None, ...)
gj.analytics.population_confidence(data=None, voter_list=None, ...)
gj.analytics.majority_good_votes(data=None, ...)
gj.analytics.votes_distribution(data=None, ...)
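This reference does not say whether write calls raise on a network error or timeout. If your application must never fail because of tracing, one defensive option is to wrap the call. The sketch below is a generic pattern with a hypothetical safe_call helper; it works with any callable and is not part of the SDK.

```python
import logging

logger = logging.getLogger("tracing")


def safe_call(fn, *args, **kwargs):
    """Call fn and swallow any exception so tracing never breaks the app.

    Returns True on success and False if fn raised.
    """
    try:
        fn(*args, **kwargs)
        return True
    except Exception:
        # Log the traceback for later inspection instead of propagating it.
        logger.warning("trace call failed", exc_info=True)
        return False
```

With a client in scope, this could be used as safe_call(gj.trace, name="chat", input=prompt, output=response, model="gpt-4o").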

Contributing

See CONTRIBUTING.md for development setup, testing, and PR guidelines.

License

See LICENSE.

Download files

grandjury-2.1.0.tar.gz (source distribution, 192.8 kB)

  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.9
  • SHA256: 8033674992883a2b966f462bfdb426c90c09e0bb036dd7cf8c686e6276d1d74d
  • MD5: c2a771eae83fa374cc9255d54df9a90e
  • BLAKE2b-256: eb25422a2b31c4bb5d6cf5f2647b41aa0abb52168eb80d7084e286f24e36947c

grandjury-2.1.0-py3-none-any.whl (built distribution, Python 3, 13.6 kB)

  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.9
  • SHA256: 0a6593af4d9d4ea99aa9e7625cdf29d051d63276dbfd951a5ff3eec595cb713d
  • MD5: 49682d4c867c17e45b17d51223450f48
  • BLAKE2b-256: bb0d9396f7455e33830ced2cb81e169d3094e6a8006e9affd186b5b0011b4a1e