A framework for LLM evaluation through multi-judge councils

Language Model Council

Can LLMs decide amongst themselves who is the best?

Mini Manifesto

Language models are outpacing our abilities to evaluate them. Yet, as capable as we hype up LLMs to be, all of the most popular evals the world uses to rank LLMs are still exclusively human-curated and human-designed.

When a new model gets released, Twitter and Reddit get flooded with claims about this latest model being the new best at one thing or another. With so many benchmarks out there, the truth is that deciding which model is the best has become a matter of reputation and taste. I got tired of humans telling me which LLM is best at X, Y, or Z. And so in this work, I desperately wanted to answer a simple question: Can we get LLMs to decide amongst themselves who is the best?

LMSYS demonstrated that GPT-4 agrees with humans at roughly the same rate that humans agree with each other. Today, more and more of us are using models like GPT-4 in place of human raters. But GPT-4 is only one model and today, it's actually one of the "weakest" models on Chatbot Arena.

And how much does a model’s Chatbot Arena rank really tell us about broad human alignment? The PRISM dataset — which was recognized as the Best Paper at NeurIPS in 2024 — showed that rankings shift dramatically depending on which humans you sample. It turns out that for open-ended prompts, there's no single "right" answer — and different populations have fundamentally different views of what “best” means.

On the other side of the coin, researchers have also found that LLMs exhibit all sorts of biases on gender, religion, and even the value of life. More recent work from the Societal Impacts team at Anthropic and researchers at the Center for AI Safety is finding even more evidence that each model carries its own values, inherited unintentionally or intentionally from the societies and organizations that built it and from the humans who curate and choose its training data.

In the future, you can imagine that we'll have strong LLMs from every country and many organizations, each shaped by different geopolitical priorities, specializations, and cultural values. More of us are coming to rely on AI for more things. For tasks like advising on policies, giving life advice, or predicting the future, the domain may be too speculative or too subjective for any single model to evaluate fairly, and disagreement between AI systems will become more prevalent.

So how do we make decisions amidst dissenting opinions in human society? Well, one thing we do in America is called democracy. We all know democracy is far from perfect, but at its core, democracy is a profound idea based on decentralizing power, giving everyone a voice, and relying on the collective to make an important decision.

That's the spirit behind the Language Model Council. Put LLMs in a democracy and give them agency so that they can elect a leader amongst themselves.

This library provides an open-source archive of all the code and analysis used in our original research paper, and serves as a tool for anyone interested in using a council of LLMs to self-evaluate on a set of prompts.

Getting started

  1. Install with pip.
pip install lm-council
  2. Add your OpenRouter secrets to a .env file.
OPENROUTER_API_KEY = ""

See .env.example for an example. You can get an API key from OpenRouter.

  3. Configure and execute your council.

You can run the council as a standalone Python script or in Jupyter notebooks. See examples/ for example notebooks.

import asyncio

from dotenv import load_dotenv
from lm_council import LanguageModelCouncil


# main() must be a coroutine because lmc.execute() is awaited below.
async def main():
    # Load OPENROUTER_API_KEY from the .env file.
    load_dotenv()

    lmc = LanguageModelCouncil(
        models=[
            "deepseek/deepseek-r1-0528",
            "google/gemini-2.5-flash-lite-preview-06-17",
            "x-ai/grok-3-mini",
            "meta-llama/llama-3.1-8b-instruct",
        ],
    )

    # Run the council on any prompt of your choosing.
    completion, judgment = await lmc.execute("Say hello.")

    # Run the council on many prompts in parallel.
    completions, judgments = await lmc.execute(
        ["Say hello.", "Say goodbye.", "What is your name?", "What is 1 + 1?"]
    )

    # Save and load your council.
    lmc.save("run_0")
    lmc.load("run_0")

    # Shows a leaderboard and returns a scores dataframe.
    return lmc.leaderboard()


asyncio.run(main())
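
Before running the script, it can help to confirm that the key from step 2 is actually visible to the process. The following is a minimal sketch using only the standard library; the `require_api_key` helper is hypothetical and not part of lm-council.

```python
import os


def require_api_key(name: str = "OPENROUTER_API_KEY") -> str:
    """Return the key stored in the environment, or raise a clear error.

    Hypothetical helper for illustration; lm-council does not ship it.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file "
            "and call load_dotenv() before creating the council."
        )
    return value
```

Calling `require_api_key()` right after `load_dotenv()` turns a missing or empty key into an immediate, readable error instead of a failed request later on.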

About the Paper [NAACL 2025, Main]

Our paper, "Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks", presents a case study in which 20 large language models (LLMs) evaluate each other on a highly subjective emotional intelligence task. It was the first to study the application of LLM-as-a-Judge in a democratic setting.

Our paper was accepted to NAACL Main and was presented in Albuquerque, New Mexico in May 2025. You can watch the recording of the talk on YouTube. Slides can be found here.

Authors:

  • Justin Zhao (Independent -> Research Engineer @ Meta Superintelligence Labs)
  • Flor Miriam Plaza-del-Arco (Researcher @ Bocconi University -> Assistant Professor @ Leiden University)
  • Benjamin Genchel (ML Engineer @ Spotify -> Independent)
  • Amanda Cercas Curry (Researcher @ Bocconi University -> Research Scientist @ CENTAI)

You can find in-depth Jupyter notebooks that reproduce the findings and figures reported in the Language Model Council paper under analysis/.

FAQs

Why OpenRouter?

For ease of maintenance, all model outputs are served by OpenRouter. The original implementation for the paper called each organization's own REST API endpoint, which required a lot of boilerplate code to manage the different request schemas and response formats. OpenRouter solves this by letting us query more models through a single unified interface. To parallelize model querying as much as possible while staying within OpenRouter's rate limits, we fetch those limits using your API key before the first batch of requests.
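
The general idea of parallel querying under a limit can be sketched as follows. This is not lm-council's actual implementation: `run_with_limit`, `fake_query`, and the `max_concurrent` cap are hypothetical names, and the cap is only an assumed stand-in for whatever limit OpenRouter reports for your key.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_with_limit(
    prompts: List[str],
    query: Callable[[str], Awaitable[str]],
    max_concurrent: int,
) -> List[str]:
    """Run query() on every prompt with at most max_concurrent calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(prompt: str) -> str:
        # Each call waits here until one of the max_concurrent slots is free.
        async with semaphore:
            return await query(prompt)

    # asyncio.gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(p) for p in prompts))


async def fake_query(prompt: str) -> str:
    """Stand-in for a real model call; makes no network request."""
    await asyncio.sleep(0.01)
    return prompt.upper()


if __name__ == "__main__":
    prompts = ["say hello.", "say goodbye."]
    print(asyncio.run(run_with_limit(prompts, fake_query, max_concurrent=1)))
```

A semaphore bounds the number of simultaneous requests but not the number of requests per unit of time, so a real client that must respect a requests-per-second limit would need additional pacing on top of this.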

Citation

If you find this work helpful or interesting, please consider citing it as follows:

@inproceedings{zhao-etal-2025-language,
  title     = {Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks},
  author    = {Zhao, Justin and Plaza-del-Arco, Flor Miriam and Genchel, Benjamin and Curry, Amanda Cercas},
  editor    = {Chiruzzo, Luis and Ritter, Alan and Wang, Lu},
  booktitle = {Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)},
  pages     = {12395--12450},
  address   = {Albuquerque, New Mexico},
  month     = apr,
  year      = {2025},
  publisher = {Association for Computational Linguistics},
  doi       = {10.18653/v1/2025.naacl-long.617},
  url       = {https://aclanthology.org/2025.naacl-long.617/},
}
