Last released Nov 29, 2025
A Python package for evaluating LLM-generated responses against human references using state-of-the-art LLMs as judges.