A small yet powerful LM Judge

flow-judge

Technical Report | Model Weights | HuggingFace Space | Evaluation Code | Tutorials

flow-judge is a lightweight library for evaluating LLM applications with Flow-Judge-v0.1.

Model

Flow-Judge-v0.1 is an open, small yet powerful language model evaluator trained on a synthetic dataset containing LLM system evaluation data by Flow AI.

You can learn more about the unique features of our model in the technical report.

Features of the library

  • Support for multiple model types: Hugging Face Transformers and vLLM
  • Extensible architecture for custom metrics
  • Pre-defined evaluation metrics
  • Ease of custom metric and rubric creation
  • Batched evaluation for efficient processing
  • Integrations with popular frameworks such as Llama Index and Haystack

Installation

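Install flow-judge using pip. The -e flag performs an editable install of the current directory, so run the first command from the root of a cloned copy of the repository: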

pip install -e ".[vllm,hf]"
pip install 'flash_attn>=2.6.3' --no-build-isolation

Extras available:

  • dev to install development dependencies
  • hf to install Hugging Face Transformers dependencies
  • vllm to install vLLM dependencies
  • llamafile to install Llamafile dependencies
  • baseten to install Baseten dependencies
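
To see which optional dependencies are importable in the current environment, a quick check along these lines may help. This is a minimal sketch, not part of flow-judge: the helper name available_backends is hypothetical, and the mapping from the hf and vllm extras to the import names transformers and vllm is an assumption based on the descriptions above rather than on the package metadata.

```python
from importlib.util import find_spec

# Import names to probe. Assumed from the extras listed above:
# "hf" -> transformers, "vllm" -> vllm; flash_attn is the separate pip install.
MODULES = ("flow_judge", "transformers", "vllm", "flash_attn")


def available_backends(modules=MODULES):
    """Return {module_name: bool} telling whether each module can be located."""
    found = {}
    for name in modules:
        try:
            found[name] = find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for broken or partially installed packages
            found[name] = False
    return found


if __name__ == "__main__":
    for name, ok in available_backends().items():
        print(f"{name:<14} {'installed' if ok else 'missing'}")
```

A missing entry only means the module was not found on the import path; it does not say whether the installed version is compatible.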

Quick Start

Here's a simple example to get you started:

from flow_judge import Vllm, Llamafile, Hf, EvalInput, FlowJudge
from flow_judge.metrics import RESPONSE_FAITHFULNESS_5POINT
from IPython.display import Markdown, display

# If you are running on an Ampere GPU or newer, create a model using VLLM
model = Vllm()

# If you have other applications open taking up VRAM, you can use less VRAM by setting gpu_memory_utilization to a lower value.
# model = Vllm(gpu_memory_utilization=0.70)

# Or, if not running on an Ampere GPU or newer, use Hugging Face Transformers with flash attention disabled
# model = Hf(flash_attn=False)

# Or create a model using Llamafile if you have no Nvidia GPU, for example on macOS with Apple Silicon
# model = Llamafile()

# Initialize the judge
faithfulness_judge = FlowJudge(
    metric=RESPONSE_FAITHFULNESS_5POINT,
    model=model
)

# Sample to evaluate
query = """..."""
context = """..."""
response = """..."""

# Create an EvalInput
# We want to evaluate the response to the customer issue based on the context and the user instructions
eval_input = EvalInput(
    inputs=[
        {"query": query},
        {"context": context},
    ],
    output={"response": response},
)

# Run the evaluation
result = faithfulness_judge.evaluate(eval_input, save_results=False)

# Display the result
display(Markdown(f"__Feedback:__\n{result.feedback}\n\n__Score:__\n{result.score}"))
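
The last line of the example renders the result with IPython, which suits notebooks. In a plain script or terminal, something like the following could be used instead. It is only a sketch: format_result is a hypothetical helper, and it assumes nothing about the result beyond the feedback and score attributes used above. A SimpleNamespace object stands in for a real result so the snippet runs without a model.

```python
from types import SimpleNamespace


def format_result(result):
    """Format an evaluation result as plain text for terminal output."""
    feedback = str(result.feedback).strip()
    return f"Feedback:\n{feedback}\n\nScore: {result.score}"


# Stand-in for the object returned by evaluate(); the values are made up.
example = SimpleNamespace(feedback="The response is supported by the context.", score=5)
print(format_result(example))
```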

Usage

Inference Options

The library supports multiple inference backends to accommodate different hardware configurations and performance needs:

  1. vLLM:

    • Best for NVIDIA GPUs with Ampere architecture or newer (e.g., RTX 3000 series, A100, H100)
    • Offers the highest performance and throughput
    • Requires CUDA-compatible GPU
    from flow_judge import Vllm
    
    model = Vllm()
    
  2. Hugging Face Transformers:

    • Compatible with a wide range of hardware, including older NVIDIA GPUs
    • Supports CPU inference (slower but universally compatible)
    • Slower than vLLM, but runs on more hardware

    If you are running on an Ampere GPU or newer:

    from flow_judge import Hf
    
    model = Hf()
    

    If you are not running on an Ampere GPU or newer, disable flash attention:

    from flow_judge import Hf
    
    model = Hf(flash_attn=False)
    
  3. Llamafile:

    • Ideal for non-NVIDIA hardware, including Apple Silicon
    • Provides good performance on CPUs
    • Self-contained, easy to deploy option
    from flow_judge import Llamafile
    
    model = Llamafile()
    
  4. Baseten:

    • Remote execution.
    • Machine independent.
    • Improved concurrency patterns for larger workloads.
    from flow_judge import Baseten

    model = Baseten()

For detailed information on using Baseten, visit the Baseten readme.

Choose the inference backend that best matches your hardware and performance requirements. The library provides a unified interface for all these options, making it easy to switch between them as needed.
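
The hardware guidance above can be condensed into a small decision helper. The sketch below is illustrative only: suggest_backend is a hypothetical function, it returns the constructor call as a string instead of building a model, and it leaves out Baseten because remote execution does not depend on local hardware. It relies on the fact that Ampere GPUs report a CUDA compute capability of 8.0 or higher.

```python
def suggest_backend(has_nvidia_gpu, compute_capability=None):
    """Suggest a flow-judge model constructor call for the given hardware.

    compute_capability is a (major, minor) tuple such as (8, 6), or None if unknown.
    """
    if not has_nvidia_gpu:
        # Non-NVIDIA hardware (for example Apple Silicon) or CPU-only machines
        return "Llamafile()"
    if compute_capability is not None and tuple(compute_capability) >= (8, 0):
        # Ampere or newer: vLLM is described above as the fastest option
        return "Vllm()"
    # Older or unidentified NVIDIA GPU: Transformers without flash attention
    return "Hf(flash_attn=False)"


print(suggest_backend(True, (8, 6)))
print(suggest_backend(True, (7, 5)))
print(suggest_backend(False))
```

If PyTorch is installed, torch.cuda.is_available() and torch.cuda.get_device_capability() can supply the two arguments.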

Evaluation Metrics

Flow-Judge-v0.1 was trained to handle any custom metric that can be expressed as a combination of evaluation criteria and rubric, and required inputs and outputs.

Pre-defined Metrics

For convenience, the flow-judge library comes with pre-defined metrics such as RESPONSE_CORRECTNESS and RESPONSE_FAITHFULNESS. You can check the full list by running:

from flow_judge.metrics import list_all_metrics

list_all_metrics()

Batched Evaluations

For efficient processing of multiple inputs, you can use the batch_evaluate method:

# Imports
import json
from flow_judge import Vllm, EvalInput, FlowJudge
from flow_judge.metrics import RESPONSE_FAITHFULNESS_5POINT
from IPython.display import Markdown, display

# Initialize the model
model = Vllm()

# Initialize the judge
faithfulness_judge = FlowJudge(
    metric=RESPONSE_FAITHFULNESS_5POINT,
    model=model
)

# Load some sample data
with open("sample_data/csr_assistant.json", "r") as f:
    data = json.load(f)

# Create a list of inputs and outputs
inputs_batch = [
    [
        {"query": sample["query"]},
        {"context": sample["context"]},
    ]
    for sample in data
]
outputs_batch = [{"response": sample["response"]} for sample in data]

# Create a list of EvalInput
eval_inputs_batch = [EvalInput(inputs=inputs, output=output) for inputs, output in zip(inputs_batch, outputs_batch)]

# Run the batch evaluation
results = faithfulness_judge.batch_evaluate(eval_inputs_batch, save_results=False)

# Visualizing the results
for i, result in enumerate(results):
    display(Markdown(f"__Sample {i+1}:__"))
    display(Markdown(f"__Feedback:__\n{result.feedback}\n\n__Score:__\n{result.score}"))
    display(Markdown("---"))
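
Once a batch has been scored, a summary is often more useful than reading every feedback string. The following sketch shows one way to aggregate the scores. summarize_scores is a hypothetical helper, it assumes each result exposes a numeric score attribute as in the loop above, and the SimpleNamespace objects are made-up stand-ins for real results.

```python
from collections import Counter
from statistics import mean
from types import SimpleNamespace


def summarize_scores(results):
    """Return the count, mean score, and per-score counts for a list of results."""
    # Skip results without a score, for example if parsing the output failed
    scores = [r.score for r in results if r.score is not None]
    if not scores:
        return {"count": 0, "mean": None, "distribution": {}}
    return {
        "count": len(scores),
        "mean": mean(scores),
        "distribution": dict(sorted(Counter(scores).items())),
    }


# Stand-ins for batch_evaluate() results; the scores are made up.
fake_results = [SimpleNamespace(score=s) for s in (5, 4, 4, 2)]
print(summarize_scores(fake_results))
```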

Advanced Usage

[!WARNING] There is a reported issue with Phi-3 models that produces gibberish outputs when the context, including input and output, is longer than 4096 tokens. The issue was recently fixed in the transformers library, so for now we recommend using the Hf() model configuration for longer contexts. For more details, refer to #33129 and #6135.
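
A pre-flight check can flag samples likely to cross the 4096-token mark before they are sent to the judge. The sketch below is an assumption-laden illustration: exceeds_context_limit and rough_token_count are hypothetical helpers, the word-based counter is a crude stand-in for a real tokenizer, and the prompt template the library wraps around the inputs adds tokens that are not counted here.

```python
def exceeds_context_limit(texts, count_tokens, limit=4096, reserved_for_output=0):
    """Return True if the texts plus the reserved output budget exceed the limit."""
    total = sum(count_tokens(t) for t in texts) + reserved_for_output
    return total > limit


def rough_token_count(text):
    # Crude stand-in: whitespace-separated words, NOT a real tokenizer.
    # Real token counts are usually higher than word counts.
    return len(text.split())


sample_texts = ["What is the refund policy?", "Refunds are issued within 30 days."]
print(exceeds_context_limit(sample_texts, rough_token_count, reserved_for_output=1000))
```

For an accurate count, pass a function backed by the judge model's own tokenizer in place of rough_token_count.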

Custom Metrics

Create your own evaluation metrics:

from flow_judge import FlowJudge, Vllm
from flow_judge.metrics import CustomMetric, RubricItem

# Define the criteria, the scoring rubric, and the fields the metric needs
custom_metric = CustomMetric(
    name="My Custom Metric",
    criteria="Evaluate based on X, Y, and Z.",
    rubric=[
        RubricItem(score=0, description="Poor performance"),
        RubricItem(score=1, description="Good performance"),
    ],
    required_inputs=["query"],
    required_output="response"
)

# Pass a model instance to the judge, as in the Quick Start
model = Vllm()
judge = FlowJudge(metric=custom_metric, model=model)
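
Because a metric declares its required inputs and output, it can help to check each sample against them before building an EvalInput. The helper below is a sketch under assumptions: build_eval_payload is hypothetical and works on plain dictionaries, producing the same inputs and output shapes that EvalInput receives in the examples in this README. The sample data is made up.

```python
def build_eval_payload(sample, required_inputs, required_output):
    """Build inputs/output in the shape EvalInput is given in this README.

    Raises KeyError naming any required field the sample lacks.
    """
    needed = list(required_inputs) + [required_output]
    missing = [key for key in needed if key not in sample]
    if missing:
        raise KeyError(f"sample is missing required fields: {missing}")
    return {
        # One single-key dict per required input, in the declared order
        "inputs": [{key: sample[key]} for key in required_inputs],
        "output": {required_output: sample[required_output]},
    }


sample = {"query": "How do I reset my password?", "response": "Use the reset link."}
payload = build_eval_payload(sample, required_inputs=["query"], required_output="response")
print(payload)
```

The resulting dictionary has the keyword arguments used earlier, so it could be unpacked as EvalInput(**payload).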

Integrations

We support integrations with the Llama Index evaluation module and Haystack.

We are working on adding integrations with more frameworks.

Development Setup

  1. Clone the repository:

    git clone https://github.com/flowaicom/flow-judge.git
    cd flow-judge
    
  2. Create a virtual environment:

    virtualenv ./.venv
    

    or

    python -m venv ./.venv
    
  3. Activate the virtual environment:

    • On Windows:
      .venv\Scripts\activate
      
    • On macOS and Linux:
      source .venv/bin/activate
      
  4. Install the package in editable mode with development dependencies:

    pip install -e ".[dev]"
    

    or

    pip install -e ".[dev,vllm]"
    

    for vLLM support.

  5. Set up pre-commit hooks:

    pre-commit install
    
  6. Make sure you have trufflehog installed:

    # make trufflehog available in your path
    # macos
    brew install trufflehog
    # linux
    curl -sSfL https://raw.githubusercontent.com/trufflesecurity/trufflehog/main/scripts/install.sh | sh -s -- -b /usr/local/bin
    # nix
    nix profile install nixpkgs#trufflehog
    
  7. Run pre-commit on all files:

    pre-commit run --all-files
    
  8. You're now ready to start developing! You can run the main script with:

    python -m flow_judge
    

Remember to always activate your virtual environment when working on the project. To deactivate the virtual environment when you're done, simply run:

deactivate

Running Tests

To run the tests for Flow-Judge, follow these steps:

  1. Navigate to the root directory of the project in your terminal.

  2. Run the tests using pytest:

    pytest tests/
    

    This will discover and run all the tests in the tests/ directory.

  3. If you want to run a specific test file, you can do so by specifying the file path:

    pytest tests/test_flow_judge.py
    
  4. For more verbose output, you can use the -v flag:

    pytest -v tests/
    

Contributing

Contributions to flow-judge are welcome! Please follow these steps:

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Please ensure that your code adheres to the project's coding standards and passes all tests.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Acknowledgments

Flow-Judge is developed and maintained by the Flow AI team. We appreciate the contributions and feedback from the AI community in making this tool more robust and versatile.
