AgentCompany Label Evaluation System

A multi-agent system for evaluating and refining labels on text data using the smolagents framework. The system employs a manager-worker architecture where multiple agents collaborate to reach consensus on label quality.

Features

  • Manager Agent: Analyzes the dataset, assesses difficulty, and creates worker agents with specific roles
  • Worker Agents: Evaluate labels with diverse personas and perspectives
  • Consensus Engine: Multiple strategies for resolving disagreements
  • Dynamic Agent Generation: Manager decides how many workers to create and their specific roles
  • Hybrid Resolution: Discussion first, then manager tie-break if needed
  • Pydantic Validation: Type-safe input/output with Pydantic models
  • Retry Logic: Automatic retry with exponential backoff for API calls
  • Parallel Processing: Process multiple items concurrently with ThreadPoolExecutor

Supported Task Types

  • Classification (sentiment, category, etc.)
  • Translation Selection (choosing best translation)
  • Keyword Extraction
  • Custom label types (user-defined)

Architecture

┌─────────────────────────────────────────────────────────┐
│                    Input JSON File                      │
│  { "task_config": {...}, "items": [...] }               │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│                  MANAGER AGENT (smolagents)             │
│  • Samples items to assess difficulty                  │
│  • Creates worker agents with specific roles           │
│  • Plans evaluation strategy                           │
│  • Tie-breaker when consensus fails                    │
└─────────────────────────────────────────────────────────┘
                          │
          ┌───────────────┼───────────────┐
          ▼               ▼               ▼
    ┌───────────┐   ┌───────────┐   ┌───────────┐
    │  Worker 1 │   │  Worker 2 │   │  Worker N │
    │ (strict)  │◄──┤ (creative)│◄──┤ (domain)  │
    └───────────┘   └───────────┘   └───────────┘
          │               │               │
          └───────────────┴───────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│                  CONSENSUS ENGINE                       │
│  • Check consensus threshold (configurable, default 60%)│
│  • Resolve via majority vote, union, or discussion     │
│  • Manager tie-break if still disagreement             │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│                   Output JSON File                      │
│  [{ "id": "...", "labels": {...}, "evaluation": {...} }]│
└─────────────────────────────────────────────────────────┘

Installation

From PyPI

pip install agentic-eval-team

From source

pip install -e .

Requirements

  • Python 3.10+
  • smolagents
  • openai
  • pydantic
  • tqdm

Quick Start

1. Prepare Input File

Create an input JSON file with task_config and items:

{
  "task_config": {
    "description": "Evaluate whether the assigned categories and sentiments are accurate",
    "evaluation_criteria": ["accuracy", "consistency", "completeness"],
    "consensus_strategy": "discussion_then_vote",
    "consensus_threshold": 0.6,
    "max_discussion_rounds": 2
  },
  "items": [
    {
      "id": "sample_001",
      "text": "The recent advancement in AI has revolutionized healthcare diagnostics.",
      "labels": {
        "category": "technology",
        "sentiment": "positive"
      }
    }
  ]
}

2. Run Evaluation

# Using the CLI command (after installation)
agentic-eval input.json -o output.json --mock

# Or using Python module
python -m agentic_eval_team input.json -o output.json --mock

# With a real model
agentic-eval input.json -o output.json --endpoint http://localhost:8000/v1 --model llama-3.1-8b

Configuration

Command Line Arguments

Argument       Description
input          Input JSON file path (required)
-o, --output   Output JSON file path
--endpoint     vLLM/OpenAI-compatible endpoint
--model        Model identifier
--mock         Use mock model for testing
--parallel     Max parallel workers for item processing (default: 4)

Task Config Options

Field                  Default                                       Description
description            "Evaluate labels for accuracy and quality"    Task description
evaluation_criteria    ["accuracy", "reasonableness"]                What to evaluate
consensus_strategy     "discussion_then_vote"                        Resolution strategy
consensus_threshold    0.6                                           Threshold for consensus (0.0-1.0)
max_discussion_rounds  2                                             Max rounds before tie-break
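
These options map naturally onto a Pydantic model (the package validates I/O with Pydantic). The sketch below is illustrative only: field names and defaults follow the table above, but the class name TaskConfig and its exact shape are assumptions, not necessarily what models/schema.py defines.

# Illustrative Pydantic schema for task_config; the class name and
# exact shape are assumptions (see models/schema.py for the real one).
from pydantic import BaseModel, Field

class TaskConfig(BaseModel):
    description: str = "Evaluate labels for accuracy and quality"
    evaluation_criteria: list[str] = ["accuracy", "reasonableness"]
    consensus_strategy: str = "discussion_then_vote"
    consensus_threshold: float = Field(0.6, ge=0.0, le=1.0)
    max_discussion_rounds: int = 2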

Consensus Strategies

Strategy              Description
majority_vote         Simple majority wins
discussion_then_vote  Discuss until threshold agreement, then vote
full_consensus        Require unanimous agreement
union                 Combine all extractions (for keywords)
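
For intuition, the union strategy for keyword extraction amounts to merging every worker's extractions into one deduplicated list. The function below is an illustrative sketch, not the package's API (the real strategies live in consensus/strategies.py):

# Illustrative union-style resolution for keyword labels; the function
# name and signature are assumptions, not the package API.
def resolve_union(worker_extractions: list[list[str]]) -> list[str]:
    merged: dict[str, None] = {}  # a dict preserves first-seen order
    for keywords in worker_extractions:
        for kw in keywords:
            merged.setdefault(kw, None)
    return list(merged)

# resolve_union([["AI", "healthcare"], ["AI", "diagnostics"]])
# -> ["AI", "healthcare", "diagnostics"]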

Project Structure

agentic-eval-team/
├── agentic_eval_team/           # Main package
│   ├── __init__.py
│   ├── __main__.py            # CLI entry point
│   ├── config.py              # Configuration
│   ├── models/
│   │   ├── manager.py         # Manager agent
│   │   ├── worker.py          # Worker agents (with retry logic)
│   │   ├── tools.py           # Manager tools
│   │   ├── schema.py          # Pydantic models
│   │   └── mock_model.py      # Mock model for testing
│   ├── consensus/
│   │   ├── engine.py          # Consensus orchestration
│   │   └── strategies.py      # Resolution strategies
│   ├── evaluation/
│   │   └── runner.py          # Parallel processing runner
│   ├── tasks/
│   │   ├── router.py          # Task type detection
│   │   └── prompts.py         # Prompt templates
│   └── utils/
│       ├── io.py              # JSON I/O utilities
│       ├── retry.py           # Retry decorator
│       └── errors.py          # Custom exceptions
├── samples/
│   └── input_sample.json      # Sample input
├── tests/
│   └── test_core.py           # Unit tests
├── pyproject.toml            # Package configuration
└── README.md

Testing

python -m unittest discover tests -v

Mock Testing

Use the --mock flag to test without a running LLM server:

agentic-eval samples/input_sample.json -o output.json --mock --parallel 2

Key Improvements

Retry Logic

Worker agents automatically retry failed API calls with exponential backoff (see the sketch below):

  • Max retries: 3 (configurable)
  • Initial delay: 1 second
  • Backoff factor: 2x per retry
  • Graceful fallback to error response if all retries fail
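
As a rough sketch, a decorator with those defaults could look like the following; the names and signature are assumptions, not the actual utils/retry.py API:

# Illustrative retry decorator with exponential backoff; the defaults
# mirror the values listed above, but the names are assumptions.
import functools
import time

def with_retry(max_retries=3, initial_delay=1.0, backoff=2.0):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(max_retries):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_retries - 1:
                        raise  # caller falls back to an error response
                    time.sleep(delay)
                    delay *= backoff  # 1s, 2s, 4s, ...
        return wrapper
    return decorator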

Parallel Processing

Items can be processed in parallel using a ThreadPoolExecutor:

agentic-eval input.json --parallel 4  # Process 4 items concurrently
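
Programmatically, the same pattern reduces to mapping an evaluation function over the items with a thread pool. This is a sketch only; evaluate_item stands in for the package's actual per-item logic in evaluation/runner.py:

# Sketch of parallel item processing; evaluate_item is a placeholder
# for the real per-item evaluation in evaluation/runner.py.
from concurrent.futures import ThreadPoolExecutor

def evaluate_item(item: dict) -> dict:
    ...  # run workers + consensus for one item

def run_parallel(items: list[dict], max_workers: int = 4) -> list[dict]:
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate_item, items))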

Configurable Consensus Threshold

The consensus_threshold in task_config controls when consensus is reached:

  • 0.6 = at least 60% of workers must agree
  • 1.0 = unanimous agreement required
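
Concretely, with three workers voting correct, correct, incorrect, the top verdict has 2/3 ≈ 0.67 agreement, which clears a 0.6 threshold but not a 1.0 one. A minimal version of that check (illustrative, not the package's consensus engine):

# Illustrative threshold check: fraction of workers behind the most
# common verdict, compared against the configured threshold.
from collections import Counter

def has_consensus(verdicts: list[str], threshold: float) -> bool:
    _, count = Counter(verdicts).most_common(1)[0]
    return count / len(verdicts) >= threshold

has_consensus(["correct", "correct", "incorrect"], 0.6)  # True  (2/3 >= 0.6)
has_consensus(["correct", "correct", "incorrect"], 1.0)  # False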

How It Works

1. Manager Analysis

The manager samples items from the dataset and assesses:

  • Difficulty level (low/medium/high)
  • Recommended strategies
  • Number of discussion rounds

2. Agent Generation

Based on the assessed difficulty and the task type, workers are created with specific roles (see the sketch after this list):

  • strict_evaluator: Focuses on accuracy and correctness
  • creative_reviewer: Looks for edge cases and alternatives
  • domain_expert: Checks technical/domain accuracy (for high difficulty)
  • lenient_reviewer: Evaluates practical acceptability (for high difficulty)
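
One plausible way to express that mapping is shown below; in practice the manager agent decides the roles dynamically at runtime, so treat this table as illustrative:

# Hypothetical role selection by assessed difficulty; the real choice
# is made by the manager agent at runtime.
BASE_ROLES = ["strict_evaluator", "creative_reviewer"]
HIGH_DIFFICULTY_EXTRAS = ["domain_expert", "lenient_reviewer"]

def pick_roles(difficulty: str) -> list[str]:
    roles = list(BASE_ROLES)
    if difficulty == "high":
        roles += HIGH_DIFFICULTY_EXTRAS
    return roles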

3. Evaluation Loop

For each item (sketched in code after this list):

  1. All workers independently evaluate the labels
  2. Consensus is checked against the threshold
  3. If no consensus, all workers discuss and re-evaluate
  4. After the maximum number of rounds, the manager tie-breaks any remaining disagreement
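
The loop has roughly this shape (illustrative only; the worker and manager objects and their method names are placeholders, and the real orchestration lives in consensus/engine.py):

# Illustrative per-item loop; object and method names are placeholders.
from collections import Counter

def agreement(verdicts):
    top, count = Counter(verdicts).most_common(1)[0]
    return top, count / len(verdicts)

def evaluate_one(item, workers, manager, config):
    verdicts = [w.evaluate(item) for w in workers]               # 1. independent pass
    for _ in range(config.max_discussion_rounds):
        top, frac = agreement(verdicts)
        if frac >= config.consensus_threshold:                   # 2. threshold met
            return top
        verdicts = [w.discuss(item, verdicts) for w in workers]  # 3. discussion round
    top, frac = agreement(verdicts)
    if frac >= config.consensus_threshold:
        return top
    return manager.tie_break(item, verdicts)                     # 4. manager tie-break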

4. Output

Each item in the output includes:

  • Original id and text
  • Final labels (possibly modified)
  • evaluation_summary with consensus info

Example Output

{
  "id": "sample_001",
  "text": "The recent advancement in AI...",
  "labels": {
    "category": "technology",
    "sentiment": "positive"
  },
  "evaluation_summary": {
    "consensus": "correct",
    "reasoning": "Threshold reached: correct (2/2 >= 0.6)",
    "resolved_by": "workers",
    "worker_count": 2,
    "discussion_rounds": 0
  }
}
