AgentCompany Label Evaluation System
A multi-agent system for evaluating and refining labels on text data using the smolagents framework. The system employs a manager-worker architecture in which multiple agents collaborate to reach consensus on label quality.
Features
- Manager Agent: Analyzes the dataset, assesses difficulty, and creates worker agents with specific roles
- Worker Agents: Evaluate labels with diverse personas and perspectives
- Consensus Engine: Multiple strategies for resolving disagreements
- Dynamic Agent Generation: Manager decides how many workers to create and their specific roles
- Hybrid Resolution: Discussion first, then manager tie-break if needed
- Pydantic Validation: Type-safe input/output with Pydantic models
- Retry Logic: Automatic retry with exponential backoff for API calls
- Parallel Processing: Process multiple items concurrently with ThreadPoolExecutor
Supported Task Types
- Classification (sentiment, category, etc.)
- Translation Selection (choosing best translation)
- Keyword Extraction
- Custom label types (user-defined)
Architecture
┌─────────────────────────────────────────────────────────┐
│ Input JSON File │
│ { "task_config": {...}, "items": [...] } │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ MANAGER AGENT (smolagents) │
│ • Samples items to assess difficulty │
│ • Creates worker agents with specific roles │
│ • Plans evaluation strategy │
│ • Tie-breaker when consensus fails │
└─────────────────────────────────────────────────────────┘
│
┌───────────────┼───────────────┐
▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│ Worker 1 │ │ Worker 2 │ │ Worker N │
│ (strict) │◄──┤ (creative)│◄──┤ (domain) │
└───────────┘ └───────────┘ └───────────┘
│ │ │
└───────────────┴───────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ CONSENSUS ENGINE │
│ • Check consensus threshold (configurable, default 60%)│
│ • Resolve via majority vote, union, or discussion │
│ • Manager tie-break if still disagreement │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Output JSON File │
│ [{ "id": "...", "labels": {...}, "evaluation": {...} }]│
└─────────────────────────────────────────────────────────┘
Installation
From PyPI
pip install agentic-eval-team
From source
pip install -e .
Requirements
- Python 3.10+
- smolagents
- openai
- pydantic
- tqdm
Quick Start
1. Prepare Input File
Create an input JSON file with task_config and items:
{
  "task_config": {
    "description": "Evaluate whether the assigned categories and sentiments are accurate",
    "evaluation_criteria": ["accuracy", "consistency", "completeness"],
    "consensus_strategy": "discussion_then_vote",
    "consensus_threshold": 0.6,
    "max_discussion_rounds": 2
  },
  "items": [
    {
      "id": "sample_001",
      "text": "The recent advancement in AI has revolutionized healthcare diagnostics.",
      "labels": {
        "category": "technology",
        "sentiment": "positive"
      }
    }
  ]
}
2. Run Evaluation
# Using the CLI command (after installation)
agentic-eval input.json -o output.json --mock
# Or using Python module
python -m agentic_eval_team input.json -o output.json --mock
# With a real model
agentic-eval input.json -o output.json --endpoint http://localhost:8000/v1 --model llama-3.1-8b
Configuration
Command Line Arguments
| Argument | Description |
|---|---|
| `input` | Input JSON file path (required) |
| `-o, --output` | Output JSON file path |
| `--endpoint` | vLLM/OpenAI-compatible endpoint |
| `--model` | Model identifier |
| `--mock` | Use mock model for testing |
| `--parallel` | Max parallel workers for item processing (default: 4) |
Task Config Options
| Field | Default | Description |
|---|---|---|
| `description` | "Evaluate labels for accuracy and quality" | Task description |
| `evaluation_criteria` | ["accuracy", "reasonableness"] | What to evaluate |
| `consensus_strategy` | "discussion_then_vote" | Resolution strategy |
| `consensus_threshold` | 0.6 | Threshold for consensus (0.0-1.0) |
| `max_discussion_rounds` | 2 | Max rounds before tie-break |
Consensus Strategies
| Strategy | Description |
|---|---|
| `majority_vote` | Simple majority wins |
| `discussion_then_vote` | Discuss until threshold agreement, then vote |
| `full_consensus` | Require unanimous agreement |
| `union` | Combine all extractions (for keywords) |
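To make the differences concrete, here is a minimal sketch of the majority_vote and union resolutions in plain Python (the function names are illustrative, not the package's actual API):

from collections import Counter

def majority_vote(verdicts):
    # majority_vote: the verdict chosen by the most workers wins
    return Counter(verdicts).most_common(1)[0][0]

def union_keywords(extractions):
    # union: merge every worker's keyword list (useful for keyword extraction)
    merged = set()
    for keywords in extractions:
        merged |= set(keywords)
    return sorted(merged)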
Project Structure
agentic-eval-team/
├── agentic_eval_team/ # Main package
│ ├── __init__.py
│ ├── __main__.py # CLI entry point
│ ├── config.py # Configuration
│ ├── models/
│ │ ├── manager.py # Manager agent
│ │ ├── worker.py # Worker agents (with retry logic)
│ │ ├── tools.py # Manager tools
│ │ ├── schema.py # Pydantic models
│ │ └── mock_model.py # Mock model for testing
│ ├── consensus/
│ │ ├── engine.py # Consensus orchestration
│ │ └── strategies.py # Resolution strategies
│ ├── evaluation/
│ │ └── runner.py # Parallel processing runner
│ ├── tasks/
│ │ ├── router.py # Task type detection
│ │ └── prompts.py # Prompt templates
│ └── utils/
│ ├── io.py # JSON I/O utilities
│ ├── retry.py # Retry decorator
│ └── errors.py # Custom exceptions
├── samples/
│ └── input_sample.json # Sample input
├── tests/
│ └── test_core.py # Unit tests
├── pyproject.toml # Package configuration
└── README.md
Testing
python -m unittest discover tests -v
Mock Testing
Use the --mock flag to test without a running LLM server:
agentic-eval samples/input_sample.json -o output.json --mock --parallel 2
Key Improvements
Retry Logic
Worker agents automatically retry failed API calls with exponential backoff (a sketch follows this list):
- Max retries: 3 (configurable)
- Initial delay: 1 second
- Backoff factor: 2x per retry
- Graceful fallback to error response if all retries fail
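A minimal sketch of this behavior; the decorator name and defaults are illustrative and may differ from the actual utils/retry.py:

import time
from functools import wraps

def retry(max_retries=3, initial_delay=1.0, backoff=2.0):
    # Retry the wrapped call on exception, sleeping with exponential backoff between attempts
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_retries:
                        raise  # caller falls back to a graceful error response
                    time.sleep(delay)
                    delay *= backoff
        return wrapper
    return decorator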
Parallel Processing
Items can be processed in parallel using ThreadPoolExecutor:
agentic-eval input.json --parallel 4 # Process 4 items concurrently
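Under the hood this is ordinary ThreadPoolExecutor usage; a minimal sketch, where evaluate_item stands in for the real per-item pipeline:

from concurrent.futures import ThreadPoolExecutor, as_completed

def run_parallel(items, evaluate_item, max_workers=4):
    # Evaluate each item in its own thread and return results in input order
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(evaluate_item, item): item["id"] for item in items}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return [results[item["id"]] for item in items]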
Configurable Consensus Threshold
The consensus_threshold in task_config controls when consensus is reached (a sketch follows this list):
- 0.6 = 60% agreement required
- 1.0 = full consensus required
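Conceptually, the check is a simple vote-share comparison. This minimal sketch (with illustrative names) also explains the "2/2 >= 0.6" reasoning string in the example output below:

from collections import Counter

def reach_consensus(verdicts, threshold):
    # Return the leading verdict if its share of worker votes meets the threshold, else None
    verdict, count = Counter(verdicts).most_common(1)[0]
    if count / len(verdicts) >= threshold:
        return verdict  # e.g. "correct" when 2 of 2 workers agree and the threshold is 0.6
    return None         # no consensus yet: trigger another discussion round or a tie-break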
How It Works
1. Manager Analysis
The manager samples items from the dataset and assesses:
- Difficulty level (low/medium/high)
- Recommended strategies
- Number of discussion rounds
2. Agent Generation
Based on the assessed difficulty and the task type, workers are created with specific roles (a selection sketch follows this list):
- strict_evaluator: Focuses on accuracy and correctness
- creative_reviewer: Looks for edge cases and alternatives
- domain_expert: Checks technical/domain accuracy (for high difficulty)
- lenient_reviewer: Evaluates practical acceptability (for high difficulty)
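A minimal sketch of this selection logic; the role names come from the list above, but the function itself is illustrative:

def select_roles(difficulty):
    # Two baseline reviewers always run; specialists are added for hard datasets
    roles = ["strict_evaluator", "creative_reviewer"]
    if difficulty == "high":
        roles += ["domain_expert", "lenient_reviewer"]
    return roles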
3. Evaluation Loop
For each item (the loop is sketched after this list):
- All workers independently evaluate the labels
- Consensus is checked against the threshold
- If no consensus, all workers discuss and re-evaluate
- After max rounds, manager tie-breaks if still disagreement
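Putting these steps together, the per-item loop looks roughly like the sketch below; the worker and manager method names are hypothetical:

from collections import Counter

def evaluate_item(item, workers, manager, threshold=0.6, max_rounds=2):
    # Independent votes, discussion on disagreement, manager tie-break as a last resort
    verdicts = [w.evaluate(item) for w in workers]  # first pass: independent evaluation
    for round_no in range(max_rounds + 1):
        verdict, count = Counter(verdicts).most_common(1)[0]
        if count / len(verdicts) >= threshold:  # consensus reached
            return {"consensus": verdict, "resolved_by": "workers",
                    "discussion_rounds": round_no}
        if round_no < max_rounds:  # discuss and re-evaluate
            verdicts = [w.discuss(item, verdicts) for w in workers]
    return {"consensus": manager.tie_break(item, verdicts),  # still split after max rounds
            "resolved_by": "manager", "discussion_rounds": max_rounds}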
4. Output
Each item in the output includes:
- Original id and text
- Final labels (possibly modified)
- evaluation_summary with consensus info
Example Output
{
  "id": "sample_001",
  "text": "The recent advancement in AI...",
  "labels": {
    "category": "technology",
    "sentiment": "positive"
  },
  "evaluation_summary": {
    "consensus": "correct",
    "reasoning": "Threshold reached: correct (2/2 >= 0.6)",
    "resolved_by": "workers",
    "worker_count": 2,
    "discussion_rounds": 0
  }
}