AgentCompany Label Evaluation System

A multi-agent system for evaluating and refining labels on text data using the smolagents framework. The system employs a manager-worker architecture where multiple agents collaborate to reach consensus on label quality.

Features

  • Manager Agent: Analyzes the dataset, assesses difficulty, and creates worker agents with specific roles
  • Worker Agents: Evaluate labels with diverse personas and perspectives
  • Consensus Engine: Multiple strategies for resolving disagreements
  • Dynamic Agent Generation: Manager decides how many workers to create and their specific roles
  • Hybrid Resolution: Discussion first, then manager tie-break if needed
  • Pydantic Validation: Type-safe input/output with Pydantic models
  • Retry Logic: Automatic retry with exponential backoff for API calls
  • Parallel Processing: Process multiple items concurrently with ThreadPoolExecutor

Supported Task Types

  • Classification (sentiment, category, etc.)
  • Translation Selection (choosing best translation)
  • Keyword Extraction
  • Custom label types (user-defined)

Architecture

┌─────────────────────────────────────────────────────────┐
│                    Input JSON File                      │
│  { "task_config": {...}, "items": [...] }               │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│                  MANAGER AGENT (smolagents)             │
│  • Samples items to assess difficulty                  │
│  • Creates worker agents with specific roles           │
│  • Plans evaluation strategy                           │
│  • Tie-breaker when consensus fails                    │
└─────────────────────────────────────────────────────────┘
                          │
          ┌───────────────┼───────────────┐
          ▼               ▼               ▼
    ┌───────────┐   ┌───────────┐   ┌───────────┐
    │  Worker 1 │   │  Worker 2 │   │  Worker N │
    │ (strict)  │◄──┤ (creative)│◄──┤ (domain)  │
    └───────────┘   └───────────┘   └───────────┘
          │               │               │
          └───────────────┴───────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│                  CONSENSUS ENGINE                       │
│  • Check consensus threshold (configurable, default 60%)│
│  • Resolve via majority vote, union, or discussion     │
│  • Manager tie-break if still disagreement             │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│                   Output JSON File                      │
│  [{ "id": "...", "labels": {...}, "evaluation": {...} }]│
└─────────────────────────────────────────────────────────┘

Installation

From PyPI

pip install agentic-eval-team

From source

pip install -e .

Requirements

  • Python 3.10+
  • smolagents
  • openai
  • pydantic
  • tqdm

Quick Start

1. Prepare Input File

Create an input JSON file with task_config and items:

{
  "task_config": {
    "description": "Evaluate whether the assigned categories and sentiments are accurate",
    "evaluation_criteria": ["accuracy", "consistency", "completeness"],
    "consensus_strategy": "discussion_then_vote",
    "consensus_threshold": 0.6,
    "max_discussion_rounds": 2
  },
  "items": [
    {
      "id": "sample_001",
      "text": "The recent advancement in AI has revolutionized healthcare diagnostics.",
      "labels": {
        "category": "technology",
        "sentiment": "positive"
      }
    }
  ]
}

2. Run Evaluation

# Using the CLI command (after installation)
agentic-eval input.json -o output.json --mock

# Or using Python module
python -m agentic_eval_team input.json -o output.json --mock

# With a real model
agentic-eval input.json -o output.json --endpoint http://localhost:8000/v1 --model llama-3.1-8b

Configuration

Command Line Arguments

Argument       Description
input          Input JSON file path (required)
-o, --output   Output JSON file path
--endpoint     vLLM/OpenAI-compatible endpoint
--model        Model identifier
--mock         Use mock model for testing
--parallel     Max parallel workers for item processing (default: 4)

Task Config Options

Field                  Default                                       Description
description            "Evaluate labels for accuracy and quality"    Task description
evaluation_criteria    ["accuracy", "reasonableness"]                What to evaluate
consensus_strategy     "discussion_then_vote"                        Resolution strategy
consensus_threshold    0.6                                           Threshold for consensus (0.0-1.0)
max_discussion_rounds  2                                             Max rounds before tie-break
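
These options map naturally onto a Pydantic model (the package validates I/O with Pydantic). The sketch below is illustrative only: field names and defaults follow the table above, but the class name TaskConfig and its exact shape are assumptions, not necessarily what models/schema.py defines.

# Illustrative Pydantic schema for task_config; the class name and
# exact shape are assumptions (see models/schema.py for the real one).
from pydantic import BaseModel, Field

class TaskConfig(BaseModel):
    description: str = "Evaluate labels for accuracy and quality"
    evaluation_criteria: list[str] = ["accuracy", "reasonableness"]
    consensus_strategy: str = "discussion_then_vote"
    consensus_threshold: float = Field(0.6, ge=0.0, le=1.0)
    max_discussion_rounds: int = 2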

Consensus Strategies

Strategy              Description
majority_vote         Simple majority wins
discussion_then_vote  Discuss until threshold agreement, then vote
full_consensus        Require unanimous agreement
union                 Combine all extractions (for keywords)
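
For intuition, the union strategy for keyword extraction amounts to merging every worker's extractions into one deduplicated list. The function below is an illustrative sketch, not the package's API (the real strategies live in consensus/strategies.py):

# Illustrative union-style resolution for keyword labels; the function
# name and signature are assumptions, not the package API.
def resolve_union(worker_extractions: list[list[str]]) -> list[str]:
    merged: dict[str, None] = {}  # a dict preserves first-seen order
    for keywords in worker_extractions:
        for kw in keywords:
            merged.setdefault(kw, None)
    return list(merged)

# resolve_union([["AI", "healthcare"], ["AI", "diagnostics"]])
# -> ["AI", "healthcare", "diagnostics"]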

Project Structure

agentic-eval-team/
├── agentic_eval_team/           # Main package
│   ├── __init__.py
│   ├── __main__.py            # CLI entry point
│   ├── config.py              # Configuration
│   ├── models/
│   │   ├── manager.py         # Manager agent
│   │   ├── worker.py          # Worker agents (with retry logic)
│   │   ├── tools.py           # Manager tools
│   │   ├── schema.py          # Pydantic models
│   │   └── mock_model.py      # Mock model for testing
│   ├── consensus/
│   │   ├── engine.py          # Consensus orchestration
│   │   └── strategies.py      # Resolution strategies
│   ├── evaluation/
│   │   └── runner.py          # Parallel processing runner
│   ├── tasks/
│   │   ├── router.py          # Task type detection
│   │   └── prompts.py         # Prompt templates
│   └── utils/
│       ├── io.py              # JSON I/O utilities
│       ├── retry.py           # Retry decorator
│       └── errors.py          # Custom exceptions
├── samples/
│   └── input_sample.json      # Sample input
├── tests/
│   └── test_core.py           # Unit tests
├── pyproject.toml            # Package configuration
└── README.md

Testing

python -m unittest discover tests -v

Mock Testing

Use the --mock flag to test without a running LLM server:

agentic-eval samples/input_sample.json -o output.json --mock --parallel 2

Key Improvements

Retry Logic

Worker agents automatically retry failed API calls with exponential backoff (see the sketch below):

  • Max retries: 3 (configurable)
  • Initial delay: 1 second
  • Backoff factor: 2x per retry
  • Graceful fallback to error response if all retries fail
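
As a rough sketch, a decorator with those defaults could look like the following; the names and signature are assumptions, not the actual utils/retry.py API:

# Illustrative retry decorator with exponential backoff; the defaults
# mirror the values listed above, but the names are assumptions.
import functools
import time

def with_retry(max_retries=3, initial_delay=1.0, backoff=2.0):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(max_retries):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_retries - 1:
                        raise  # caller falls back to an error response
                    time.sleep(delay)
                    delay *= backoff  # 1s, 2s, 4s, ...
        return wrapper
    return decorator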

Parallel Processing

Items can be processed in parallel using a ThreadPoolExecutor:

agentic-eval input.json --parallel 4  # Process 4 items concurrently
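
Programmatically, the same pattern reduces to mapping an evaluation function over the items with a thread pool. This is a sketch only; evaluate_item stands in for the package's actual per-item logic in evaluation/runner.py:

# Sketch of parallel item processing; evaluate_item is a placeholder
# for the real per-item evaluation in evaluation/runner.py.
from concurrent.futures import ThreadPoolExecutor

def evaluate_item(item: dict) -> dict:
    ...  # run workers + consensus for one item

def run_parallel(items: list[dict], max_workers: int = 4) -> list[dict]:
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate_item, items))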

Configurable Consensus Threshold

The consensus_threshold in task_config controls when consensus is reached:

  • 0.6 = at least 60% of workers must agree
  • 1.0 = unanimous agreement required
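
Concretely, with three workers voting correct, correct, incorrect, the top verdict has 2/3 ≈ 0.67 agreement, which clears a 0.6 threshold but not a 1.0 one. A minimal version of that check (illustrative, not the package's consensus engine):

# Illustrative threshold check: fraction of workers behind the most
# common verdict, compared against the configured threshold.
from collections import Counter

def has_consensus(verdicts: list[str], threshold: float) -> bool:
    _, count = Counter(verdicts).most_common(1)[0]
    return count / len(verdicts) >= threshold

has_consensus(["correct", "correct", "incorrect"], 0.6)  # True  (2/3 >= 0.6)
has_consensus(["correct", "correct", "incorrect"], 1.0)  # False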

How It Works

1. Manager Analysis

The manager samples items from the dataset and assesses:

  • Difficulty level (low/medium/high)
  • Recommended strategies
  • Number of discussion rounds

2. Agent Generation

Based on the assessed difficulty and the task type, workers are created with specific roles (see the sketch after this list):

  • strict_evaluator: Focuses on accuracy and correctness
  • creative_reviewer: Looks for edge cases and alternatives
  • domain_expert: Checks technical/domain accuracy (for high difficulty)
  • lenient_reviewer: Evaluates practical acceptability (for high difficulty)
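
One plausible way to express that mapping is shown below; in practice the manager agent decides the roles dynamically at runtime, so treat this table as illustrative:

# Hypothetical role selection by assessed difficulty; the real choice
# is made by the manager agent at runtime.
BASE_ROLES = ["strict_evaluator", "creative_reviewer"]
HIGH_DIFFICULTY_EXTRAS = ["domain_expert", "lenient_reviewer"]

def pick_roles(difficulty: str) -> list[str]:
    roles = list(BASE_ROLES)
    if difficulty == "high":
        roles += HIGH_DIFFICULTY_EXTRAS
    return roles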

3. Evaluation Loop

For each item (sketched in code after this list):

  1. All workers independently evaluate the labels
  2. Consensus is checked against the threshold
  3. If no consensus, all workers discuss and re-evaluate
  4. After the maximum number of rounds, the manager tie-breaks any remaining disagreement
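
The loop has roughly this shape (illustrative only; the worker and manager objects and their method names are placeholders, and the real orchestration lives in consensus/engine.py):

# Illustrative per-item loop; object and method names are placeholders.
from collections import Counter

def agreement(verdicts):
    top, count = Counter(verdicts).most_common(1)[0]
    return top, count / len(verdicts)

def evaluate_one(item, workers, manager, config):
    verdicts = [w.evaluate(item) for w in workers]               # 1. independent pass
    for _ in range(config.max_discussion_rounds):
        top, frac = agreement(verdicts)
        if frac >= config.consensus_threshold:                   # 2. threshold met
            return top
        verdicts = [w.discuss(item, verdicts) for w in workers]  # 3. discussion round
    top, frac = agreement(verdicts)
    if frac >= config.consensus_threshold:
        return top
    return manager.tie_break(item, verdicts)                     # 4. manager tie-break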

4. Output

Each item in the output includes:

  • Original id and text
  • Final labels (possibly modified)
  • evaluation_summary with consensus info

Example Output

{
  "id": "sample_001",
  "text": "The recent advancement in AI...",
  "labels": {
    "category": "technology",
    "sentiment": "positive"
  },
  "evaluation_summary": {
    "consensus": "correct",
    "reasoning": "Threshold reached: correct (2/2 >= 0.6)",
    "resolved_by": "workers",
    "worker_count": 2,
    "discussion_rounds": 0
  }
}
