
AgentSprint TestKit - Professional AI agent evaluation with OpenAI Evals integration


ASTK Package Usage Guide 📖

Step-by-step instructions for using AgentSprint TestKit

This guide shows you exactly how to install and use ASTK to test your AI agents. No technical background required!

🚀 What is ASTK?

ASTK is a tool that tests your AI chatbots and agents to see how well they work. Think of it like a "test suite" for your AI - it asks your agent different questions and measures how good the responses are.

📦 Step 1: Install ASTK

Open your terminal/command prompt and run:

pip install agent-sprint-testkit

✅ Check it worked:

python -m astk.cli --help

You should see a help menu. If you get an error, see Troubleshooting below.

🔑 Step 2: Set Up OpenAI API Key

ASTK uses OpenAI to help evaluate your agent's responses. You need an API key:

  1. Get an API key from OpenAI
  2. Set the key in your terminal:
# On Mac/Linux:
export OPENAI_API_KEY="sk-your-key-here"

# On Windows (Command Prompt):
set OPENAI_API_KEY=sk-your-key-here

# On Windows (PowerShell):
$env:OPENAI_API_KEY = "sk-your-key-here"

๐Ÿ Step 3: Your First Test

Option A: Test the Example Agent

ASTK comes with a built-in example agent for testing:

python -m astk.cli init my-first-test
cd my-first-test
python -m astk.cli benchmark examples/agents/file_qa_agent.py

This will:

  • ✅ Create a test project
  • ✅ Run 8 different scenarios
  • ✅ Generate a detailed report
  • ✅ Show you how well the agent performed

Option B: Test Your Own Agent

If you have your own AI agent, you can test it:

python -m astk.cli benchmark path/to/your-agent.py

Your agent must accept questions as command-line arguments:

python your-agent.py "What is 2+2?"
# Should output: "Agent: 4" or similar

📊 Understanding Results

After running a benchmark, you'll see results like:

{
  "success_rate": 0.67,           // 67% of tests passed
  "complexity_score": 0.58,       // 58% difficulty-weighted score
  "total_duration_seconds": 45.2, // Took 45 seconds total
  "average_response_length": 1247, // Average response was 1,247 characters
  "difficulty_breakdown": {
    "intermediate": {"success_rate": 1.0, "scenarios": "2/2"},
    "advanced": {"success_rate": 0.6, "scenarios": "3/5"},
    "expert": {"success_rate": 0.4, "scenarios": "2/5"}
  },
  "category_breakdown": {
    "reasoning": {"success_rate": 0.67, "scenarios": "2/3"},
    "creativity": {"success_rate": 0.5, "scenarios": "1/2"},
    "ethics": {"success_rate": 1.0, "scenarios": "2/2"}
  },
  "scenarios": [...]              // Details for each test
}

🎯 What this means:

Core Metrics

  • Success Rate: Percentage of scenarios completed successfully
  • Complexity Score: Difficulty-weighted performance (Expert = 3x, Advanced = 2x, Intermediate = 1x)
  • Duration: How fast your agent responds to complex challenges
  • Response Length: How detailed and comprehensive the answers are
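
A difficulty-weighted score like the one above can be sketched as a weighted pass rate. This is an illustration only, assuming the score is (sum of weights of passed scenarios) / (sum of all weights) with the 1x/2x/3x weights listed; ASTK's actual formula may differ:

```python
# Illustrative difficulty-weighted score, NOT ASTK's exact implementation.
# Weights follow the guide: Intermediate = 1x, Advanced = 2x, Expert = 3x.
WEIGHTS = {"intermediate": 1, "advanced": 2, "expert": 3}

def complexity_score(results):
    """results: iterable of (difficulty, passed) pairs."""
    results = list(results)
    total = sum(WEIGHTS[d] for d, _ in results)
    earned = sum(WEIGHTS[d] for d, passed in results if passed)
    return earned / total if total else 0.0

# A made-up run: 1 Intermediate pass, 1 Advanced pass, 1 Advanced fail,
# 1 Expert fail -> earned 3 of 8 weighted points.
runs = [("intermediate", True), ("advanced", True),
        ("advanced", False), ("expert", False)]
print(round(complexity_score(runs), 2))  # → 0.38
```

Note how a single failed Expert scenario costs three times as many points as a failed Intermediate one.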

Advanced Analytics

  • 🎓 Difficulty Breakdown: Performance across challenge levels
    • 📘 Intermediate: Basic problem-solving tasks
    • 📙 Advanced: Complex multi-step reasoning
    • 📕 Expert: Cutting-edge AI capabilities
  • 🏷️ Category Performance: Strengths across different domains
    • 🧠 Reasoning: Logic and problem-solving
    • 🎨 Creativity: Innovation and design thinking
    • ⚖️ Ethics: Responsible AI practices
    • 🔗 Integration: System architecture skills

🌟 AI Capability Ratings

Based on your Complexity Score:

  • 🌟 Exceptional AI (80%+): Expert-level reasoning across multiple domains
  • 🔥 Advanced AI (60-79%): Strong performance on sophisticated tasks
  • 💡 Competent AI (40-59%): Good basic capabilities, room for advanced improvement
  • 📚 Developing AI (<40%): Focus on improving reasoning and problem-solving
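
These tiers amount to a simple threshold lookup. A sketch based on the ranges above (not ASTK's own code):

```python
# Map a complexity score (0.0-1.0) to the capability tiers listed above.
def capability_rating(score: float) -> str:
    if score >= 0.80:
        return "Exceptional AI"
    if score >= 0.60:
        return "Advanced AI"
    if score >= 0.40:
        return "Competent AI"
    return "Developing AI"

print(capability_rating(0.58))  # → Competent AI
```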

🧪 What Tests Does ASTK Run?

ASTK automatically tests your agent with 12 sophisticated scenarios across multiple categories:

🧠 Reasoning & Problem-Solving

  • Multi-step Reasoning (📙 Advanced): Can your agent analyze complex problems, identify security vulnerabilities, and provide detailed solutions?
  • Edge Case Analysis (📘 Intermediate): How well does it handle unusual situations, errors, and unexpected inputs?
  • Performance Optimization (📙 Advanced): Can it analyze code for bottlenecks and suggest detailed performance improvements?

🎨 Creativity & Innovation

  • Creative Problem Solving (📕 Expert): Can your agent design new features and architectures from scratch with implementation details?
  • Adaptive Learning Assessment (📕 Expert): Can it design self-improving systems and machine learning approaches?

🔗 System Integration & Architecture

  • Cross-domain Integration (📕 Expert): How well can it design complete DevOps and CI/CD strategies?
  • Failure Recovery Design (📙 Advanced): Can it create comprehensive error handling and reliability systems?
  • Scalability Architecture (📕 Expert): Can it redesign systems for massive scale (100k+ concurrent users)?

โš–๏ธ Ethics & Compliance

Test What it checks Difficulty
Ethical AI Evaluation Does it understand AI bias, fairness, and responsible AI practices? ๐Ÿ“™ Advanced
Regulatory Compliance Can it design systems that meet GDPR, CCPA, and AI regulations? ๐Ÿ“™ Advanced

💼 Strategic & Future-Tech Analysis

  • Competitive Analysis (📘 Intermediate): Can it analyze markets, competitive positioning, and business strategy?
  • Quantum Computing Readiness (📕 Expert): Does it understand emerging technologies and future-tech implications?

📊 New Metrics You'll Get:

  • 🧠 Complexity Score: Difficulty-weighted performance (Expert tasks count 3x more than Intermediate)
  • 🎓 Difficulty Breakdown: How well your agent handles Intermediate vs Advanced vs Expert challenges
  • 🏷️ Category Performance: Which areas your agent excels in (Reasoning, Creativity, Ethics, etc.)
  • 🏆 Best Category: Your agent's strongest capability area
  • 🌟 AI Capability Assessment: Overall intelligence rating from "Developing" to "Exceptional"

🎯 Common Use Cases

Testing a Simple Chatbot

# Your chatbot file: my_bot.py
#!/usr/bin/env python3
import sys

def main():
    if len(sys.argv) > 1:
        question = " ".join(sys.argv[1:])
        # Your chatbot logic here
        answer = f"Bot says: {question}"
        print(answer)

if __name__ == "__main__":
    main()

Test it:

python -m astk.cli benchmark my_bot.py

Testing Different Agent Types

CLI Agent (takes command line arguments):

python -m astk.cli benchmark my_cli_agent.py

Python Module Agent (has a chat method):

# ASTK will automatically detect and use the chat() method
python -m astk.cli benchmark my_module_agent.py
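
For illustration, a module-style agent could look like the sketch below. The `chat()` name comes from the note above, but the exact signature ASTK detects is an assumption; check the ASTK documentation for your version before relying on it.

```python
# my_module_agent.py - sketch of a module-style agent.
# Assumption: ASTK detects a callable chat(message: str) -> str.

class EchoAgent:
    """Toy agent that returns a canned reply to any question."""

    def chat(self, message: str) -> str:
        return f"Agent: my answer to '{message}'"

# Module-level chat() entry point delegating to the class above.
_agent = EchoAgent()

def chat(message: str) -> str:
    return _agent.chat(message)

if __name__ == "__main__":
    print(chat("What is 2+2?"))  # → Agent: my answer to 'What is 2+2?'
```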

REST API Agent:

# ASTK will try to use the /chat endpoint
python -m astk.cli benchmark http://localhost:8000
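
A REST agent can be served with nothing but the standard library. In this sketch the /chat route comes from the comment above, but the JSON payload shape ({"message": ...} in, {"response": ...} out) is an assumption; verify the exact contract against the ASTK documentation.

```python
# Minimal HTTP agent sketch using only the Python standard library.
# Assumption: the benchmark POSTs JSON like {"message": "..."} to /chat
# and reads a {"response": "..."} reply.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def make_reply(payload: dict) -> dict:
    """Build the (assumed) response body for one question."""
    question = payload.get("message", "")
    return {"response": f"Agent: you asked '{question}'"}

class ChatHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/chat":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(make_reply(payload)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Serve on the address used in the benchmark command above.
    HTTPServer(("localhost", 8000), ChatHandler).serve_forever()
```

Run the server in one terminal, then point the benchmark at http://localhost:8000 in another.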

📋 All Available Commands

# Initialize a new test project
python -m astk.cli init <project-name>

# Run benchmark tests
python -m astk.cli benchmark <agent-path>

# Generate detailed reports
python -m astk.cli report <results-directory>

# Show examples and help
python -m astk.cli examples

# Show version
python -m astk.cli --version

🔧 Troubleshooting

โŒ "Command not found: astk"

Problem: Package not installed properly

Solution:

pip install --upgrade pip
pip install agent-sprint-testkit

Still not working? Try:

python -m pip install agent-sprint-testkit

โŒ "OpenAI API key not found"

Problem: API key not set

Solution:

# Check if it's set:
echo $OPENAI_API_KEY

# Set it:
export OPENAI_API_KEY="sk-your-key-here"

โŒ "Agent failed to respond"

Problem: Your agent doesn't accept command-line arguments

Solution: Make sure your agent works like this:

python your-agent.py "test question"
# Should print something back

Example working agent:

#!/usr/bin/env python3
import sys

if len(sys.argv) > 1:
    question = " ".join(sys.argv[1:])
    print(f"Agent: Here's my response to '{question}'")
else:
    print("Agent: Please ask me a question!")

โŒ Permission errors

Problem: Can't install or run commands

Solution:

# Try with user installation:
pip install --user agent-sprint-testkit

# Add to PATH if needed:
export PATH=$PATH:~/.local/bin

🎮 Quick Examples

1. Basic Test Run

pip install agent-sprint-testkit
export OPENAI_API_KEY="your-key"
python -m astk.cli init test-project
cd test-project
python -m astk.cli benchmark examples/agents/file_qa_agent.py

2. Test Your Own Agent

# Create simple agent
echo '#!/usr/bin/env python3
import sys
if len(sys.argv) > 1:
    print(f"Bot: {sys.argv[1]}")' > my_bot.py

chmod +x my_bot.py

# Test it
python -m astk.cli benchmark my_bot.py

3. Multiple Tests

# Test different agents
python -m astk.cli benchmark agent1.py
python -m astk.cli benchmark agent2.py
python -m astk.cli benchmark http://localhost:8000

# Compare results
python -m astk.cli report benchmark_results/
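
To compare runs side by side yourself, you can read the success_rate field out of each saved result file. The file layout assumed here (a JSON object with a top-level "success_rate", as in the example output earlier in this guide) and the benchmark_results/*.json naming are assumptions; adjust them to match your actual output.

```python
# Rank saved benchmark result files by success rate.
# Assumes each file is a JSON object with a top-level "success_rate".
import json
from pathlib import Path

def load_success_rate(path) -> float:
    return json.loads(Path(path).read_text()).get("success_rate", 0.0)

def rank_results(paths):
    """Return (path, rate) pairs sorted best-first."""
    rates = [(str(p), load_success_rate(p)) for p in paths]
    return sorted(rates, key=lambda pair: pair[1], reverse=True)

if __name__ == "__main__":
    # Hypothetical results directory; change to wherever ASTK wrote yours.
    for path, rate in rank_results(Path("benchmark_results").glob("*.json")):
        print(f"{path}: {rate:.0%}")
```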

📈 Improving Your Agent

Based on ASTK results, you can improve your agent:

  • Low success rate? Make sure your agent handles different question types
  • Slow responses? Optimize your agent's processing speed
  • Short responses? Add more detailed explanations
  • Failed scenarios? Test your agent with the specific question types ASTK uses

💡 Tips for Best Results

  1. Test regularly - Run ASTK after every major change to your agent
  2. Check all scenarios - Make sure your agent handles different types of questions
  3. Monitor performance - Watch response times and success rates
  4. Use the reports - ASTK generates detailed reports to help you improve

🚀 Next Steps

  1. Install ASTK: pip install agent-sprint-testkit
  2. Set API key: export OPENAI_API_KEY="your-key"
  3. Run first test: python -m astk.cli init test && cd test && python -m astk.cli examples
  4. Test your agent: python -m astk.cli benchmark your-agent.py
  5. Review results and improve your agent!

🎯 Ready to test your AI agent?

pip install agent-sprint-testkit && python -m astk.cli --help

Need help? Check the main documentation or open an issue.
