AgentSprint TestKit - Professional AI agent evaluation with OpenAI Evals integration
ASTK Package Usage Guide
Step-by-step instructions for using AgentSprint TestKit
This guide shows you exactly how to install and use ASTK to test your AI agents. No technical background required!
What is ASTK?
ASTK tests your AI chatbots and agents to see how well they work. Think of it as a test suite for your AI: it asks your agent a range of questions and measures the quality of the responses.
Step 1: Install ASTK
Open your terminal/command prompt and run:
pip install agent-sprint-testkit
Check that it worked:
python -m astk.cli --help
You should see a help menu. If you get an error, see Troubleshooting below.
Step 2: Set Up OpenAI API Key
ASTK uses OpenAI to help evaluate your agent's responses. You need an API key:
- Get an API key from OpenAI
- Set the key in your terminal:
# On Mac/Linux:
export OPENAI_API_KEY="sk-your-key-here"
# On Windows (Command Prompt):
set OPENAI_API_KEY=sk-your-key-here
# On Windows (PowerShell):
$env:OPENAI_API_KEY = "sk-your-key-here"
Step 3: Your First Test
Option A: Test the Example Agent
ASTK comes with a built-in example agent for testing:
python -m astk.cli init my-first-test
cd my-first-test
python -m astk.cli benchmark examples/agents/file_qa_agent.py
This will:
- Create a test project
- Run 8 different scenarios
- Generate a detailed report
- Show you how well the agent performed
Option B: Test Your Own Agent
If you have your own AI agent, you can test it:
python -m astk.cli benchmark path/to/your-agent.py
Your agent must accept questions as command-line arguments:
python your-agent.py "What is 2+2?"
# Should output: "Agent: 4" or similar
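The command-line contract above can be checked with a short script before you run a full benchmark. This is a minimal sketch, not ASTK's own harness: the helper name `ask_agent` and the 30-second timeout are assumptions made for illustration.

```python
import subprocess
import sys


def ask_agent(agent_path: str, question: str, timeout: float = 30.0) -> str:
    """Run a CLI agent with the question as a single argument and return its stdout."""
    result = subprocess.run(
        [sys.executable, agent_path, question],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if the agent exits with an error
    )
    return result.stdout.strip()


# Example call (hypothetical path):
# print(ask_agent("path/to/your-agent.py", "What is 2+2?"))
```

If the call raises an error or returns an empty string, the agent is unlikely to pass a benchmark run either.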
Understanding Results
After running a benchmark, you'll see results like:
{
  "success_rate": 0.67,              // 67% of tests passed
  "complexity_score": 0.58,          // 58% difficulty-weighted score
  "total_duration_seconds": 45.2,    // Took 45 seconds total
  "average_response_length": 1247,   // Average response was 1,247 characters
  "difficulty_breakdown": {
    "intermediate": {"success_rate": 1.0, "scenarios": "2/2"},
    "advanced": {"success_rate": 0.6, "scenarios": "3/5"},
    "expert": {"success_rate": 0.4, "scenarios": "2/5"}
  },
  "category_breakdown": {
    "reasoning": {"success_rate": 0.67, "scenarios": "2/3"},
    "creativity": {"success_rate": 0.5, "scenarios": "1/2"},
    "ethics": {"success_rate": 1.0, "scenarios": "2/2"}
  },
  "scenarios": [...]                 // Details for each test
}
What this means:
Core Metrics
- Success Rate: Percentage of scenarios completed successfully
- Complexity Score: Difficulty-weighted performance (Expert = 3x, Advanced = 2x, Intermediate = 1x)
- Duration: How fast your agent responds to complex challenges
- Response Length: How detailed and comprehensive the answers are
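To make the difficulty weighting concrete, here is a small sketch of one plausible way to compute a weighted score from the 3x/2x/1x weights listed above. ASTK's exact formula is not given in this guide, so treat the function `complexity_score` and the sample counts as illustrative assumptions rather than the tool's implementation.

```python
# Weights from the guide: Expert = 3x, Advanced = 2x, Intermediate = 1x.
WEIGHTS = {"intermediate": 1, "advanced": 2, "expert": 3}


def complexity_score(results: dict) -> float:
    """results maps difficulty -> (passed, total); returns a weighted pass ratio in [0, 1]."""
    earned = sum(WEIGHTS[level] * passed for level, (passed, _) in results.items())
    possible = sum(WEIGHTS[level] * total for level, (_, total) in results.items())
    return earned / possible if possible else 0.0


# Hypothetical run: 2/2 intermediate, 2/4 advanced, 1/2 expert.
example = {"intermediate": (2, 2), "advanced": (2, 4), "expert": (1, 2)}
print(f"{complexity_score(example):.2f}")  # (2 + 4 + 3) / (2 + 8 + 6) = 0.56
```

Under this reading, a failed Expert scenario costs three times as much as a failed Intermediate one.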
Advanced Analytics
- Difficulty Breakdown: Performance across challenge levels
  - Intermediate: Basic problem-solving tasks
  - Advanced: Complex multi-step reasoning
  - Expert: Cutting-edge AI capabilities
- Category Performance: Strengths across different domains
  - Reasoning: Logic and problem-solving
  - Creativity: Innovation and design thinking
  - Ethics: Responsible AI practices
  - Integration: System architecture skills
AI Capability Ratings
Based on your Complexity Score:
- Exceptional AI (80%+): Expert-level reasoning across multiple domains
- Advanced AI (60-79%): Strong performance on sophisticated tasks
- Competent AI (40-59%): Good basic capabilities, room for advanced improvement
- Developing AI (<40%): Focus on improving reasoning and problem-solving
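The rating bands above can be expressed as a simple lookup. This is a sketch only: the function name `capability_rating` is made up for illustration, and the handling of scores that fall between the listed bands (for example 79.5%) is an assumption.

```python
def capability_rating(complexity_score: float) -> str:
    """Map a complexity score in [0, 1] to the rating bands listed above."""
    if complexity_score >= 0.80:
        return "Exceptional AI"
    if complexity_score >= 0.60:
        return "Advanced AI"
    if complexity_score >= 0.40:
        return "Competent AI"
    return "Developing AI"


print(capability_rating(0.58))  # Competent AI
```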
What Tests Does ASTK Run?
ASTK automatically tests your agent with 12 scenarios across five categories:
Reasoning & Problem-Solving
| Test | What it checks | Difficulty |
|---|---|---|
| Multi-step Reasoning | Can your agent analyze complex problems, identify security vulnerabilities, and provide detailed solutions? | Advanced |
| Edge Case Analysis | How well does it handle unusual situations, errors, and unexpected inputs? | Intermediate |
| Performance Optimization | Can it analyze code for bottlenecks and suggest detailed performance improvements? | Advanced |
Creativity & Innovation
| Test | What it checks | Difficulty |
|---|---|---|
| Creative Problem Solving | Can your agent design new features and architectures from scratch with implementation details? | Expert |
| Adaptive Learning Assessment | Can it design self-improving systems and machine learning approaches? | Expert |
System Integration & Architecture
| Test | What it checks | Difficulty |
|---|---|---|
| Cross-domain Integration | How well can it design complete DevOps and CI/CD strategies? | Expert |
| Failure Recovery Design | Can it create comprehensive error handling and reliability systems? | Advanced |
| Scalability Architecture | Can it redesign systems for massive scale (100k+ concurrent users)? | Expert |
Ethics & Compliance
| Test | What it checks | Difficulty |
|---|---|---|
| Ethical AI Evaluation | Does it understand AI bias, fairness, and responsible AI practices? | Advanced |
| Regulatory Compliance | Can it design systems that meet GDPR, CCPA, and AI regulations? | Advanced |
Strategic & Future-Tech Analysis
| Test | What it checks | Difficulty |
|---|---|---|
| Competitive Analysis | Can it analyze markets, competitive positioning, and business strategy? | Intermediate |
| Quantum Computing Readiness | Does it understand emerging technologies and future-tech implications? | Expert |
New Metrics You'll Get:
- Complexity Score: Difficulty-weighted performance (Expert tasks count 3x more than Intermediate)
- Difficulty Breakdown: How well your agent handles Intermediate vs Advanced vs Expert challenges
- Category Performance: Which areas your agent excels in (Reasoning, Creativity, Ethics, etc.)
- Best Category: Your agent's strongest capability area
- AI Capability Assessment: Overall intelligence rating from "Developing" to "Exceptional"
Common Use Cases
Testing a Simple Chatbot
#!/usr/bin/env python3
# Your chatbot file: my_bot.py
import sys


def main():
    if len(sys.argv) > 1:
        # Join all arguments so unquoted multi-word questions also work
        question = " ".join(sys.argv[1:])
        # Your chatbot logic here
        answer = f"Bot says: {question}"
        print(answer)


if __name__ == "__main__":
    main()
Test it:
python -m astk.cli benchmark my_bot.py
Testing Different Agent Types
CLI Agent (takes command line arguments):
python -m astk.cli benchmark my_cli_agent.py
Python Module Agent (has a chat method):
# ASTK will automatically detect and use the chat() method
python -m astk.cli benchmark my_module_agent.py
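A module-style agent might look like the sketch below. This guide only says that ASTK looks for a chat() method; the class name `EchoAgent`, the single string argument, and the string return value are assumptions, so check the ASTK documentation for the exact interface it expects.

```python
class EchoAgent:
    """Hypothetical module agent exposing a chat() method."""

    def chat(self, question: str) -> str:
        # Replace this line with your real model call or agent logic.
        return f"Agent: You asked '{question}'"


if __name__ == "__main__":
    import sys

    # Also usable from the command line: python my_module_agent.py "question"
    agent = EchoAgent()
    print(agent.chat(" ".join(sys.argv[1:]) or "Hello"))
```

Keeping the command-line entry point alongside chat() means the same file still satisfies the CLI contract described earlier.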
REST API Agent:
# ASTK will try to use the /chat endpoint
python -m astk.cli benchmark http://localhost:8000
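For local experiments, a REST agent with a /chat endpoint can be stood up using only the Python standard library. The guide does not specify the request or response format ASTK sends, so the JSON shapes used here (a "message" field in, a "response" field out) and the names `ChatHandler` and `make_server` are assumptions for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class ChatHandler(BaseHTTPRequestHandler):
    """Answers POST /chat with a JSON reply; all other paths return 404."""

    def do_POST(self):
        if self.path != "/chat":
            self.send_error(404, "Not found")
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        message = payload.get("message", "")  # assumed request field
        body = json.dumps({"response": f"Agent: You asked '{message}'"}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port: int = 8000) -> HTTPServer:
    """Create (but do not start) a server bound to localhost on the given port."""
    return HTTPServer(("localhost", port), ChatHandler)


# To run the agent: make_server().serve_forever()
```

With the server running, the benchmark command above can be pointed at http://localhost:8000.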
All Available Commands
# Initialize a new test project
python -m astk.cli init <project-name>
# Run benchmark tests
python -m astk.cli benchmark <agent-path>
# Generate detailed reports
python -m astk.cli report <results-directory>
# Show examples and help
python -m astk.cli examples
# Show version
python -m astk.cli --version
Troubleshooting
"Command not found: astk"
Problem: Package not installed properly
Solution:
pip install --upgrade pip
pip install agent-sprint-testkit
Still not working? Try:
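python -m pip install agent-sprint-testkit
"OpenAI API key not found"
Problem: API key not set
Solution:
# Check if it's set (Mac/Linux; on Windows Command Prompt use: echo %OPENAI_API_KEY%):
echo $OPENAI_API_KEY
# Set it (Mac/Linux; for Windows see Step 2):
export OPENAI_API_KEY="sk-your-key-here"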
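The same check can be done from Python, which works identically on Mac, Linux, and Windows. This is a small sketch; the helper name `api_key_is_set` is hypothetical and is not part of ASTK.

```python
import os


def api_key_is_set(env=os.environ) -> bool:
    """Return True if OPENAI_API_KEY is present and non-empty in the given environment."""
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not api_key_is_set():
    print("OPENAI_API_KEY is not set; see Step 2 of this guide.")
```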
"Agent failed to respond"
Problem: Your agent doesn't accept command-line arguments
Solution: Make sure your agent works like this:
python your-agent.py "test question"
# Should print something back
Example working agent:
#!/usr/bin/env python3
import sys

if len(sys.argv) > 1:
    question = " ".join(sys.argv[1:])
    print(f"Agent: Here's my response to '{question}'")
else:
    print("Agent: Please ask me a question!")
Permission errors
Problem: Can't install or run commands
Solution:
# Try with user installation:
pip install --user agent-sprint-testkit
# Add to PATH if needed:
export PATH="$PATH:$HOME/.local/bin"
Quick Examples
1. Basic Test Run
pip install agent-sprint-testkit
export OPENAI_API_KEY="your-key"
python -m astk.cli init test-project
cd test-project
python -m astk.cli benchmark examples/agents/file_qa_agent.py
2. Test Your Own Agent
# Create simple agent
echo '#!/usr/bin/env python3
import sys
if len(sys.argv) > 1:
    print(f"Bot: {sys.argv[1]}")' > my_bot.py
chmod +x my_bot.py
# Test it
python -m astk.cli benchmark my_bot.py
3. Multiple Tests
# Test different agents
python -m astk.cli benchmark agent1.py
python -m astk.cli benchmark agent2.py
python -m astk.cli benchmark http://localhost:8000
# Compare results
python -m astk.cli report benchmark_results/
Improving Your Agent
Based on ASTK results, you can improve your agent:
- Low success rate? Make sure your agent handles different question types
- Slow responses? Optimize your agent's processing speed
- Short responses? Add more detailed explanations
- Failed scenarios? Test your agent with the specific question types ASTK uses
Tips for Best Results
- Test regularly - Run ASTK after every major change to your agent
- Check all scenarios - Make sure your agent handles different types of questions
- Monitor performance - Watch response times and success rates
- Use the reports - ASTK generates detailed reports to help you improve
Next Steps
- Install ASTK:
  pip install agent-sprint-testkit
- Set API key:
  export OPENAI_API_KEY="your-key"
- Run first test:
  python -m astk.cli init test && cd test && python -m astk.cli examples
- Test your agent:
  python -m astk.cli benchmark your-agent.py
- Review results and improve your agent!
Ready to test your AI agent?
pip install agent-sprint-testkit && python -m astk.cli --help
Need help? Check the main documentation or open an issue.