agentsim

agentsim is a simulation-based evaluation framework for multi-step agents.

what is agentsim?

agentsim helps you iterate by:

  1. Generating insightful synthetic trajectories for testing
  2. Evaluating the trajectories to select the one with the most interesting results
  3. Bootstrapping your Agent architecture for rapid development

why build agentsim?

there are already many LLM eval frameworks out there. so why are we building another one?

  1. Synthetic Dataset Generation Toolkit

AI Applications are too open-ended. Customers don't know what they want. So we need to simulate and understand where our application breaks.

No framework exists for generating synthetic trajectories. Since agent trajectories are variable, we help you increase your test surface area using synthetic trajectories that you can customize with your ideal user persona.

  2. Maximizing Developer Attention

AI Applications can only be aligned by observing the most anomalous trajectories.

We use statistics to analyze multiple trajectories per run and show you the most interesting one first. Test cases that deviate most from the others, or that trigger or break an evaluation, are the most useful for development.

  3. Composable Abstractions

Current frameworks lack the base abstractions to evaluate trajectories and also make it hard to customize the evaluators.

For the beginner, we offer several bootstrapped stacks for chatbots and agents. For the intermediate user, we offer customization of Evaluators, App Settings, and Trajectory generators. For the expert, we allow developers to build their own console and simulation logic.

Quickstart

Input:

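# Assumes run and TopicScoreEvaluator have already been imported from agentsim;
# the import path is not shown in this README.
system_prompt = 'be a good legal assistant'
evaluator = TopicScoreEvaluator('tax credits')
user_profile = 'small business owner looking for tax advice'

run(system_prompt, evaluator, user_profile)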

Output:

Total cost: 7.663e-05
Total tokens: 1175
Average TPS: 804.7948882714338
Elapsed time: 1.459999332902953

Scores:  [8, 8, 9, 6, 9]
Mean:  8.0
Most interesting Chat:  6

Chat:

 SYSTEM 

 be a good legal assistant 

 USER 

 Let's get straight to it. I'm the owner of a thriving artisanal bakery in the city, and I'm struggling to navigate the complexities of accounting and tax regulations as I look to expand my business. 

 ASSISTANT 

 I'm happy to help! As your legal assistant, I'll do my best to guide you through the process of balancing your books and staying on the right side of the tax authorities.
...

Let's get started and work together to get your bakery's finances in order! 

 USER 

 Thanks for getting straight to it! I appreciate your no-nonsense approach. Here's the rundown on my bakery's financial situation:

Revenues: I'm doing around $800,000 in annual sales, with a steady growth rate over the past few years.
...

 ASSISTANT 

 Let's dive into the details. 
 ...
Some general guidelines for grant reporting:

1. Reporting timelines: Ensure you're aware of the required reporting deadlines and submit your reports on time.
2. Required documents: Keep all necessary documents, such as receipt of grants, invoices, and expenses related to the grant.
3. Compliance: Ensure compliance with the terms and conditions of the grant, including any restrictions or restrictions on the use of the funds.
4. Tracking: Keep accurate records of all grant-related expenses, budget vs. actual, and budget variances to track grant performance.

Let me know if there's anything else you'd like to discuss or if there are any specific areas you'd like me to focus on. 

Now that you have the interesting trajectory, iterate on your system_prompt, user_profile, or evaluator!
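The "Most interesting Chat: 6" line in the output above is consistent with picking the score that sits furthest from the mean of 8.0. Here is a minimal sketch of that idea. The selection rule and the function name `most_interesting` are assumptions made for illustration, not agentsim's actual implementation.

```python
from statistics import mean


def most_interesting(scores):
    """Return (index, score) of the score that deviates most from the mean.

    Hypothetical helper: agentsim's real selection logic may differ.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    avg = mean(scores)
    # max() returns the first index when several scores tie on deviation.
    idx = max(range(len(scores)), key=lambda i: abs(scores[i] - avg))
    return idx, scores[idx]


scores = [8, 8, 9, 6, 9]  # scores taken from the Quickstart output
idx, score = most_interesting(scores)
print(f"Most interesting chat: index {idx}, score {score}")
```

Absolute deviation from the mean is the simplest possible choice here. With more trajectories per run, a z-score or a median-based measure would be less sensitive to a single extreme score.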

how to use agentsim?

to set up a test, we need to define 3 things:

  • generator: a function that generates the input data
  • scorer: a function that scores the output data
  • evaluator: a function that checks if the scores are in your target range
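The three pieces above can be sketched as plain Python functions. Everything in this example is hypothetical: the function signatures, the keyword-counting scorer, the 0 to 10 score range, and the `fake_app` stand-in are assumptions chosen to show how the roles fit together, not agentsim's API.

```python
def generator(user_profile, n=3):
    """Hypothetical generator: produce n input messages for a persona."""
    return [f"[{user_profile}] question {i + 1}" for i in range(n)]


def scorer(output, topic):
    """Hypothetical scorer: 5 points per mention of the topic, capped at 10."""
    return min(10, output.lower().count(topic.lower()) * 5)


def evaluator(scores, low=5, high=10):
    """Hypothetical evaluator: True when every score is within [low, high]."""
    return all(low <= s <= high for s in scores)


def fake_app(message):
    """Stand-in for the application under test (assumption, not a real agent)."""
    return f"Here is some guidance on tax credits for: {message}"


inputs = generator("small business owner looking for tax advice")
scores = [scorer(fake_app(m), "tax credits") for m in inputs]
print(scores, evaluator(scores))
```

Keeping the scorer and the evaluator separate means the target range can be tightened or loosened without touching how individual outputs are scored.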

what's next?

features

  • optimal stopping policy
  • versioning based on repo hash & git hash
  • historical scores visualization
  • mock tool calling servers
  • more templated scorers, evaluators, generators
  • support for multiple models
  • typescript version

docs

  • examples, cookbooks, best practices, documentation
  • metrics and reports

Tests: Alignment ! 😡

App: Alignment ? 🧐

Evals: Alignment = 😇
