Skip to main content

AgentSim is a simulation based evaluation framework for multi-step agents.

Project description

agentsim

agentsim is a simulation based evaluation framework for multi-step agents.

what is agentsim?

agentsim helps you iterate by:

  1. Generating insightful synthetic trajectories for testing
  2. Evaluating the trajectories to select the one with most interesting results
  3. Boostrapping your Agent architecture for rapid development

why build agentsim?

there are many LLM eval frameworks that are already out there. so why are we building another one?

  1. Synthetic Dataset Generation Toolkit

AI Applications are too open ended. Customers don't know what they want. So we need to simulate and understand where our application breaks.

No framework exists for generating synthetic trajectories. Since agents trajectories are variable, we help you increase your test surface area using synthetic trajectories that you can customize with your ideal user persona.

  1. Maximizing Developer Attention

AI Applications can only be aligned by observing the most anomalous trajectories.

We use statistics to analyze multiple trajectories per run and show you the most interesting one first. Test cases that deviate most from the others or trigger/break evaluations are more interesting for development.

  1. Composable Abstractions

Current frameworks lack the base abstractions to evaluate trajectories and also make it hard to customize the evaluators.

For the beginner, we offer several boostrapped stacks for chatbots and agents. For the novice, we offer customization in Evaluators, App Settings, and Trajectory generators. For the expert, we allow developers to build their own console and simulation logic.

Quickstart

Input:

system_prompt = 'be a good legal assistant'
evaluator = TopicScoreEvaluator('tax credits')
user_profile = 'small business owner looking for tax advice'

run(system_prompt, evaluator, user_profile)

Output:

Total cost: 7.663e-05
Total tokens: 1175
Average TPS: 804.7948882714338
Elapsed time: 1.459999332902953

Scores:  [8, 8, 9, 6, 9]
Mean:  8.0
Most interesting Chat:  6

Chat:

 SYSTEM 

 be a good legal assistant 

 USER 

 Let's get straight to it. I'm the owner of a thriving artisanal bakery in the city, and I'm struggling to navigate the complexities of accounting and tax regulations as I look to expand my business. 

 ASSISTANT 

 I'm happy to help! As your legal assistant, I'll do my best to guide you through the process of balancing your books and staying on the right side of the tax authorities.
...

Let's get started and work together to get your bakery's finances in order! 

 USER 

 Thanks for getting straight to it! I appreciate your no-nonsense approach. Here's the rundown on my bakery's financial situation:

Revenues: I'm doing around $800,000 in annual sales, with a steady growth rate over the past few years.
...

 ASSISTANT 

 Let's dive into the details. 
 ...
Some general guidelines for grant reporting:

1. Reporting timelines: Ensure you're aware of the required reporting deadlines and submit your reports on time.
2. Required documents: Keep all necessary documents, such as receipt of grants, invoices, and expenses related to the grant.
3. Compliance: Ensure compliance with the terms and conditions of the grant, including any restrictions or restrictions on the use of the funds.
4. Tracking: Keep accurate records of all grant-related expenses, budget vs. actual, and budget variances to track grant performance.

Let me know if there's anything else you'd like to discuss or if there are any specific areas you'd like me to focus on. 

Now that you have the intersting trajectory, iterate on your system_prompt, user_profile, or evaluator!

how to use agentsim?

to setup a test, we need to define 3 things:

  • generator: a function that generates the input data
  • scorer: a function that scores the output data
  • evaluator: a function that checks if the scores are in your target range

what's next?

features

  • optimal stopping policy
  • versioning based on repo hash & git hash
  • historical scores visualization
  • mock tool calling servers
  • more templated scorers, evaluators, generators
  • support for multiple models
  • typescript version

docs

  • examples, cookbooks, best practices, documentation
  • metrics and reports

Tests: Alignment ! 😡

App: Alignment ? 🧐

Evals: Alignment = 😇

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

agentsim-0.0.2.tar.gz (16.4 kB view details)

Uploaded Source

Built Distribution

agentsim-0.0.2-py3-none-any.whl (18.1 kB view details)

Uploaded Python 3

File details

Details for the file agentsim-0.0.2.tar.gz.

File metadata

  • Download URL: agentsim-0.0.2.tar.gz
  • Upload date:
  • Size: 16.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.8.0

File hashes

Hashes for agentsim-0.0.2.tar.gz
Algorithm Hash digest
SHA256 2c56878747a0ae785df9058c7f60d397f519c0137417a60bac490dca147827ba
MD5 e5e988d6dc3656cdee54ced0bae35b4a
BLAKE2b-256 245df8f7877a415b7571748957d1f4502d697d7469b928decb6df177be3d1eb0

See more details on using hashes here.

File details

Details for the file agentsim-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: agentsim-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 18.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.8.0

File hashes

Hashes for agentsim-0.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 96fa726f93c58fe736ac3a546ad48d85b11557e88b307a08be3f83a81d230486
MD5 85f476748c2b17c76ede14d934fd21ed
BLAKE2b-256 50acff7ce9c835da84c4badae7715045fbfd5b672de0bcc2f8dc5e1696545461

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page