
SDK to write and run tests for your LLM app

Project description

Magik is an LLM Observability SDK that helps you write tests and monitor your app in production.

Overview

Reliability of output is one of the biggest challenges for people trying to use LLM apps in production.

LLM responses are non-deterministic by nature. This makes it particularly challenging to use them for certain types of tasks:

  • If you're building an AI assistant that answers legal questions, you cannot afford hallucinations or misinformation.
  • If you're building a code generation AI, you need to make sure the generated code is correct and works as expected.
  • If you're building a customer support agent, you need to make sure it responds with accurate answers in a specified format and does not leak sensitive information like PII.

We are trying to solve these problems with a test-driven approach towards LLM observability.


Use Cases

Who is this product meant for?

  • If you're in the early stages of building an LLM app
  • If you have an LLM app in production

If you're in the early stages of building an LLM app:


Test-driven development can significantly speed up your iteration cycle and helps you engineer prompts that are more robust.

For example, you can write tests like this:

# Test that the output contains none of the restricted keywords
# (contains_none and restricted_keywords are assumed to be imported/defined elsewhere)
{
    "description": "output does not contain restricted keywords",
    "eval_function": contains_none,
    "vars": {},
    "args": [restricted_keywords],
    "failure_labels": ["contains_restricted_words", "critical"],
},
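
To make this concrete, here is a minimal, self-contained sketch of how a test entry like the one above could be evaluated against an LLM output. Note that contains_none and run_tests below are stand-ins written for this illustration, not necessarily the implementations that ship with the SDK.

# Illustrative stand-in for an eval function: pass if the output contains
# none of the given keywords (case-insensitive).
def contains_none(output, keywords):
    lowered = output.lower()
    return not any(keyword.lower() in lowered for keyword in keywords)

restricted_keywords = ["guaranteed returns", "confidential", "social security number"]

tests = [
    {
        "description": "output does not contain restricted keywords",
        "eval_function": contains_none,
        "vars": {},
        "args": [restricted_keywords],
        "failure_labels": ["contains_restricted_words", "critical"],
    },
]

# Illustrative runner: apply each test's eval function to the output
# and record which failure labels apply.
def run_tests(llm_output, tests):
    results = []
    for test in tests:
        passed = test["eval_function"](llm_output, *test["args"])
        results.append({
            "description": test["description"],
            "passed": passed,
            "failure_labels": [] if passed else test["failure_labels"],
        })
    return results

print(run_tests("We cannot share confidential client data.", tests))
# -> [{'description': 'output does not contain restricted keywords', 'passed': False,
#      'failure_labels': ['contains_restricted_words', 'critical']}]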



If you have an LLM app in production:


You can use our evaluation & monitoring platform to:

  • Observe prompt/response pairs in production, and analyze response times, cost, token usage, and more for different prompts and date ranges.

  • Evaluate your production responses against your own tests to get a quantifiable understanding of how well your LLM app is performing.

    • For example, you can run the tests you defined against the LLM responses you are getting in production to measure how your app performs with real data (see the sketch after this list).
  • Filter by failure labels, severity, prompt, etc. to identify the different types of errors occurring in your LLM outputs.
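
As an illustration of the kind of analysis this enables (a sketch, not the platform itself), suppose each production response has been evaluated against your tests and the results logged. You can then compute a pass rate and count failures by label:

from collections import Counter

# Hypothetical log records: one entry per evaluated production response.
logged_results = [
    {"prompt_id": "support-reply", "passed": True,  "failure_labels": []},
    {"prompt_id": "support-reply", "passed": False, "failure_labels": ["contains_restricted_words", "critical"]},
    {"prompt_id": "support-reply", "passed": True,  "failure_labels": []},
]

pass_rate = sum(r["passed"] for r in logged_results) / len(logged_results)

# Count how often each failure label occurs to surface the most common error types.
label_counts = Counter(label for r in logged_results for label in r["failure_labels"])

print(f"pass rate: {pass_rate:.0%}")   # pass rate: 67%
print(label_counts.most_common())      # [('contains_restricted_words', 1), ('critical', 1)]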


Upcoming Features


Soon, you will also be able to:

  • Fail bad outputs before they reach your users.

    • For example, if an LLM response contains sensitive information like PII, you can detect that in real time and cut it off before it reaches the end user (a conceptual sketch follows this list).
  • Set up alerts that notify you about critical errors in production.
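
A conceptual sketch of the first idea, assuming a simple regex-based PII check (this illustrates the approach and is not a released magik feature):

import re

# Patterns for common PII; a real system would use a more thorough detector.
PII_PATTERNS = [
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),          # email addresses
    re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),      # US-style phone numbers
]

def guard_response(response, fallback="Sorry, I can't share that information."):
    # Block the response if any PII pattern matches; otherwise pass it through.
    if any(pattern.search(response) for pattern in PII_PATTERNS):
        return fallback
    return response

print(guard_response("You can reach John at john.doe@example.com"))
# -> "Sorry, I can't share that information."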



Platform

Contact us at hello@magiklabs.app to get access to our LLM observability platform where you can run the tests you've defined here against your LLM responses in production.



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

magik-0.1.2.tar.gz (12.0 kB)

Built Distribution

magik-0.1.2-py3-none-any.whl (24.5 kB)
