SCIPE - Systematic Chain Improvement and Problem Evaluation

It helps you find bad nodes in LLM chains.

SCIPE is a powerful tool for evaluating and diagnosing LLM (Large Language Model) graphs or chains. It assesses LLM responses and employs a custom algorithm to identify problematic nodes within the LLM chain.

Features

  • Evaluates LLM responses within simple LLM Graphs (mainly LangGraph)
  • Diagnoses problematic nodes in LLM graphs
  • Provides failure rates of various nodes that make up the LLM chain/graph
  • Supports various LLM frameworks (uses LiteLLM under the hood)

Why Use SCIPE?

As AI application developers, we often overlook the critical step of evaluating LLM chains during the building phase. SCIPE simplifies this process by letting developers run a minimal set of prompts and responses (we recommend at least 10 examples) through the tool. Within minutes, SCIPE reports the problematic node in the LLM graph, so issues can be identified and resolved quickly.

Installation

pip install scipe

Getting Started

You should have a compiled graph (from LangGraph) that you've been using for your LLM application. We'll use the nodes and edges of this graph soon. We also have a couple of examples in the examples_data folder for you to try out.

We'll read the saved (and compiled) LangGraph graph with the following code and convert it to a simpler DAG format, which we'll feed into SCIPE.

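import json

from scipe import LLMEvaluator
from scipe.middleware import convert_edges_to_dag

# Load the saved graph; we only need the edges
with open("graph-healthcare.json", "r") as f:
    example_graph = json.load(f)["edges"]

# Convert the LangGraph edges into the simpler DAG format that SCIPE takes
example_graph = convert_edges_to_dag(example_graph)

# `data` holds your application's prompts and responses (loading is not shown here)
evaluator = LLMEvaluator(
    config_path="config.yml",
    responses=data,
    graph=example_graph,
)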

results = evaluator.run_validation().find_problematic_node()

The run_validation() method runs an LLM as a judge on the input/output pairs, and the find_problematic_node() method traverses the graph to determine which node has the highest failure rate. Once it finds the problematic node, the algorithm stops and returns the result.

You can look at the results of the algorithm.

results.to_json()
Output: 

{'root_cause': 'pii_insurance',
 'debug_path': ['summarizer', 'extractor', 'pii_insurance'],
 'node_results': {'summarizer': {'overall_failure_probability': 0.361,
   'independent_failure_probability': 0.329,
   'conditional_failure_probabilities': {'extractor': 0.476},
   'dependencies': ['extractor'],
   'is_root_cause': False},
  'extractor': {'overall_failure_probability': 0.219,
   'independent_failure_probability': 0.191,
   'conditional_failure_probabilities': {'pii_insurance': 0.259},
   'dependencies': ['pii_insurance'],
   'is_root_cause': False},
  'pii_insurance': {'overall_failure_probability': 0.27,
   'independent_failure_probability': 0.285,
   'conditional_failure_probabilities': {'pii_medications': 0.233},
   'dependencies': ['pii_medications'],
   'is_root_cause': True}}}
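
The returned dictionary can be inspected with plain Python. The sketch below copies a subset of the output above into a literal and walks the debug path from the most downstream node to the root cause. The names results_dict and summarize_debug_path are hypothetical and are not part of SCIPE.

```python
# Subset of the output shown above, copied into a literal for illustration.
results_dict = {
    "root_cause": "pii_insurance",
    "debug_path": ["summarizer", "extractor", "pii_insurance"],
    "node_results": {
        "summarizer": {
            "overall_failure_probability": 0.361,
            "independent_failure_probability": 0.329,
        },
        "extractor": {
            "overall_failure_probability": 0.219,
            "independent_failure_probability": 0.191,
        },
        "pii_insurance": {
            "overall_failure_probability": 0.27,
            "independent_failure_probability": 0.285,
        },
    },
}


def summarize_debug_path(result):
    """Return (node, overall, independent) tuples in debug-path order."""
    rows = []
    for node in result["debug_path"]:
        stats = result["node_results"][node]
        rows.append(
            (
                node,
                stats["overall_failure_probability"],
                stats["independent_failure_probability"],
            )
        )
    return rows


for node, overall, independent in summarize_debug_path(results_dict):
    marker = " <- root cause" if node == results_dict["root_cause"] else ""
    print(f"{node}: overall={overall:.3f}, independent={independent:.3f}{marker}")
```

Reading the path in this order shows how the traversal moved upstream before stopping at pii_insurance.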

Configuration

SCIPE uses a YAML configuration file to set up your LLM graph evaluation. Here's an example of what your config.yml might look like:

# Example config.yml

# Where to save the LLM as judge validations for further analysis
PATH_TO_SAVE_VALIDATIONS: "validations.csv"

# Model name to use for LLM validations
MODEL_NAME: claude-3-haiku-20240307

# Each node name, and its input and output columns, must match the application responses
node_input_output_mappings:
  pii_name_number_email:
    - prompt-1
    - response-1
  pii_id:
    - prompt-1
    - response-2
  pii_birthdate:
    - prompt-1
    - response-3
  pii_medications:
    - prompt-1
    - response-4
  pii_insurance:
    - prompt-1
    - response-5
  extractor:
    - prompt-1
    - response-6
  summarizer:
    - prompt-1
    - response-7
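
Because the column names in the mapping must match the application responses, it can help to check them before running an evaluation. The following is a minimal sketch, assuming the responses are tabular with named columns and, purely for illustration, stored as CSV. The missing_columns helper, the sample CSV text, and the dict form of the mapping are hypothetical and not part of SCIPE.

```python
import csv
import io

# Hypothetical excerpt of node_input_output_mappings, written as a Python dict.
mappings = {
    "extractor": ["prompt-1", "response-6"],
    "summarizer": ["prompt-1", "response-7"],
}

# Hypothetical responses file; real data would have one row per example.
responses_csv = io.StringIO(
    "prompt-1,response-6,response-7\n"
    "Summarize the record,extracted fields,short summary\n"
)


def missing_columns(mappings, available_columns):
    """Map each node to the configured columns that are absent from the data."""
    available = set(available_columns)
    missing = {}
    for node, columns in mappings.items():
        absent = [column for column in columns if column not in available]
        if absent:
            missing[node] = absent
    return missing


# The first CSV row is the header, which lists the available column names.
header = next(csv.reader(responses_csv))
print(missing_columns(mappings, header))  # prints {}
```

An empty result means every configured column was found. A non-empty result names the nodes whose input or output column is missing.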

How it works

SCIPE works by analyzing the failure probabilities of nodes in your application graph to identify the most impactful source of failures. The core problem it addresses is:

Which node's failures have the biggest impact on the failures of the most downstream node?

Here's a breakdown of how SCIPE approaches this problem:

  1. LLM as Judge: SCIPE first uses an LLM as a judge to evaluate each node in the application graph:

    • For each node, it constructs a prompt using the node's input and output.
    • The LLM judge then evaluates whether the node's output is valid given its input.
    • This process generates a dataset of node evaluations across a sample of inputs.
  2. Failure Analysis: For every node, SCIPE recognizes that failures can occur due to two main reasons:

    • Independent failures: The node itself (or the LLM processing it) is the primary cause of the failure.
    • Dependent failures: The node fails because one or more of its dependencies have failed, causing a ripple effect.
  3. Root Cause Analysis: SCIPE then employs an algorithm to identify the root cause of failures. Here's a high-level pseudocode of the algorithm:

    function find_root_cause(node, data, graph):
        calculate probabilities for node (overall, independent, and dependent)
        if node has no dependencies or independent failure probability is highest:
            mark node as root cause
            return node
        else:
            find dependency with highest conditional failure probability
            return the result of recursively calling find_root_cause on that dependency
    
    function find_problematic_node(data, graph):
        identify the most downstream node in the graph
        root_cause = find_root_cause(downstream_node, data, graph)
        calculate probabilities for all nodes in the graph
        construct debug trace from downstream node to root cause
        return EvaluationResult(root_cause, debug_path, node_results)
    
  4. Tracing: As the algorithm traverses the graph from downstream to upstream, it maintains a debug path, providing insights into the flow of failures through the system. The analysis culminates in an EvaluationResult object, which includes the identified root cause, the debug path, and detailed results for each node. The results can be easily converted to a JSON format for further analysis or visualization.
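
The pseudocode above can be sketched in Python as follows. This is an illustration under stated assumptions, not SCIPE's actual implementation: the evaluations are assumed to be a list of dicts mapping node name to True (passed) or False (failed), the graph is assumed to map each node to its list of dependencies, the independent failure probability is assumed to be the failure rate when all dependencies passed, and the conditional failure probability is assumed to be the failure rate when a given dependency failed. All function names here are hypothetical.

```python
def failure_rate(rows, node):
    """Fraction of rows in which `node` failed; 0.0 when there are no rows."""
    if not rows:
        return 0.0
    return sum(not row[node] for row in rows) / len(rows)


def node_probabilities(node, data, graph):
    """Overall, independent, and per-dependency conditional failure rates."""
    deps = graph.get(node, [])
    # Assumption: "independent" means the node failed although all deps passed.
    all_deps_pass = [row for row in data if all(row[d] for d in deps)]
    # Assumption: "conditional" means the node's failure rate when a dep failed.
    conditional = {
        dep: failure_rate([row for row in data if not row[dep]], node)
        for dep in deps
    }
    return {
        "overall": failure_rate(data, node),
        "independent": failure_rate(all_deps_pass, node),
        "conditional": conditional,
    }


def find_root_cause(node, data, graph, path=None):
    """Walk upstream until a node's own failures dominate; return (node, path)."""
    path = (path or []) + [node]
    probs = node_probabilities(node, data, graph)
    conditional = probs["conditional"]
    if not conditional or probs["independent"] >= max(conditional.values()):
        return node, path
    worst_dep = max(conditional, key=conditional.get)
    return find_root_cause(worst_dep, data, graph, path)


def most_downstream(graph):
    """Return a node that no other node depends on."""
    upstream = {dep for deps in graph.values() for dep in deps}
    return [node for node in graph if node not in upstream][0]


# Toy data: the summarizer only fails when the extractor has failed.
graph = {"summarizer": ["extractor"], "extractor": []}
data = [
    {"extractor": True, "summarizer": True},
    {"extractor": False, "summarizer": False},
    {"extractor": False, "summarizer": False},
    {"extractor": True, "summarizer": True},
    {"extractor": False, "summarizer": True},
]

root, path = find_root_cause(most_downstream(graph), data, graph)
print(root, path)  # extractor ['summarizer', 'extractor']
```

In this toy data the summarizer never fails when the extractor passes, so the traversal moves upstream and stops at the extractor, which has no dependencies.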

Overall, SCIPE analyzes independent and dependent failure probabilities to identify the most impactful problematic node in the system. This helps developers pinpoint and fix issues in their LLM-based application graph, improving overall performance and reliability.

Try it out

Here's a Colab notebook to try out SCIPE on sample data: demo.ipynb
