Systematic Chain Improvement and Problem Evaluation
Project description
SCIPE - Systematic Chain Improvement and Problem Evaluation
It helps you find bad nodes in LLM chains.
SCIPE is a powerful tool for evaluating and diagnosing LLM (Large Language Model) graphs or chains. It assesses LLM responses and employs a custom algorithm to identify problematic nodes within the LLM chain.
Features
- Evaluates LLM responses within simple LLM Graphs (mainly LangGraph)
- Diagnoses problematic nodes in LLM graphs
- Provides failure rates of various nodes that make up the LLM chain/graph
- Supports various LLM frameworks (uses LiteLLM underneath the hood)
Why Use SCIPE?
As AI application developers, we often overlook the critical step of evaluating LLM chains during the building phase. SCIPE simplifies this process by allowing developers to run their minimum set of prompts and responses (we recommend atleast 10 examples) through the tool. Within minutes, SCIPE reports back the problematic node in the LLM graph, enabling rapid identification and resolution of issues.
Installation
pip install scipe
Getting Started
You should have a compiled graph (from Langgraph) that you've been using for your LLM application. We'll use the nodes and edges of this graph soon. We also have a couple of examples in the examples_data
folder for you to try out.
We'll read the saved (and compiled) Langgraph using the following and convert the format to a simpler DAG which we'll feed into SCIPE.
from scipe.middleware import convert_edges_to_dag
with open("graph-healthcare.json", 'r') as f:
example_graph = json.load(f)['edges'] # We only need the edges
example_graph = convert_edges_to_dag(example_graph)
from scipe import LLMEvaluator
evaluator = LLMEvaluator(
config_path="config.yml",
responses=data,
graph=example_graph
)
results = evaluator.run_validation().find_problematic_node()
The run_validation()
runs LLM-as-judge on input/output pairs and find_problematic_node()
method traverses through the graph to figure out which node has the highest failure rate. Once it finds the problematic node, the algorithm stops and returns the result.
You can look at the results of the algorithm.
results.to_json()
Output:
{'root_cause': 'pii_insurance',
'debug_path': ['summarizer', 'extractor', 'pii_insurance'],
'node_results': {'summarizer': {'overall_failure_probability': 0.361,
'independent_failure_probability': 0.329,
'conditional_failure_probabilities': {'extractor': 0.476},
'dependencies': ['extractor'],
'is_root_cause': False},
'extractor': {'overall_failure_probability': 0.219,
'independent_failure_probability': 0.191,
'conditional_failure_probabilities': {'pii_insurance': 0.259},
'dependencies': ['pii_insurance'],
'is_root_cause': False},
'pii_insurance': {'overall_failure_probability': 0.27,
'independent_failure_probability': 0.285,
'conditional_failure_probabilities': {'pii_medications': 0.233},
'dependencies': ['pii_medications'],
'is_root_cause': True}}}
Configuration
SCIPE uses a YAML configuration file to set up your LLM graph evaluation. Here's an example of what your config.yaml might look like:
# Example config.yaml
# Where to save the LLM as judge validations for further analysis
PATH_TO_SAVE_VALIDATIONS: "validations.csv"
# Mode name to use for LLM validations
MODEL_NAME: claude-3-haiku-20240307
# Each node name, input and output columns must match the application responses
node_input_output_mappings:
pii_name_number_email:
- prompt-1
- response-1
pii_id:
- prompt-1
- response-2
pii_birthdate:
- prompt-1
- response-3
pii_medications:
- prompt-1
- response-4
pii_insurance:
- prompt-1
- response-5
extractor:
- prompt-1
- response-6
summarizer:
- prompt-1
- response-7
How it works
SCIPE works by analyzing the failure probabilities of nodes in your application graph to identify the most impactful source of failures. The core problem it addresses is:
What node's failures have the biggest impact on the most downstream node's failures?
Here's a breakdown of how SCIPE approaches this problem:
-
LLM as Judge: SCIPE first uses an LLM as a judge to evaluate each node in the application graph:
- For each node, it constructs a prompt using the node's input and output.
- The LLM judge then evaluates whether the node's output is valid given its input.
- This process generates a dataset of node evaluations across a sample of inputs.
-
Failure Analysis: For every node, SCIPE recognizes that failures can occur due to two main reasons:
- Independent failures: The node itself (or the LLM processing it) is the primary cause of the failure.
- Dependent failures: The node fails because one or more of its dependencies have failed, causing a ripple effect.
-
Root Cause Analysis: SCIPE then employs an algorithm to identify the root cause of failures. Here's a high-level pseudocode of the algorithm:
function find_root_cause(node, data, graph): calculate probabilities for node (overall, independent, and dependent) if node has no dependencies or independent failure probability is highest: mark node as root cause return node else: find dependency with highest conditional failure probability recursively call find_root_cause on that dependency function find_problematic_node(data, graph): identify the most downstream node in the graph root_cause = find_root_cause(downstream_node, data, graph) calculate probabilities for all nodes in the graph construct debug trace from downstream node to root cause return EvaluationResult(root_cause, debug_path, node_results)
-
Tracing: As the algorithm traverses the graph from downstream to upstream, it maintains a debug path, providing insights into the flow of failures through the system. The analysis culminates in an
EvaluationResult
object, which includes the identified root cause, the debug path, and detailed results for each node. The results can be easily converted to a JSON format for further analysis or visualization.
Overall, SCIPE analyzes independent and dependent failure probabilities to identify the most impactful problematic node in the system. This helps developers pinpoint and fix issues in their LLM-based application graph, improving overall performance and reliability.
Try it out
Here's a colab notebook try out SCIPE on sample data - demo.ipynb
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file scipe-1.5.2.tar.gz
.
File metadata
- Download URL: scipe-1.5.2.tar.gz
- Upload date:
- Size: 8.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.6
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 9f0032eb6a47c7a62f63585e8445660c07e386ba35702e6d1ab621e8ca8c5b0a |
|
MD5 | 08bd06ff552bd074596bdb47a5a7cd78 |
|
BLAKE2b-256 | aa86beee5261c38e43a3e663cc078d1d8e7662b05f9dd62a751e1f6a0f6c8115 |
File details
Details for the file scipe-1.5.2-py3-none-any.whl
.
File metadata
- Download URL: scipe-1.5.2-py3-none-any.whl
- Upload date:
- Size: 15.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.6
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 4526991440a04a839abb5dc1f0e474131bff5dd41b416bf7b73adb7368d91324 |
|
MD5 | e7887ce7c1146a93ae9a2187bebc7304 |
|
BLAKE2b-256 | 3418003183a421a3c1a1d21cfc52da0f50cfef18c1eae0b4132c485f64c90776 |