Generate synthetic data from seed datasets.

Project description

FiddleCube - Generate golden evaluation datasets for RAG

Test and evaluate your RAG system and get an actionable diagnosis of its top problems. Systematically improve the prompt, the RAG pipeline, and the data points in the RAG storage.

Evaluation dilemma - scale vs effectiveness

Evals are either highly effective but hard to scale (human evaluation and manual QA), or they scale well while being unreliable (LLM-as-a-judge).

Data driven evaluation

Evaluating an LLM effectively needs:

  1. An accurate representation of the possible queries and usages of the LLM and the RAG database.
  2. Gold-standard responses to those queries, grounded in knowledge.
  3. Evaluation of outputs against subjective rules and business logic.
  4. The scale to accommodate fast-paced dev teams, ideally as part of CI/CD (a test-suite sketch follows this list).
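
For the CI/CD requirement in point 4, one lightweight pattern is to save a generated golden dataset to disk and replay it against your system in a test suite on every commit. A minimal pytest sketch, where answer_with_rag() is a hypothetical wrapper around your own pipeline and golden_dataset.json stores a saved response in the shape returned by fc.generate (shown under Usage below):

import json

def answer_with_rag(query: str) -> str:
    """Hypothetical wrapper around your own RAG pipeline."""
    raise NotImplementedError

def test_against_golden_dataset():
    # golden_dataset.json holds a saved fc.generate response (shape shown under Usage).
    with open("golden_dataset.json") as f:
        golden = json.load(f)
    for item in golden["results"]:
        answer = answer_with_rag(item["query"])
        # Exact match is a deliberately naive check; swap in your own similarity metric.
        assert answer.strip() == item["answer"].strip(), item["query"]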

Evaluating using a golden dataset - step by step

  1. Pass a list of chunks from your vector DB to the FiddleCube generate API.
  2. FiddleCube generates a wide range of queries and answers across 7 question types, including multi-turn conversations, QA, and complex reasoning.
  3. Use the golden dataset to evaluate real user interactions with your LLM.
  4. FiddleCube diagnoses failed queries with a step-by-step analysis of the output (see the sketch after this list).
  5. Pinpoint the exact root cause of each failure.
  6. Get actionable prompt improvements or RAG updates.
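
To make steps 3 and 4 concrete, a minimal sketch that scores live answers against the golden ones and forwards misses to the diagnose API. Here dataset and fc are the objects from the Usage section below; my_rag_answer() and SYSTEM_PROMPT are hypothetical stand-ins for your own system, and the overlap metric and threshold are arbitrary examples:

def overlap(answer: str, golden: str) -> float:
    # Crude token-overlap score; a stand-in for a real similarity metric.
    golden_tokens = set(golden.lower().split())
    return len(golden_tokens & set(answer.lower().split())) / max(len(golden_tokens), 1)

for item in dataset["results"]:
    answer = my_rag_answer(item["query"])  # hypothetical: your RAG system
    if overlap(answer, item["answer"]) < 0.5:  # arbitrary threshold
        # Forward the miss to the diagnose API (payload shape shown below).
        report = fc.diagnose({
            "query": item["query"],
            "answer": answer,
            "prompt": SYSTEM_PROMPT,  # hypothetical: your system prompt
            "context": item["contexts"],
        })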

Installation

pip3 install fiddlecube

Usage

Generate data

from fiddlecube import FiddleCube

# Authenticate with your FiddleCube API key.
fc = FiddleCube(api_key="<api-key>")

# Generate a golden dataset of 10 question-answer pairs from the seed chunks.
dataset = fc.generate(
    [
        "The cat did not want to be petted.",
        "The cat was not happy with the owner's behavior.",
    ],
    10,
)
print(dataset)

Example output:
{
   "results":[
      {
         "query":"Question: Why did the cat not want to be petted?",
         "contexts":[
            "The cat did not want to be petted."
         ],
         "answer":"The cat did not want to be petted because it was not in the mood for physical affection at that moment.",
         "score":0.8,
         "question_type":"SIMPLE"
      },
      {
         "query":"Was the cat pleased with the owner's actions?",
         "contexts":[
            "The cat was not happy with the owner's behavior."
         ],
         "answer":"No, the cat was not pleased with the owner's actions.",
         "score":0.8,
         "question_type":"NEGATIVE"
      }
   ],
   "status":"COMPLETED",
   "num_tokens_generated":44,
   "rate_limited":false
}
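
The results list can be sliced by question type or score before use. A small sketch based only on the response shape shown above; the 0.8 cutoff is an arbitrary example, not a documented threshold:

from collections import defaultdict

# Bucket the generated QA pairs by question type, keeping only high-scoring ones.
by_type = defaultdict(list)
for item in dataset["results"]:
    if item["score"] >= 0.8:  # arbitrary cutoff
        by_type[item["question_type"]].append(item)

print({qtype: len(items) for qtype, items in by_type.items()})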

Diagnose Failures

Pass a set of:

  • query
  • prompt
  • contexts
  • answer

Get a step-by-step diagnosis of how the LLM came up with the answer. Trace through the prompt and the RAG query to root-cause the failure. Iteratively improve the prompt and RAG accuracy to build a robust system.

# Diagnose a single query/prompt/context/answer tuple.
diagnosis = fc.diagnose({
    "query": "What is the capital of France?",
    "answer": "Paris",
    "prompt": "You are an expert at answering hard questions.",
    "context": ["Paris is the capital of France."],
})
print("==diagnosis==", diagnosis)

Example output:
{
  "dataset": [
    {
      "c_b_a": "0",
      "related": "0",
      "t_c": "answering hard questions",
      "t_q": "capital of France",
      "sum": "The context is about answering hard questions, which is not related to the query about the capital of France.",
      "info": "Include information about France or its capital in the context.",
      "data": {
        "query": "What is the capital of France?",
        "answer": "Paris",
        "prompt": "You are an expert at answering hard questions.",
        "context": ["Paris is the capital of France."]
      },
      "gen_ans": "Paris",
      "is_similar": "1",
      "summary": "The query asks for the capital of France, which is a straightforward question. Given the context that the assistant is an expert at answering hard questions, it is logical to deduce that the assistant should know basic geographical facts such as the capital of France. Therefore, the answer to the query is 'Paris'."
    }
  ]
}
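
The field names in the diagnosis are terse; judging only from the example above, info carries the suggested fix and summary the reasoning trace, so a minimal loop to surface them might look like:

# Surface the suggested fix and the reasoning for each diagnosed row.
for row in diagnosis["dataset"]:
    print("Suggested fix:", row["info"])
    print("Reasoning:", row["summary"])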

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fiddlecube-0.1.1.tar.gz (2.9 kB)

Uploaded Source

Built Distribution

fiddlecube-0.1.1-py3-none-any.whl (3.4 kB)

Uploaded Python 3

File details

Details for the file fiddlecube-0.1.1.tar.gz.

File metadata

  • Download URL: fiddlecube-0.1.1.tar.gz
  • Upload date:
  • Size: 2.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.1 CPython/3.11.4 Darwin/22.3.0

File hashes

Hashes for fiddlecube-0.1.1.tar.gz
Algorithm Hash digest
SHA256 7c05c5938f9ff2c17ef1c5a713b59ff7250e00a3fdf7c8cf34f4b64e1991529a
MD5 a802afca8d3567f234cffc530f434a28
BLAKE2b-256 c86a3308318674f28a8adbf90b931f1dbec7650d80490303e2db0d6a07bcdeae


File details

Details for the file fiddlecube-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: fiddlecube-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 3.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.1 CPython/3.11.4 Darwin/22.3.0

File hashes

Hashes for fiddlecube-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 ac363e4af1aa276a1aa74e8413be338ff72cacb8ae2e7158ade2b28935b4e631
MD5 9460b9bffb21a17d653ef0a92b932a28
BLAKE2b-256 cd3945636a8dbbb7e65583f79c386cdef7abb13322422c50b2ae54097f66140a

