
Small Language Models Evaluation Suite for RAG Systems

A lightweight evaluation framework powered by tiny (really tiny) 0.6B models. It runs 100% locally on CPU, GPU, or MPS: attach any vector DB connection and run, fast and free.

Evaluation tools that rely on an LLM-as-a-judge or an external API are costly and do not scale easily. smallevals evaluates in seconds on a GPU and in minutes on any CPU!

Evaluate Retrieval

Evaluating a RAG system covers both the retrieval stage and the generation stage. smallevals sets out to test retrieval now, and RAG answers in the near future!

Models

| Model Name | Task | Status | Link |
|---|---|---|---|
| QAG-0.6B | Generate golden Q/A from chunks or docs (synthetic evaluation data) | Available | 🤗 |
| CRC-0.6B | Context relevance classifier (question ↔ retrieved chunk) | Incoming | |
| GJ-0.6B | Groundedness / faithfulness judge (answer ↔ context) | Incoming | |
| ASM-0.6B | Answer correctness / semantic similarity | Incoming | |

Current Focus: Retrieval evaluation (QAG-0.6B). Once the model reliably generates correct answers and better questions for RAG (it does, but there is still room for improvement), it will become the first model in a pipeline leading to the RAG generation evaluation models (CRC-0.6B, GJ-0.6B, ASM-0.6B), which are future work.

How does it work?

The question generator model reads one of your chunks and writes a question on the assumption that this chunk is the one that answers it. smallevals then tries to match the question back to that chunk via a vector DB query.

This lets you directly test the retrieval pipeline behind your RAG system. However complex the system is, you will know whether your vector queries work well.
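The round trip described above can be sketched in plain Python. This is a minimal illustration, not the smallevals implementation: `toy_search` is a hypothetical bag-of-words stand-in for a real vector DB query, and the questions are hand-written stand-ins for what a question generator would produce. All function names here are assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def toy_search(question, chunks, top_k):
    """Hypothetical stand-in for a vector DB query: ids of the top_k most similar chunks."""
    q = Counter(tokenize(question))
    ordered = sorted(chunks, key=lambda cid: cosine(q, Counter(tokenize(chunks[cid]))), reverse=True)
    return ordered[:top_k]


def round_trip_ranks(chunks, questions, top_k):
    """Search with each chunk's question; record the 1-based rank of the source chunk, or None if missed."""
    ranks = {}
    for cid, question in questions.items():
        hits = toy_search(question, chunks, top_k)
        ranks[cid] = hits.index(cid) + 1 if cid in hits else None
    return ranks


chunks = {
    "c1": "The Eiffel Tower was completed in 1889 in Paris.",
    "c2": "Python is a programming language created by Guido van Rossum.",
    "c3": "The Great Wall of China is thousands of kilometres long.",
}
# Hand-written questions standing in for the output of a question generator.
questions = {
    "c1": "When was the Eiffel Tower completed?",
    "c2": "Who created the Python programming language?",
    "c3": "How long is the Great Wall of China?",
}
ranks = round_trip_ranks(chunks, questions, top_k=2)
print(ranks)
```

A rank of `None` for a chunk means the question written from that chunk failed to bring it back within `top_k`, which is the signal a retrieval test of this kind looks for.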

Why is this needed?

Other frameworks that depend on external APIs are costly and hard to scale, although they are better (for now).

Installation

pip install smallevals

Quick Start

Evaluate Retrieval Quality (Python)

Connect to your favourite vector DB (Milvus, Elastic, PGVector, Chroma, Pinecone, FAISS, Weaviate), attach your favourite embeddings, generate questions, and visualise the results!

Under the hood, smallevals generates one question per chunk, tries to retrieve that chunk as the single relevant document, and calculates scores.

from smallevals import evaluate_retrievals, SmallEvalsVDBConnection

# chroma_client and embedding are assumed to exist already:
# your vector DB client and the embedding model used to index the collection.
vdb = SmallEvalsVDBConnection(
    connection=chroma_client,
    collection="my_collection",
    embedding=embedding
)

# Generate a question for each of 200 chunks, then try to retrieve each chunk.
result = evaluate_retrievals(connection=vdb, top_k=10, n_chunks=200)

Then explore the results:

smallevals dash --host 0.0.0.0 --port 8050 --debug
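The text above does not say which scores smallevals calculates. As an assumption for illustration, the sketch below computes two metrics commonly used for single-relevant-document retrieval, hit rate at k and mean reciprocal rank, from a list of ranks. The function names and the sample ranks are hypothetical and are not taken from the smallevals API.

```python
def hit_rate_at_k(ranks, k):
    """Fraction of questions whose source chunk appears within the top k results."""
    if not ranks:
        return 0.0
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all questions, counting a missed chunk as 0."""
    if not ranks:
        return 0.0
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# 1-based rank of the source chunk for five generated questions.
# None means the chunk did not appear in the returned results.
ranks = [1, 3, None, 2, 1]

print(hit_rate_at_k(ranks, k=1))
print(hit_rate_at_k(ranks, k=3))
print(mean_reciprocal_rank(ranks))
```

Hit rate rewards any appearance within the cutoff, while mean reciprocal rank also rewards placing the source chunk near the top, so the two are usually read together.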

Generate QA from Documents (CLI)

smallevals generate_qa --docs-dir ./documents --num-questions 100

### QAG-0.6B

The model was trained on TriviaQA, SQuAD 2.0, and hand-curated synthetic data generated using Qwen-70B. It generates a question from a chunk or document. An example prompt and response:

Given the passage below, extract ONE question/answer pair grounded strictly in a single atomic fact.

PASSAGE:
"Eiffel tower is built at 1889"

Return ONLY a JSON object.
{
  "question": "When was the Eiffel Tower completed?",
  "answer": "1889"
}
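Small models do not always return clean JSON, so downstream code may want to validate the response before using it. The sketch below is one possible way to do that with the standard library; `parse_qa` is a hypothetical helper written for this example and is not part of smallevals.

```python
import json


def parse_qa(raw):
    """Extract a {"question", "answer"} pair from raw model text, or return None if malformed."""
    # Tolerate extra text around the JSON object by slicing from the first "{" to the last "}".
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        obj = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    question, answer = obj.get("question"), obj.get("answer")
    # Both fields must be present, be strings, and be non-empty.
    if not (isinstance(question, str) and isinstance(answer, str)):
        return None
    if not question.strip() or not answer.strip():
        return None
    return {"question": question.strip(), "answer": answer.strip()}


raw_output = 'Sure! {"question": "When was the Eiffel Tower completed?", "answer": "1889"}'
print(parse_qa(raw_output))
print(parse_qa("no json here"))
```

Returning `None` instead of raising lets a batch job skip a malformed generation and move on to the next chunk.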

Known issues:

  • The model is trained on text/wiki data, so it is biased towards well-structured text.
  • The dataset contains some questions that are too generic; it will be crafted more carefully in v3.

### Other Models:

Other models will be trained to eliminate the need for external LLMs.

  • CRC-0.6B: Context relevance classifier (question ↔ retrieved chunk)
  • GJ-0.6B: Groundedness / faithfulness judge (answer ↔ context)
  • ASM-0.6B: Answer correctness / semantic similarity
