Muzlin: a filtering toolset for semantic machine learning

When a filter cloth 🏳️ is needed rather than a simple RAG 🏴‍☠

What is it?

Muzlin merges classical ML techniques with complex generative AI. Its goal is to apply simple, efficient, and effective methods for filtering many aspects of the generative text pipeline. These methods address the following questions:

  • Does a RAG/GraphRAG have any context to answer the user’s question?

  • Does the retrieved context contain good candidates to provide a complete answer (e.g. is the retrieved context too dense or too sparse)?

  • Does the generated LLM response deviate from the provided context? (Hallucination)

  • Given a collection of questions, should an extracted portion of text be added to an existing RAG with respect to its ability to answer any of the questions in the collection?

  • Given an existing RAG, what is the probability that a new portion of text belongs to the RAG cluster?

  • Given a collection of embedded text (e.g. context, user question and answers, synthetic generated data, etc…), what components are considered inliers and outliers?
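
As a rough illustration of the hallucination question above, one simple way to flag a response that drifts from its context is to compare embedding vectors by cosine similarity. The sketch below is not Muzlin's implementation: the function names and the 0.5 threshold are assumptions chosen for illustration, and the toy three-dimensional vectors stand in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def deviates_from_context(response_vec, context_vecs, threshold=0.5):
    """Return True when the response is not close to any context vector.

    The threshold is a hypothetical value; it would need tuning per encoder.
    """
    best = max(cosine_similarity(response_vec, c) for c in context_vecs)
    return best < threshold


# Toy vectors standing in for embedded context passages and responses
context = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]]
on_topic = [0.9, 0.1, 0.0]
off_topic = [0.0, 0.0, 1.0]

print(deviates_from_context(on_topic, context))
print(deviates_from_context(off_topic, context))
```

A fixed similarity cut-off like this is the crudest option; the anomaly detection approach described in the Quickstart learns its threshold from the collection instead.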

Muzlin is dynamic and production-ready, and can be added as a decision-making layer to any LLM or agentic process flow.

Note: while Muzlin is production-ready, it is still under active development and is subject to significant changes!

Quickstart

To get started, install Muzlin with pip:

pip install muzlin

To compare text, we first need to create a base of information. To do this, we need a collection of text embeddings:

import numpy as np
from muzlin.encoders import HuggingFaceEncoder

encoder = HuggingFaceEncoder()

# texts is a list of str: the documents that make up the base collection
vectors = encoder(texts)
vectors = np.array(vectors)  # one row per text
np.save('vectors', vectors)  # written to disk as vectors.npy
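
The save step above writes to 'vectors' while the next step loads 'vectors.npy'. That works because NumPy appends the .npy extension when it is missing. The short check below demonstrates the round trip; the random array and its (5, 384) shape are assumptions standing in for real encoder output.

```python
import os
import tempfile

import numpy as np

# Stand-in for encoder output: 5 texts embedded in 384 dimensions (random values)
vectors = np.random.default_rng(0).normal(size=(5, 384))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'vectors')
    np.save(path, vectors)            # NumPy appends the .npy extension
    saved_path = path + '.npy'
    file_exists = os.path.exists(saved_path)
    loaded = np.load(saved_path)      # read back before the directory is removed

print(file_exists, loaded.shape)
```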

Next we will construct an unsupervised anomaly detection model using the embedded vectors:

import mlflow as ml # optional
import numpy as np
from muzlin.anomaly import OutlierDetector
from pyod.models.pca import PCA

# Read in vectors
vectors = np.load('vectors.npy')

# Initialize OD and thresholding model
od = PCA(contamination=0.02)

ml.set_experiment('outlier_model')
clf = OutlierDetector(mlflow=True, detector=od)
clf.fit(vectors)
ml.end_run()

This anomaly model can either be logged using MLflow or simply saved as a joblib file.
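
The contamination value passed to the PyOD detector is the expected fraction of outliers in the training data, and to our understanding PyOD derives its decision threshold from the matching percentile of the training scores. The sketch below illustrates that idea with NumPy only. It uses distance from the centroid as a stand-in outlier score, which is an assumption made for illustration and not the PCA-based score that PyOD computes; the synthetic data and variable names are likewise hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
vectors = rng.normal(size=(1000, 16))   # synthetic stand-in for embeddings
contamination = 0.02

# Stand-in outlier score: Euclidean distance from the collection centroid
centroid = vectors.mean(axis=0)
scores = np.linalg.norm(vectors - centroid, axis=1)

# Threshold at the (1 - contamination) percentile of the training scores
threshold = np.percentile(scores, 100 * (1 - contamination))
labels = (scores > threshold).astype(int)   # 0 = inlier, 1 = outlier

print(int(labels.sum()), 'of', len(labels), 'training vectors flagged')
```

A lower contamination value raises the threshold, so fewer incoming texts are rejected; a higher value makes the filter stricter.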

Note that a simpler encoder (e.g. 384 dimensions) leads to a “fuzzy” outlier detector that is generally less strict and increases the probability that new text will be judged similar to the embedded collection of text. Higher-dimension encoder models can be used for a dense embedded space (e.g. over 2000 vectors) or for strict settings (e.g. medicine), but note that embedding time increases as well. Also, small text collections (fewer than 100 texts) or collections with a wide range of topics may degrade the filtering capabilities.
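
The rules of thumb in this note can be turned into a quick check before fitting a detector. The helper below is a hypothetical sketch and not part of Muzlin; the cut-offs (100 texts, 2000 vectors, 384 dimensions) are taken from the note and should be read as rough guidance rather than hard limits.

```python
def collection_warnings(n_texts, embedding_dim):
    """Return advisory messages for an embedded text collection.

    Hypothetical helper: the cut-offs mirror the rules of thumb in the note.
    """
    warnings = []
    if n_texts < 100:
        warnings.append('fewer than 100 texts: filtering may be unreliable')
    if n_texts > 2000 and embedding_dim <= 384:
        warnings.append('dense collection with a small encoder: '
                        'consider a higher-dimension model')
    return warnings


print(collection_warnings(50, 384))
print(collection_warnings(5000, 384))
print(collection_warnings(500, 768))
```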

Now that we have an anomaly model, we can filter new incoming text. Here is an example for a RAG setting:

import numpy as np
from muzlin.anomaly import OutlierDetector
from muzlin.encoders import HuggingFaceEncoder

# Preload trained model - or load with joblib
clf = OutlierDetector(model='outlier_detector.pkl')

# Encode question
encoder = HuggingFaceEncoder()

vector = encoder(['Who was the first man to walk on the moon?'])
vector = np.array(vector).reshape(1,-1) # Must be 2D

# Get a binary inlier 0 or outlier 1 output
label = clf.predict(vector)
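
One way to use the binary label is as a gate in front of retrieval and generation, so that out-of-scope questions never reach the LLM. The sketch below is a minimal illustration under that assumption: answer_or_decline, fake_retrieve and fake_generate are hypothetical names, and the two toy functions stand in for a real retriever and a real LLM call.

```python
def answer_or_decline(label, question, retrieve, generate):
    """Run retrieval and generation only when the question is an inlier."""
    if label == 1:  # outlier: the collection has no context for this question
        return "Sorry, I don't have information on that topic."
    context = retrieve(question)
    return generate(question, context)


# Toy stand-ins for a retriever and an LLM call
def fake_retrieve(question):
    return ['Apollo 11 landed on the Moon in 1969.']


def fake_generate(question, context):
    return f'Answer based on {len(context)} retrieved passage(s).'


question = 'Who was the first man to walk on the moon?'
print(answer_or_decline(0, question, fake_retrieve, fake_generate))
print(answer_or_decline(1, question, fake_retrieve, fake_generate))
```

Assuming predict returns one value per input row, as PyOD detectors do, the first element of label would be passed as the label argument here.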

The example above is just a quick dive into the capabilities of Muzlin. Go check out the example notebooks for a more in-depth tutorial on all the different kinds of methods and possible applications.

Integrations

Muzlin supports the use of many libraries for both vector and graph based setups, and is fully integrated with MLflow for model tracking and Pydantic for validation.

Anomaly detection

  • Scikit-Learn

  • PyOD (vector)

  • PyGOD (graph)

  • PyThresh (thresholding)

Encoders

  • HuggingFace

  • OpenAI

  • Cohere

  • Azure

  • Google

  • Amazon Bedrock

  • Fastembed

Vector Index

  • LangChain

  • LlamaIndex

Resources

Table of notebooks

  • Introduction: Data prep and a simple semantic vector-based outlier detection model

  • Optimal Threshold: Methods for optimal threshold selection (unsupervised, semi-supervised, supervised)

  • Cluster-Based Filtering: Using clustering to decide if retrieved documents can answer a user’s question

  • Graph-Based Filtering: Using graph based anomaly detection for filtering semantic graph-based systems (e.g. GraphRAG)

What Else?

Besides Muzlin, there are many other great libraries that can help to improve a generative AI process flow. Check out Semantic Router, CRAG, and Scikit-LLM.


Contributing

Note: at the moment there are major changes being made and the structure of Muzlin is still being refined. For now, please leave a bug report along with any proposed code for fixes or improvements. You will be added as a co-author if it is implemented.

Once this phase has been completed, anyone is welcome to contribute to Muzlin:

  • Please share your ideas and ask questions by opening an issue.

  • To contribute, first check the Issue list for the “help wanted” tag and comment on the one that you are interested in. The issue will then be assigned to you.

  • If the bug, feature, or documentation change is novel (not in the Issue list), you can either log a new issue or create a pull request for the new changes.

  • To start, fork the dev branch and add your improvement/modification/fix.

  • To make sure the code has the same style and standard, please refer to detector.py as an example.

  • Create a pull request to the dev branch and follow the pull request template.

  • Please make sure that all code changes are accompanied by new or updated test functions. Automatic tests will be triggered. Before the pull request can be merged, make sure that all the tests pass.
