Explainable Bag-Of-Concepts implementation with automatic concept labelling using an LLM pipeline.

Project description

Explainable Bag-Of-Concepts (XBOC) Implementation

The Explainable Bag-Of-Concepts (XBOC) implementation is a text processing module that makes document embeddings explainable: each dimension of an embedding corresponds to a concept, i.e. a cluster of word vectors that can be given a human-readable label.

Documentation

You can read our documentation here.

Installation

To use the XBOC implementation, ensure that you have Python 3.6 or newer installed. You can install the module and its dependencies via pip:

pip install xboc

Usage

By default, you only need to fit the model to a corpus. The returned boc_matrix contains one embedding per document.

boc_model = XBOCModel(
    docs_train,
    word_vectors,
    idx2word,
)
boc_matrix, word2concept_list, idx2word_converter = boc_model.fit()
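To make the shape of boc_matrix more concrete, here is a minimal pure-Python sketch of the bag-of-concepts idea. It is not the xboc implementation: the word-to-concept mapping, the toy documents and the helper bag_of_concepts are all made up for illustration, and the real model derives the mapping by clustering word vectors and applies further weighting.

```python
from collections import Counter

# Hypothetical word-to-concept assignment, e.g. the result of clustering word vectors.
word2concept = {"goal": 0, "match": 0, "team": 0, "stock": 1, "market": 1, "bank": 1}
n_concepts = 2


def bag_of_concepts(tokens, word2concept, n_concepts):
    """Count how often each concept occurs in one tokenized document."""
    counts = Counter(word2concept[t] for t in tokens if t in word2concept)
    return [counts.get(c, 0) for c in range(n_concepts)]


docs = [
    ["the", "team", "won", "the", "match"],
    ["the", "stock", "market", "fell"],
    ["bank", "sponsors", "team"],
]

# One row per document, one column per concept.
toy_boc_matrix = [bag_of_concepts(d, word2concept, n_concepts) for d in docs]
print(toy_boc_matrix)  # [[2, 0], [0, 2], [1, 1]]
```

Each column can be read as "how much of this concept is in the document", which is what makes the embedding explainable once the concepts are labelled.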

Automatic Concept Labeling

boc_model = XBOCModel(
    docs_train,
    word_vectors,
    idx2word, 
    tokenizer=CustomTokenizer(),
    n_concepts=20,
    label_impl=LabelingImplementation.TEMPLATE_CHAIN,
    llm_model=LLMModel.OPENAI_GPT3_5
)
boc_matrix, word2concept_list, idx2word_converter = boc_model.fit()

Further usage

For more details on how to use the BoC model, please take a look at the DEMO notebook.

Explainability with SHAP values

Logistic Regression

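import shap

# log_reg is a fitted logistic regression; the concept labels serve as feature names in the plot.
explainer = shap.LinearExplainer(log_reg, X_train)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, feature_names=boc_model.get_concept_label())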
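For intuition about what these values mean: for a linear model, and assuming the features are independent, the SHAP value of feature i for one sample reduces to w_i * (x_i - mean_i), where mean_i is the feature's mean over the background data. The sketch below computes this by hand for a toy model. The weights, bias and data are invented, and the code illustrates the formula rather than the internals of shap.LinearExplainer. For a logistic regression, such values live in log-odds space.

```python
def linear_shap_values(weights, background, x):
    """SHAP values of a linear model for sample x, assuming independent features."""
    n = len(background)
    means = [sum(row[i] for row in background) / n for i in range(len(weights))]
    return [w * (xi - m) for w, xi, m in zip(weights, x, means)]


def predict(weights, bias, row):
    """Linear part of the model (the log-odds for a logistic regression)."""
    return bias + sum(w * v for w, v in zip(weights, row))


# Toy model over three concept features (all numbers are hypothetical).
weights = [2.0, -1.0, 0.5]
bias = 0.25
background = [[1.0, 0.0, 2.0], [3.0, 2.0, 2.0]]  # feature means: [2.0, 1.0, 2.0]
x = [4.0, 3.0, 0.0]

phi = linear_shap_values(weights, background, x)
print(phi)  # [4.0, -2.0, -1.0]

# Local accuracy: the SHAP values sum to prediction minus the mean background prediction.
expected = sum(predict(weights, bias, row) for row in background) / len(background)
print(sum(phi), predict(weights, bias, x) - expected)  # 1.0 1.0
```

With concept labels as feature names, a value such as 4.0 for the first feature reads as "this concept pushed the prediction up by 4.0 relative to the average document".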

Support Vectors

X_train_summary = shap.kmeans(docs_train_embedded, 50)
explainer = shap.KernelExplainer(svm.predict, X_train_summary)
shap_values = explainer.shap_values(docs_test_np)
shap.summary_plot(shap_values, docs_test_np, feature_names=boc_model.get_concept_label())

Random Forest

explainer = shap.TreeExplainer(random_forest)
shap_values = explainer.shap_values(docs_test_np)
shap.summary_plot(shap_values, docs_test_np, feature_names=boc_model.get_concept_label())

XGBoost

explainer = shap.TreeExplainer(xgb_classifier)
shap_values = explainer.shap_values(docs_test_np)
shap.summary_plot(shap_values, docs_test_np, feature_names=boc_model.get_concept_label())

KNN

explainer = shap.KernelExplainer(knn.predict, docs_train_embedded) 
shap_values = explainer.shap_values(docs_test_np)
shap.summary_plot(shap_values, docs_test_np, feature_names=boc_model.get_concept_label())

In comparison to BERTopic

  • c-CF-IDF normalization
  • Explainable AI - compatibility with SHAP
  • Calculate BIC and AIC (using GMMs) as well as silhouette, Davies-Bouldin and Calinski-Harabasz scores, using a user-specified clustering method, for a given list of values for K (number of concepts).
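As a rough illustration of concept-frequency weighting, the sketch below implements plain CF-IDF as described for the original Bag-of-Concepts approach: the share of a concept within a document multiplied by log(|D| / number of documents containing the concept). This is an assumption about the general idea only; the exact c-CF-IDF variant used by xboc may differ, and the function name cf_idf and the toy counts are hypothetical.

```python
import math


def cf_idf(count_matrix):
    """Weight raw concept counts by concept frequency times inverse document frequency."""
    n_docs = len(count_matrix)
    n_concepts = len(count_matrix[0])
    # Number of documents in which each concept occurs at least once.
    df = [sum(1 for row in count_matrix if row[c] > 0) for c in range(n_concepts)]
    weighted = []
    for row in count_matrix:
        total = sum(row)
        weighted.append([
            (row[c] / total) * math.log(n_docs / df[c]) if total and df[c] else 0.0
            for c in range(n_concepts)
        ])
    return weighted


# Rows are documents, columns are concepts (toy counts).
counts = [[2, 0, 1], [0, 2, 1], [1, 1, 2]]
weighted = cf_idf(counts)
for row in weighted:
    print([round(v, 3) for v in row])
```

The third concept occurs in every document, so its inverse document frequency is log(1) = 0 and it is weighted down to zero, much like a stop word under TF-IDF.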

Limitations

  • Spherical KMeans is slow.
  • Names pollute the clusters in vector space (2D plots would probably help to inspect this).
  • Scores are lower than those reported for the original BoC, most likely because of the word vectors used.

Changelog of the project in comparison to BoC

This project implements a flexible BoC module with automatic concept labelling using LLMs.

  • Automatic Concept Labeling
    • The user can use our predefined prompts for OpenAI's GPT3.5-Turbo
    • The user can provide their own LangChain chain, which we invoke with the words that have to be labelled
    • The user can specify how many of the top N words belonging to a cluster to use
  • Flexible Clustering
    • Spherical KMeans (the default; used in the BoC paper)
    • KMeans
    • Agglomerative Clustering
    • Spectral
  • Ability to encode new documents
  • Ability to save and load the model
  • Get the top N words for a concept.
  • Calculate BIC and AIC (using GMMs) as well as silhouette, Davies-Bouldin and Calinski-Harabasz scores, using a user-specified clustering method, for a given list of values for K (number of concepts).
  • The output is compatible with SHAP values visualizations
    • The user can train any kind of model and use SHAP to visualize the feature importance.
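Spherical KMeans groups word vectors by cosine similarity rather than Euclidean distance. The sketch below shows only the assignment step and one plausible way to rank the top N words of a concept, namely by similarity to the concept's centroid. The vectors, centroids and helper names are hypothetical, and xboc may rank words differently.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


# Toy 2D word vectors and two fixed centroids (in practice these are learned).
word_vectors = {
    "goal": [0.9, 0.1],
    "match": [0.8, 0.3],
    "team": [1.0, 0.0],
    "stock": [0.1, 0.9],
    "market": [0.0, 1.0],
}
centroids = [[1.0, 0.1], [0.1, 1.0]]


def assign(vec):
    """Assignment step: pick the centroid with the highest cosine similarity."""
    return max(range(len(centroids)), key=lambda c: cosine(vec, centroids[c]))


word2concept = {word: assign(vec) for word, vec in word_vectors.items()}


def top_n_words(concept, n):
    """Words of a concept, ranked by cosine similarity to its centroid."""
    members = [w for w, c in word2concept.items() if c == concept]
    members.sort(key=lambda w: cosine(word_vectors[w], centroids[concept]), reverse=True)
    return members[:n]


print(top_n_words(0, 2))  # ['goal', 'team']
print(top_n_words(1, 2))  # ['stock', 'market']
```

A short list of top words like this is also the kind of input that a labelling prompt can turn into a single concept name.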

Download files

Source Distribution

xboc-0.1.2.tar.gz (11.9 kB)

Uploaded Source

Built Distribution

xboc-0.1.2-py3-none-any.whl (12.0 kB)

Uploaded Python 3

File details

Details for the file xboc-0.1.2.tar.gz.

File metadata

  • Download URL: xboc-0.1.2.tar.gz
  • Upload date:
  • Size: 11.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.12.2 Darwin/23.4.0

File hashes

Hashes for xboc-0.1.2.tar.gz
Algorithm Hash digest
SHA256 b836c6c0f9bfc06553b3724f7f82613435c0e63e48728614d6f4e27b90788c33
MD5 645d4369ab899c672985822a83e02907
BLAKE2b-256 6bf7f07950aae2aff1e6101ccfd4e387ad2aa4270499f124271d503b9a866828

File details

Details for the file xboc-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: xboc-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 12.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.12.2 Darwin/23.4.0

File hashes

Hashes for xboc-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 33ed096e6884314fd42e97cb7985c37fd440efc772c0aa4faa309c4987772ec6
MD5 52681fdd37bbe48b5629b86567220cf1
BLAKE2b-256 44ace4e9efd00a98aa395e0a6bedac2ac7ec469731c867eb1cd60ec1c195825a
