Explainable Bag-Of-Concepts implementation with automatic concept labelling using an LLM pipeline.
Project description
Explainable Bag-Of-Concepts (XBOC) Implementation
The Explainable Bag-Of-Concepts (XBOC) implementation is a text processing module that adds explainability to document embeddings.
Documentation
You can read our documentation here.
Installation
To use the XBOC implementation, ensure that you have Python 3.6 or newer installed. You can install the module and its dependencies via pip:
pip install xboc
Usage
By default, you only need to fit the model to a corpus. The returned boc_matrix then contains one embedding per document.
boc_model = XBOCModel(
docs_train,
word_vectors,
idx2word,
)
boc_matrix, word2concept_list, idx2word_converter = boc_model.fit()
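To make the idea of a concept-based document embedding concrete, the sketch below counts how often each concept occurs in a document and weights the counts with CF-IDF, the weighting described in the original Bag-of-Concepts paper. It is an illustration only and not the package's internal code: `cf_idf_matrix`, `word2concept`, and `docs` are hypothetical names, and the toy word-to-concept mapping is assumed rather than learned by clustering.

```python
import math
from collections import Counter

def cf_idf_matrix(docs, word2concept, n_concepts):
    """Return one CF-IDF weighted concept vector per tokenized document."""
    # Count concept occurrences per document; words without a concept are skipped.
    counts = []
    for tokens in docs:
        c = Counter(word2concept[t] for t in tokens if t in word2concept)
        counts.append([c.get(k, 0) for k in range(n_concepts)])

    # Document frequency: number of documents in which each concept occurs.
    n_docs = len(docs)
    df = [sum(1 for row in counts if row[k] > 0) for k in range(n_concepts)]

    matrix = []
    for row in counts:
        total = sum(row)
        vec = []
        for k in range(n_concepts):
            if total == 0 or df[k] == 0:
                vec.append(0.0)
            else:
                cf = row[k] / total               # concept frequency in this document
                idf = math.log(n_docs / df[k])    # inverse document frequency
                vec.append(cf * idf)
        matrix.append(vec)
    return matrix

# Toy mapping (assumed, not learned): concept 0 = animals, concept 1 = finance.
word2concept = {"cat": 0, "dog": 0, "bank": 1, "loan": 1}
docs = [["cat", "dog", "cat"], ["bank", "loan"], ["dog", "bank", "the"]]
boc = cf_idf_matrix(docs, word2concept, n_concepts=2)
```

Each row of `boc` plays the role of one document embedding: its columns are concepts, which is what makes the features nameable once the concepts are labelled.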
Automatic Concept Labeling
boc_model = XBOCModel(
docs_train,
word_vectors,
idx2word,
tokenizer=CustomTokenizer(),
n_concepts=20,
label_impl=LabelingImplementation.TEMPLATE_CHAIN,
llm_model=LLMModel.OPENAI_GPT3_5
)
boc_matrix, word2concept_list, idx2word_converter = boc_model.fit()
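Concept labelling sends a handful of representative words per cluster to an LLM. The sketch below shows one plausible way to prepare that input: rank the words of a cluster by cosine similarity to the cluster centroid, keep the top N, and format them into a prompt. No LLM is called. `top_n_words`, `build_label_prompt`, and the prompt wording are hypothetical and do not reproduce the package's predefined prompts or its ranking logic.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_n_words(words, vectors, centroid, n):
    """Return the n words whose vectors are most similar to the centroid."""
    ranked = sorted(
        zip(words, vectors),
        key=lambda wv: cosine(wv[1], centroid),
        reverse=True,
    )
    return [w for w, _ in ranked[:n]]

def build_label_prompt(top_words):
    # Hypothetical prompt wording, not the package's predefined prompt.
    return (
        "Give a short label (one to three words) for the concept "
        "described by these words: " + ", ".join(top_words)
    )

# Toy cluster with 2D vectors (assumed values for illustration).
words = ["loan", "bank", "river", "credit"]
vectors = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [1.0, 0.0]]
centroid = [1.0, 0.1]

top = top_n_words(words, vectors, centroid, n=3)
prompt = build_label_prompt(top)
```

Keeping N small limits prompt length and keeps outlier words such as "river" out of the label request.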
Further usage
For more details on how to use the BoC model, please take a look at the DEMO notebook.
Explainability with SHAP values
Logistic Regression
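import shap
# log_reg is a fitted linear model; X_train serves as the background data.
explainer = shap.LinearExplainer(log_reg, X_train)
shap_values = explainer.shap_values(X_test)
# Name the features with the concept labels instead of column indices.
shap.summary_plot(shap_values, X_test, feature_names=boc_model.get_concept_label())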
Support Vectors
X_train_summary = shap.kmeans(docs_train_embedded, 50)
explainer = shap.KernelExplainer(svm.predict, X_train_summary)
shap_values = explainer.shap_values(docs_test_np)
shap.summary_plot(shap_values, docs_test_np, feature_names=boc_model.get_concept_label())
Random Forest
explainer = shap.TreeExplainer(random_forest)
shap_values = explainer.shap_values(docs_test_np)
shap.summary_plot(shap_values, docs_test_np, feature_names=boc_model.get_concept_label())
XGBoost
explainer = shap.TreeExplainer(xgb_classifier)
shap_values = explainer.shap_values(docs_test_np)
shap.summary_plot(shap_values, docs_test_np, feature_names=boc_model.get_concept_label())
KNN
explainer = shap.KernelExplainer(knn.predict, docs_train_embedded)
shap_values = explainer.shap_values(docs_test_np)
shap.summary_plot(shap_values, docs_test_np, feature_names=boc_model.get_concept_label())
In comparison to BERTopic
- c-CF-IDF normalization
- Explainable AI - compatibility with SHAP
- Calculate BIC and AIC using GMMs, as well as silhouette, Davies-Bouldin, and Calinski-Harabasz scores using a user-specified clustering method, for a given list of values for K (number of concepts).
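The kind of model selection named in the last item can be sketched with scikit-learn directly. The example below is not the package's API: `score_k_values` is a hypothetical helper, plain KMeans stands in for the user-specified clustering method, and synthetic 2D blobs stand in for word vectors.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.mixture import GaussianMixture

def score_k_values(vectors, k_values, seed=0):
    """Return a dict mapping each k to its clustering quality scores."""
    results = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(vectors)
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(vectors)
        results[k] = {
            "silhouette": silhouette_score(vectors, labels),                # higher is better
            "davies_bouldin": davies_bouldin_score(vectors, labels),        # lower is better
            "calinski_harabasz": calinski_harabasz_score(vectors, labels),  # higher is better
            "bic": gmm.bic(vectors),                                        # lower is better
            "aic": gmm.aic(vectors),                                        # lower is better
        }
    return results

# Synthetic stand-in for word vectors: three well-separated 2D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
vectors = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

scores = score_k_values(vectors, k_values=[2, 3, 4])
best_k = max(scores, key=lambda k: scores[k]["silhouette"])
```

The criteria do not always agree on real word vectors, so it is usually worth inspecting all of them across the candidate values of K rather than relying on a single score.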
Limitations
- Spherical KMeans is slow.
- Names pollute clusters in the vector space (2D plots would probably help to inspect this).
- Scores fall short of the original BoC, most likely because of the word vectors used.
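For readers unfamiliar with spherical k-means: it differs from standard k-means in that vectors and centroids are kept on the unit sphere and points are assigned by cosine similarity. The NumPy sketch below illustrates the algorithm in vectorized form. It is not the implementation used by this package, and `spherical_kmeans` with its parameters is a hypothetical name.

```python
import numpy as np

def spherical_kmeans(vectors, n_clusters, n_iter=20, seed=0):
    """Cluster rows of `vectors` by cosine similarity; return (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Project every vector onto the unit sphere so a dot product equals cosine similarity.
    x = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    # Initialise centroids with randomly chosen data points (fancy indexing copies them).
    centroids = x[rng.choice(len(x), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assign each vector to the centroid with the highest cosine similarity.
        labels = np.argmax(x @ centroids.T, axis=1)
        for k in range(n_clusters):
            members = x[labels == k]
            if len(members) == 0:
                continue  # keep the old centroid for an empty cluster
            total = members.sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:
                # Re-normalise so the centroid stays on the unit sphere.
                centroids[k] = total / norm
    labels = np.argmax(x @ centroids.T, axis=1)
    return labels, centroids

# Toy data: three vectors pointing roughly along x, three roughly along y.
vectors = np.array([
    [1.0, 0.1], [2.0, 0.0], [0.9, -0.1],
    [0.0, 1.0], [0.1, 3.0], [-0.1, 0.8],
])
labels, centroids = spherical_kmeans(vectors, n_clusters=2)
```

Because assignment reduces to a single matrix product per iteration, most of the cost in a slow implementation tends to come from Python-level loops over points rather than from the algorithm itself.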
Changelog of the project in comparison to BoC
This project implements a flexible BoC module with automatic concept labelling using LLMs.
- Automatic Concept Labeling
  - The user can use our predefined prompts for OpenAI's GPT-3.5 Turbo.
  - The user can provide a custom LangChain chain, which we invoke with the words that have to be labelled.
  - The user can specify how many of the top N words belonging to a cluster to use.
- Flexible Clustering
  - Spherical KMeans (the default; used in the BoC paper)
  - KMeans
  - Agglomerative Clustering
  - Spectral Clustering
- Ability to encode new documents
- Ability to save and load the model
- Get the top N words for a concept.
- Calculate BIC and AIC using GMMs, as well as silhouette, Davies-Bouldin, and Calinski-Harabasz scores using a user-specified clustering method, for a given list of values for K (number of concepts).
- The output is compatible with SHAP value visualizations.
  - The user can train any kind of model and use SHAP to visualize the feature importance.