Seamless Feature Extraction and Interpretation of Text Columns in Tabular Data Using Large Language Models
TabuLLM
Python package for feature extraction and interpretation of text columns in tabular data using large language models.
Overview
TabuLLM integrates LLM-based text embeddings into scikit-learn pipelines for tabular datasets containing text columns. Built on LangChain and scikit-learn, it provides sklearn-compatible transformers for embedding, dimensionality reduction, and cluster interpretation.
Installation
pip install tabullm
Core Components
TextColumnTransformer - Wraps LangChain embedding models (OpenAI, Anthropic, HuggingFace, etc.) behind an sklearn-compatible interface. Handles multiple text columns with configurable concatenation and optional L2 normalization (normalize=True). Use estimate_tokens() to preview API cost before embedding.
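estimate_tokens() is TabuLLM's own method; as a standalone illustration of the cost-preview idea, a rough estimate can use the common rule of thumb of roughly four characters per token for English text. The per-token price below is purely illustrative, not a real quote:

```python
# Rough cost preview for an embedding call, independent of TabuLLM.
# Assumes ~4 characters per token (a common heuristic for English text).
texts = ['Senior data engineer, remote', 'Entry-level cashier, NYC']
approx_tokens = sum(len(t) for t in texts) / 4
price_per_1k_tokens = 0.00002  # illustrative USD rate, not a provider quote
print(f'~{approx_tokens:.0f} tokens, '
      f'~${approx_tokens / 1000 * price_per_1k_tokens:.8f}')
```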
GMMFeatureExtractor - Extends sklearn's GaussianMixture with a transform() method that returns per-cluster log-joint features $\log p(\mathbf{x}, c_k)$ — the quantity the GMM maximizes for hard assignment — enabling use in sklearn pipelines. An optional include_log_density parameter appends the marginal log-density as an explicit outlier score. A companion assignment_confidence_stats() method returns per-observation cluster quality diagnostics (max_posterior, entropy, log_joint_margin, log_density).
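The log-joint quantity can be illustrated with plain sklearn (a sketch of the underlying math, not TabuLLM's implementation): $\log p(\mathbf{x}, c_k) = \log p(c_k \mid \mathbf{x}) + \log p(\mathbf{x})$, and the argmax over $k$ recovers the hard assignment.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
log_density = gm.score_samples(X)    # log p(x), the marginal log-density
posterior = gm.predict_proba(X)      # p(c_k | x)
log_joint = np.log(posterior + 1e-300) + log_density[:, None]  # log p(x, c_k)

# argmax over clusters of the log-joint reproduces the hard assignment
assert (log_joint.argmax(axis=1) == gm.predict(X)).all()
print(log_joint.shape)  # one feature per mixture component
```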
ClusterExplainer - Generates natural language cluster descriptions using LLMs with automatic recursive summarization that scales to arbitrarily large datasets. Supports:
- Cost preview (preview=True) before LLM calls
- Optional outcome-based statistical testing (y) to characterize which clusters associate with a target variable
- Per-observation covariates (observation_stats), e.g. from assignment_confidence_stats(), appended to the association table
- A synthesis step (synthesize=True) that produces a coherent interpretive narrative across all cluster results
- An outcome label (y_label) used only in the synthesis prompt; cluster descriptions are generated without knowledge of y (blind labeling principle)
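The recursive-summarization idea behind the scaling claim can be sketched as follows (TabuLLM's actual prompts and chunking are its own; fake_llm below is a stub standing in for a real LLM call): split the texts into chunks that fit a context window, summarize each chunk, then summarize the summaries until one description remains.

```python
def summarize(texts, llm_call, chunk_size=50):
    """Recursively reduce an arbitrarily long list of texts to one summary."""
    if len(texts) <= chunk_size:
        return llm_call(texts)
    chunks = [texts[i:i + chunk_size] for i in range(0, len(texts), chunk_size)]
    return summarize([llm_call(chunk) for chunk in chunks], llm_call, chunk_size)

def fake_llm(batch):
    # Stub standing in for a real LLM call, to show the recursion terminates.
    return f'summary of {len(batch)} texts'

# 120 docs -> 3 chunk summaries -> 1 final summary
print(summarize([f'doc {i}' for i in range(120)], fake_llm, chunk_size=50))
```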
load_fraud() - Data utility that downloads and caches the fraud detection dataset from Zenodo (no credentials required), returning features, labels, and metadata.
Quick Example
from tabullm import load_fraud, TextColumnTransformer, GMMFeatureExtractor, ClusterExplainer
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Load data
X, y, metadata = load_fraud()
text_cols = ['title', 'location', 'department', 'company_profile',
'description', 'requirements', 'benefits']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, stratify=y, random_state=42
)
# Embed text columns
embedding_model = HuggingFaceEmbeddings(
model_name='sentence-transformers/all-MiniLM-L6-v2'
)
text_transformer = TextColumnTransformer(model=embedding_model)
# Build pipeline: Embed → Reduce → Classify
pipeline = Pipeline([
('embed', text_transformer),
('reduce', GMMFeatureExtractor(n_components=10, random_state=42)),
('classify', RandomForestClassifier(n_estimators=100, random_state=42))
])
# Fit and predict
pipeline.fit(X_train[text_cols], y_train)
y_pred = pipeline.predict_proba(X_test[text_cols])[:, 1]
# Interpret clusters
explainer = ClusterExplainer(
llm=ChatOpenAI(model='gpt-4o-mini'),
text_transformer=text_transformer,
observations='job postings',
text_fields='title, location, department, company profile, '
'description, requirements, and benefits'
)
gmm = pipeline.named_steps['reduce']
cluster_labels = gmm.labels_
# Cluster descriptions only
result_df = explainer.explain(X_train[text_cols], cluster_labels)
# With outcome association + synthesis narrative
result_df, global_stats, synthesis = explainer.explain(
X_train[text_cols], cluster_labels,
y=y_train,
y_label='fraudulent posting (1=fraud, 0=legitimate)',
synthesize=True
)
# Include GMM cluster quality diagnostics in the association table
obs_stats = gmm.assignment_confidence_stats(
pipeline.named_steps['embed'].transform(X_train[text_cols])
)
result_df, global_stats, stat_assoc_df, synthesis = explainer.explain(
X_train[text_cols], cluster_labels,
y=y_train,
y_label='fraudulent posting (1=fraud, 0=legitimate)',
observation_stats=obs_stats,
synthesize=True
)
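Because every step exposes the sklearn estimator interface, pipeline hyperparameters can be tuned jointly, e.g. with GridSearchCV. The sketch below substitutes a random matrix for precomputed embeddings and a minimal GaussianMixture subclass for TabuLLM's GMMFeatureExtractor (both are stand-ins, not the package's code):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

class GMMFeatures(GaussianMixture):
    """Stand-in for GMMFeatureExtractor: expose log-joint features."""
    def transform(self, X):
        # log p(x, c_k) = log p(c_k | x) + log p(x)
        return np.log(self.predict_proba(X) + 1e-300) + self.score_samples(X)[:, None]

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 16))                     # stand-in for embeddings
y = (X[:, 0] + rng.normal(0, 0.5, size=200) > 0).astype(int)

pipe = Pipeline([
    ('reduce', GMMFeatures(random_state=42)),
    ('classify', RandomForestClassifier(n_estimators=50, random_state=42)),
])
search = GridSearchCV(pipe, {'reduce__n_components': [2, 4, 8]}, cv=3)
search.fit(X, y)
print(search.best_params_)
```

The step__param naming ('reduce__n_components') is standard sklearn pipeline convention, so the same pattern should apply to the real GMMFeatureExtractor step.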
Examples
The examples/ folder contains Jupyter notebooks demonstrating common workflows:
- 01_fraud_detection_walkthrough.ipynb - core TabuLLM workflow on the fraud detection dataset: TF-IDF vs. LLM embeddings, GMM-based dimensionality reduction with cluster quality diagnostics, full ClusterExplainer usage (cost preview, outcome-based testing, per-observation diagnostics, narrative synthesis), and a predictive pipeline combining text and structured features
- 02_advanced_pipelines.ipynb - advanced pipeline patterns: forward/backward column sweep to measure marginal contribution of each text column, and stacking ensembles (single-split and multi-split) that process column groups independently and combine predictions via a meta-learner
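The forward column sweep can be sketched in a few lines (toy data and TF-IDF stand in for the fraud table and LLM embeddings; the notebook's version differs in detail): repeatedly add the text column whose inclusion most improves cross-validated performance.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy frame standing in for the fraud dataset's text columns.
X = pd.DataFrame({
    'title': ['work from home', 'data engineer', 'easy cash now', 'nurse',
              'quick money', 'accountant', 'no experience needed', 'teacher'],
    'description': ['wire us a fee', 'build ETL pipelines', 'pay to apply',
                    'hospital shifts', 'instant payout', 'ledger reviews',
                    'send bank details', 'classroom duties'],
})
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])

selected, remaining = [], list(X.columns)
while remaining:
    # Score each candidate column when appended to the current selection.
    scores = {}
    for col in remaining:
        text = X[selected + [col]].agg(' '.join, axis=1)  # concatenate columns
        pipe = Pipeline([('tfidf', TfidfVectorizer()),
                         ('clf', LogisticRegression())])
        scores[col] = cross_val_score(pipe, text, y, cv=2).mean()
    best = max(scores, key=scores.get)
    print(f'add {best}: CV accuracy {scores[best]:.2f}')
    selected.append(best)
    remaining.remove(best)
```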
Key Features
- sklearn-compatible API (Pipeline, ColumnTransformer, GridSearchCV)
- Access to 50+ embedding models via LangChain
- Multi-column text handling with flexible concatenation
- Optional L2 normalization of embedding vectors
- Token and cost estimation before embedding API calls
- GMM-based dimensionality reduction with per-cluster log-joint features
- Optional marginal log-density feature for explicit outlier scoring
- Per-observation cluster quality diagnostics (max posterior, entropy, log-joint margin, log density)
- Automatic recursive summarization for arbitrarily large datasets
- Cost estimation for LLM explanation calls
- Outcome-based cluster characterization (binary and continuous outcomes)
- User-supplied per-observation covariates in the association table
- Synthesis narrative connecting cluster descriptions to outcome patterns
- Blind labeling: cluster descriptions generated without knowledge of outcome vector
Release Notes
1.2.0 — Removed SphericalKMeans class. For L2-normalized embeddings, sklearn's KMeans is mathematically equivalent; GMMFeatureExtractor provides strictly richer features for pipeline use.
1.1.0 — Added multiple testing correction to explain() via the correction parameter ('bonferroni', 'holm', 'fdr_bh'). When set, a P-value (adjusted) column is appended to the per-cluster results and, when observation_stats is provided, to the stat-association table. Backward compatible: default is None (no correction).
1.0.3 — Fixed broken package installation (1.0.2 wheel was published without Python source files).
1.0.2 — Fixed __version__ mismatch; aligned __init__.py with pyproject.toml.
1.0.1 — Switched fraud dataset download from Kaggle to Zenodo (no credentials required).
1.0.0 — Initial release.
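The 1.2.0 equivalence claim rests on a unit-sphere identity: for L2-normalized x and c, ||x - c||^2 = 2 - 2 x·c, so the nearest centroid by Euclidean distance and by cosine similarity coincide. A quick numerical check of the assignment step (assuming unit-norm centroids, as spherical k-means maintains):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # L2-normalize observations
C = rng.normal(size=(5, 8))
C /= np.linalg.norm(C, axis=1, keepdims=True)   # unit-norm centroids

# On the unit sphere: ||x - c||^2 = 2 - 2 * (x . c), so the argmins agree.
euclidean_nearest = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1).argmin(axis=1)
cosine_nearest = (X @ C.T).argmax(axis=1)
assert (euclidean_nearest == cosine_nearest).all()
```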
Citation
Sharabiani, M.T.A., Mahani, A.S., Bottle, A. et al. (2025). GenAI exceeds clinical experts in predicting acute kidney injury following paediatric cardiopulmonary bypass. Scientific Reports, 15, 20847. https://doi.org/10.1038/s41598-025-04651-8
File details
Details for the file tabullm-1.2.0.tar.gz.
File metadata
- Download URL: tabullm-1.2.0.tar.gz
- Upload date:
- Size: 40.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 68dee8d09bfeac28d59e750aade962b828deca36a57fc9b246ba396534993c5d |
| MD5 | a4a2a7238c4b293ad9f0f28dcec4167e |
| BLAKE2b-256 | 1f1a0572e9d7dbec7cc385a780d28302291cc4899e14a7770cc14b13342e0e6c |
File details
Details for the file tabullm-1.2.0-py3-none-any.whl.
File metadata
- Download URL: tabullm-1.2.0-py3-none-any.whl
- Upload date:
- Size: 29.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d970a0eef085b2f021aaea5607cc050a03c8299ec573d63953d41a15b6f8e119 |
| MD5 | db8e8a688a064b5818baf7173fcbb6e9 |
| BLAKE2b-256 | d3578fcc958754d88fdc1980af315f4ad655aba02b5ccb75a1064d733ef31827 |