Seamless Feature Extraction and Interpretation of Text Columns in Tabular Data Using Large Language Models

Maintainers

asmahani

TabuLLM

Python package for feature extraction and interpretation of text columns in tabular data using large language models.

Overview

TabuLLM integrates LLM-based text embeddings into scikit-learn pipelines for tabular datasets containing text columns. Built on LangChain and scikit-learn, it provides sklearn-compatible transformers for embedding, dimensionality reduction, and cluster interpretation.

Installation

pip install tabullm

Core Components

TextColumnTransformer - Wraps LangChain embedding models (OpenAI, Anthropic, HuggingFace, etc.) with a sklearn interface. Handles multiple text columns with configurable concatenation and optional L2 normalization (normalize=True). Use estimate_tokens() to preview API cost before embedding.
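A minimal sketch of the two preprocessing ideas described above, column concatenation and L2 normalization, using only pandas and NumPy. The helper names `concat_text_columns` and `l2_normalize`, the "name: value" template, and the separator are illustrative assumptions, not TabuLLM's internals; the random matrix merely stands in for real embedding output.

```python
import numpy as np
import pandas as pd


def concat_text_columns(df, columns, sep="\n"):
    """Join several text columns into one string per row (hypothetical helper)."""
    # Missing values become empty strings so every row yields a valid prompt.
    filled = df[columns].fillna("").astype(str)
    return filled.apply(
        lambda row: sep.join(f"{col}: {row[col]}" for col in columns), axis=1
    ).tolist()


def l2_normalize(vectors, eps=1e-12):
    """Scale each row to unit Euclidean length; all-zero rows stay zero."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)


df = pd.DataFrame({
    "title": ["Data analyst", "Office clerk"],
    "location": ["London", None],
})
texts = concat_text_columns(df, ["title", "location"])

# Stand-in for an embedding model's output: one 4-dimensional vector per row.
rng = np.random.default_rng(0)
embeddings = l2_normalize(rng.normal(size=(len(texts), 4)))
```

After normalization, the dot product of two rows equals their cosine similarity, which is one common reason to normalize before a distance-based step.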

GMMFeatureExtractor - Extends sklearn's GaussianMixture with a transform() method that returns per-cluster log-joint features $\log p(\mathbf{x}, c_k)$ — the quantity the GMM maximises for hard assignment — enabling use in sklearn pipelines. An optional include_log_density parameter appends the marginal log-density as an explicit outlier score. A companion assignment_confidence_stats() method returns per-observation cluster quality diagnostics (max_posterior, entropy, log_joint_margin, log_density).
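To make the log-joint features concrete, the sketch below computes $\log p(\mathbf{x}, c_k) = \log \pi_k + \log \mathcal{N}(\mathbf{x} \mid \mu_k, \Sigma_k)$ from a plain scikit-learn GaussianMixture fitted on synthetic data, then derives diagnostics of the kind listed above. It does not use TabuLLM itself, and the exact definitions of the diagnostics (in particular, log_joint_margin as the gap between the two largest log-joint values) are assumptions made for illustration.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data stands in for (reduced) text embeddings.
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)

# log p(x, c_k) = log pi_k + log N(x | mu_k, Sigma_k), one column per component.
log_joint = np.column_stack([
    np.log(gmm.weights_[k])
    + multivariate_normal.logpdf(X, mean=gmm.means_[k], cov=gmm.covariances_[k])
    for k in range(gmm.n_components)
])

# Marginal log-density log p(x): log-sum-exp of the log-joint over components.
log_density = logsumexp(log_joint, axis=1)

# Posterior p(c_k | x) follows by subtracting the marginal in log space.
log_posterior = log_joint - log_density[:, None]
posterior = np.exp(log_posterior)

# Assumed diagnostic definitions (illustrative only).
max_posterior = posterior.max(axis=1)
entropy = -np.sum(posterior * log_posterior, axis=1)
top_two = np.sort(log_joint, axis=1)[:, -2:]
log_joint_margin = top_two[:, 1] - top_two[:, 0]

# Hard assignment picks the component with the largest log-joint value.
hard_labels = log_joint.argmax(axis=1)
```

Because the marginal $\log p(\mathbf{x})$ is the same for every component, the argmax of the log-joint coincides with the argmax of the posterior, which is why it serves as the hard-assignment criterion.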

ClusterExplainer - Generates natural language cluster descriptions using LLMs with automatic recursive summarization that scales to arbitrarily large datasets. Supports:

Cost preview (preview=True) before LLM calls
Optional outcome-based statistical testing (y) to characterize which clusters associate with a target variable
Per-observation covariates (observation_stats) — e.g., from assignment_confidence_stats() — appended to the association table
A synthesis step (synthesize=True) that produces a coherent interpretive narrative across all cluster results
An outcome label (y_label) used only in the synthesis prompt; cluster descriptions are generated without knowledge of y (blind labeling principle)
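The recursive summarization idea can be sketched in a few lines of plain Python. This is a generic map-then-reduce pattern under stated assumptions, not ClusterExplainer's actual implementation: the function name `recursive_summarize`, the character budget `max_chars`, and the stub summarizer (which stands in for an LLM call) are all hypothetical.

```python
def recursive_summarize(texts, summarize, max_chars=2000):
    """Reduce a list of texts to one summary via repeated batch-and-summarize rounds.

    `summarize` takes a list of strings and returns one string (an LLM call in
    practice). Every batch except possibly the last holds at least two texts,
    so each round shrinks the list and the loop is guaranteed to terminate.
    A batch of two very long texts may therefore exceed `max_chars`.
    """
    if not texts:
        raise ValueError("texts must be non-empty")
    while True:
        batches, current, size = [], [], 0
        for text in texts:
            # Start a new batch once the budget is hit and the batch has >= 2 items.
            if len(current) >= 2 and size + len(text) > max_chars:
                batches.append(current)
                current, size = [], 0
            current.append(text)
            size += len(text)
        batches.append(current)

        summaries = [summarize(batch) for batch in batches]
        if len(summaries) == 1:
            return summaries[0]
        texts = summaries  # Summaries become the inputs of the next round.


calls = []


def stub_summarize(batch):
    """Deterministic stand-in for an LLM: keep the first 10 characters of each text."""
    calls.append(len(batch))
    return "; ".join(t[:10] for t in batch)


postings = [f"posting {i}: " + "x" * 50 for i in range(10)]
final_summary = recursive_summarize(postings, stub_summarize, max_chars=200)
```

Counting the entries in `calls` before running a real model gives a rough idea of how many LLM requests a given budget implies, which is the same motivation as a cost preview.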

load_fraud() - Data utility that downloads and caches the fraud detection dataset from Zenodo (no credentials required), returning features, labels, and metadata.

Quick Example

from tabullm import load_fraud, TextColumnTransformer, GMMFeatureExtractor, ClusterExplainer
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load data
X, y, metadata = load_fraud()
text_cols = ['title', 'location', 'department', 'company_profile',
             'description', 'requirements', 'benefits']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Embed text columns
embedding_model = HuggingFaceEmbeddings(
    model_name='sentence-transformers/all-MiniLM-L6-v2'
)
text_transformer = TextColumnTransformer(model=embedding_model)

# Build pipeline: Embed → Reduce → Classify
pipeline = Pipeline([
    ('embed', text_transformer),
    ('reduce', GMMFeatureExtractor(n_components=10, random_state=42)),
    ('classify', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Fit and predict
pipeline.fit(X_train[text_cols], y_train)
y_pred = pipeline.predict_proba(X_test[text_cols])[:, 1]

# Interpret clusters
explainer = ClusterExplainer(
    llm=ChatOpenAI(model='gpt-4o-mini'),
    text_transformer=text_transformer,
    observations='job postings',
    text_fields='title, location, department, company profile, '
               'description, requirements, and benefits'
)

gmm = pipeline.named_steps['reduce']
cluster_labels = gmm.labels_

# Cluster descriptions only
result_df = explainer.explain(X_train[text_cols], cluster_labels)

# With outcome association + synthesis narrative
result_df, global_stats, synthesis = explainer.explain(
    X_train[text_cols], cluster_labels,
    y=y_train,
    y_label='fraudulent posting (1=fraud, 0=legitimate)',
    synthesize=True
)

# Include GMM cluster quality diagnostics in the association table
obs_stats = gmm.assignment_confidence_stats(
    pipeline.named_steps['embed'].transform(X_train[text_cols])
)
result_df, global_stats, stat_assoc_df, synthesis = explainer.explain(
    X_train[text_cols], cluster_labels,
    y=y_train,
    y_label='fraudulent posting (1=fraud, 0=legitimate)',
    observation_stats=obs_stats,
    synthesize=True
)
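
The outcome association requested with y= above can be pictured with a small standalone sketch. This README does not state which statistical tests TabuLLM runs, so the version below is an assumption: for a binary outcome it compares each cluster against all other observations with Fisher's exact test from SciPy. The function name `cluster_outcome_table` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import fisher_exact


def cluster_outcome_table(cluster_labels, y):
    """One row per cluster: size, outcome rate, and a cluster-vs-rest Fisher test."""
    cluster_labels = np.asarray(cluster_labels)
    y = np.asarray(y).astype(bool)
    rows = []
    for k in np.unique(cluster_labels):
        in_k = cluster_labels == k
        # 2x2 contingency table: rows = in cluster / not, columns = outcome 1 / 0.
        table = [
            [int(np.sum(in_k & y)), int(np.sum(in_k & ~y))],
            [int(np.sum(~in_k & y)), int(np.sum(~in_k & ~y))],
        ]
        odds_ratio, p_value = fisher_exact(table)
        rows.append({
            "cluster": int(k),
            "n": int(in_k.sum()),
            "outcome_rate": float(y[in_k].mean()),
            "odds_ratio": float(odds_ratio),
            "p_value": float(p_value),
        })
    return pd.DataFrame(rows)


# Toy data: cluster 0 has 20 positives out of 50, cluster 1 has 2 out of 50.
toy_labels = np.array([0] * 50 + [1] * 50)
toy_y = np.array([1] * 20 + [0] * 30 + [1] * 2 + [0] * 48)
assoc_df = cluster_outcome_table(toy_labels, toy_y)
```

With many clusters, the per-cluster p-values would normally need a multiple-testing correction before being interpreted.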

Examples

The examples/ folder contains Jupyter notebooks demonstrating common workflows:

01_fraud_detection_walkthrough.ipynb — core TabuLLM workflow on the fraud detection dataset: TF-IDF vs. LLM embeddings, GMM-based dimensionality reduction with cluster quality diagnostics, full ClusterExplainer usage (cost preview, outcome-based testing, per-observation diagnostics, narrative synthesis), and a predictive pipeline combining text and structured features
02_advanced_pipelines.ipynb — advanced pipeline patterns: forward/backward column sweep to measure marginal contribution of each text column, and stacking ensembles (single-split and multi-split) that process column groups independently and combine predictions via a meta-learner

Key Features

sklearn-compatible API (Pipeline, ColumnTransformer, GridSearchCV)
Access to 50+ embedding models via LangChain
Multi-column text handling with flexible concatenation
Optional L2 normalization of embedding vectors
Token and cost estimation before embedding API calls
GMM-based dimensionality reduction with per-cluster log-joint features
Optional marginal log-density feature for explicit outlier scoring
Per-observation cluster quality diagnostics (max posterior, entropy, log-joint margin, log density)
Automatic recursive summarization for arbitrarily large datasets
Cost estimation for LLM explanation calls
Outcome-based cluster characterization (binary and continuous outcomes)
User-supplied per-observation covariates in the association table
Synthesis narrative connecting cluster descriptions to outcome patterns
Blind labeling: cluster descriptions generated without knowledge of outcome vector

Release Notes

See CHANGELOG.md.

Citation

Sharabiani, M.T.A., Mahani, A.S., Bottle, A. et al. (2025). GenAI exceeds clinical experts in predicting acute kidney injury following paediatric cardiopulmonary bypass. Scientific Reports, 15, 20847. https://doi.org/10.1038/s41598-025-04651-8

Release history

1.3.0 (this version): May 21, 2026
1.2.1: Apr 2, 2026
1.2.0: Mar 31, 2026
1.1.0: Mar 9, 2026
1.0.3: Mar 8, 2026
1.0.2 (yanked, reason: faulty build): Mar 8, 2026
1.0.1 (yanked, reason: faulty build): Mar 6, 2026
1.0.0 (yanked, reason: readme did not match pyproject.toml; also modified load_fraud() to use Zenodo instead of Kaggle): Mar 6, 2026

Download files

Source Distribution

tabullm-1.3.0.tar.gz (40.4 kB view details)

Uploaded May 21, 2026 Source

Built Distribution

tabullm-1.3.0-py3-none-any.whl (29.7 kB view details)

Uploaded May 21, 2026 Python 3

File details

Details for the file tabullm-1.3.0.tar.gz.

File metadata

Download URL: tabullm-1.3.0.tar.gz
Upload date: May 21, 2026
Size: 40.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tabullm-1.3.0.tar.gz
Algorithm	Hash digest
SHA256	`f9b7fa46b57ab91954122334afb1993713ecd33766e04dabc6e2a32fc980e0ff`
MD5	`a70f83822e65a56fcd1dd6a1f4dd53d2`
BLAKE2b-256	`3da7c491c4f4614678ca883c3813049c6597d747b21dcf8c988fc5fe9e903729`

Provenance

The following attestation bundles were made for tabullm-1.3.0.tar.gz:

Publisher: release.yml on asmahani/TabuLLM

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tabullm-1.3.0.tar.gz
- Subject digest: f9b7fa46b57ab91954122334afb1993713ecd33766e04dabc6e2a32fc980e0ff
- Sigstore transparency entry: 1589214139
- Sigstore integration time: May 21, 2026
Source repository:
- Permalink: asmahani/TabuLLM@3324109def5593bacf746ac8effc5d45841384b3
- Branch / Tag: refs/tags/v1.3.0
- Owner: https://github.com/asmahani
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@3324109def5593bacf746ac8effc5d45841384b3
- Trigger Event: push

File details

Details for the file tabullm-1.3.0-py3-none-any.whl.

File metadata

Download URL: tabullm-1.3.0-py3-none-any.whl
Upload date: May 21, 2026
Size: 29.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tabullm-1.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f2342f235a6ce2fbb08cb653fc035721bfc043e7929ec921a74905d78508d0be`
MD5	`4227b76a867cb9aca43ed0147f82c5ff`
BLAKE2b-256	`1fc6f947085afb298f4efe5a8a053e3858b09f8c463c7f12526a7d6f94b8cda2`

Provenance

The following attestation bundles were made for tabullm-1.3.0-py3-none-any.whl:

Publisher: release.yml on asmahani/TabuLLM

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tabullm-1.3.0-py3-none-any.whl
- Subject digest: f2342f235a6ce2fbb08cb653fc035721bfc043e7929ec921a74905d78508d0be
- Sigstore transparency entry: 1589215540
- Sigstore integration time: May 21, 2026
Source repository:
- Permalink: asmahani/TabuLLM@3324109def5593bacf746ac8effc5d45841384b3
- Branch / Tag: refs/tags/v1.3.0
- Owner: https://github.com/asmahani
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@3324109def5593bacf746ac8effc5d45841384b3
- Trigger Event: push
