
Seamless Feature Extraction and Interpretation of Text Columns in Tabular Data Using Large Language Models


TabuLLM

Python package for feature extraction and interpretation of text columns in tabular data using large language models.

Overview

TabuLLM integrates LLM-based text embeddings into scikit-learn pipelines for tabular datasets containing text columns. Built on LangChain, it provides sklearn-compatible transformers for embedding, dimensionality reduction, and cluster interpretation.

Installation

pip install tabullm

Core Components

TextColumnTransformer - Wraps LangChain embedding models (OpenAI, Cohere, HuggingFace, etc.) with a scikit-learn interface. Handles multiple text columns with configurable concatenation and optional L2 normalization (normalize=True). Use estimate_tokens() to preview API cost before embedding.
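
A minimal sketch of the cost-preview workflow (TextColumnTransformer, colnames, normalize, and estimate_tokens() are named above; whether estimate_tokens() takes the DataFrame directly, and its return format, are assumptions):

import pandas as pd
from tabullm import TextColumnTransformer
from langchain_openai import OpenAIEmbeddings

df = pd.DataFrame({'title': ['Data Engineer'],
                   'description': ['Build and maintain ETL pipelines.']})

transformer = TextColumnTransformer(
    model=OpenAIEmbeddings(model='text-embedding-3-small'),
    colnames={'title': 'Title', 'description': 'Description'},
    normalize=True  # L2-normalize the returned embedding vectors
)
# Preview token usage (and hence API cost) before any embedding calls
print(transformer.estimate_tokens(df))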

GMMFeatureExtractor - Extends sklearn's GaussianMixture with a transform() method that returns per-cluster log-joint features $\log p(\mathbf{x}, c_k)$, the quantity the GMM maximizes for hard assignment, enabling use in sklearn pipelines. An optional include_log_density parameter appends the marginal log-density $\log p(\mathbf{x})$ as an explicit outlier score. A companion assignment_confidence_stats() method returns per-observation cluster-quality diagnostics (max_posterior, entropy, log_joint_margin, log_density).
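
The log-joint features have a simple decomposition, $\log p(\mathbf{x}, c_k) = \log p(c_k \mid \mathbf{x}) + \log p(\mathbf{x})$, which any fitted sklearn GaussianMixture exposes. A minimal sketch of that math (illustrating what transform() returns, not the package internals):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))  # stand-in for an embedding matrix

gmm = GaussianMixture(n_components=10, random_state=0).fit(X)
log_density = gmm.score_samples(X)                     # log p(x): marginal log-density
log_posterior = np.log(gmm.predict_proba(X) + 1e-300)  # log p(c_k | x): posterior
log_joint = log_posterior + log_density[:, None]       # log p(x, c_k)
# Hard assignment (gmm.predict) picks the argmax over these columns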

SphericalKMeans - K-means clustering with cosine distance for L2-normalized embeddings. For unit-norm vectors, squared Euclidean distance is a monotone function of cosine similarity ($\lVert \mathbf{x} - \mathbf{y} \rVert^2 = 2(1 - \cos(\mathbf{x}, \mathbf{y}))$), which is why this is mathematically equivalent to running sklearn's KMeans on normalized embeddings. Available as an alternative hard-clustering option when GMM-based features are not needed.
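
Given that identity, a cosine-clustering baseline can be sketched with stock sklearn pieces (a sketch under the unit-norm assumption, not the package's implementation):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))   # stand-in for an embedding matrix
Xn = normalize(X)               # project rows onto the unit sphere (L2 norm = 1)
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(Xn)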

ClusterExplainer - Generates natural language cluster descriptions using LLMs with automatic recursive summarization that scales to arbitrarily large datasets. Supports:

  • Cost preview (preview=True) before LLM calls
  • Optional outcome-based statistical testing (y) to characterize which clusters associate with a target variable
  • Per-observation covariates (observation_stats) — e.g., from assignment_confidence_stats() — appended to the association table
  • A synthesis step (synthesize=True) that produces a coherent interpretive narrative across all cluster results
  • An outcome label (y_label) used only in the synthesis prompt; cluster descriptions are generated without knowledge of y (blind labeling principle)

load_fraud() - Data utility that downloads and caches the fraud detection dataset from Zenodo (no credentials required), returning features, labels, and metadata.
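
A hedged usage sketch (the loader's three return values follow the description above; the exact variable roles are assumptions):

from tabullm import load_fraud

# Downloads from Zenodo on first call, then reads from the local cache
X, y, metadata = load_fraud()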

Quick Example

from tabullm import TextColumnTransformer, GMMFeatureExtractor, ClusterExplainer
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# df / df_new: pandas DataFrames with text columns 'title' and 'description'; y: labels

# Embed text columns
embedding_model = HuggingFaceEmbeddings(
    model_name='sentence-transformers/all-MiniLM-L6-v2'
)
text_transformer = TextColumnTransformer(
    model=embedding_model,
    colnames={'title': 'Title', 'description': 'Description'}
)

# Build pipeline: Embed → Reduce → Classify
pipeline = Pipeline([
    ('embed', text_transformer),
    ('reduce', GMMFeatureExtractor(n_components=10)),
    ('classify', RandomForestClassifier(n_estimators=100))
])

# Fit and predict
pipeline.fit(df[['title', 'description']], y)
predictions = pipeline.predict(df_new[['title', 'description']])

# Interpret clusters
explainer = ClusterExplainer(
    llm=ChatOpenAI(model='gpt-4o-mini'),
    text_transformer=text_transformer,
    observations='job postings',
    text_fields='titles and descriptions'
)

# Recover embeddings and hard cluster assignments from the fitted pipeline
embeddings = pipeline.named_steps['embed'].transform(df[['title', 'description']])
gmm = pipeline.named_steps['reduce']
cluster_labels = gmm.predict(embeddings)

# Cluster descriptions only
result_df = explainer.explain(df, cluster_labels)

# With outcome association + synthesis narrative
result_df, global_stats, synthesis = explainer.explain(
    df, cluster_labels,
    y=y,
    y_label='fraudulent posting (1=fraud, 0=legitimate)',
    synthesize=True
)

# Include GMM cluster quality diagnostics in the association table
obs_stats = gmm.assignment_confidence_stats(embeddings)
result_df, global_stats, stat_assoc_df, synthesis = explainer.explain(
    df, cluster_labels,
    y=y,
    y_label='fraudulent posting (1=fraud, 0=legitimate)',
    observation_stats=obs_stats,
    synthesize=True
)
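
As noted under ClusterExplainer, a cost preview can be requested before committing to any of the calls above (the return value's exact shape is an assumption):

# Dry run: estimate LLM token usage and cost; no LLM calls are made
cost_preview = explainer.explain(df, cluster_labels, preview=True)
print(cost_preview)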

Key Features

  • sklearn-compatible API (Pipeline, ColumnTransformer, GridSearchCV)
  • Access to 50+ embedding models via LangChain
  • Multi-column text handling with flexible concatenation
  • Optional L2 normalization of embedding vectors
  • Token and cost estimation before embedding API calls
  • GMM-based dimensionality reduction with per-cluster log-joint features
  • Optional marginal log-density feature for explicit outlier scoring
  • Per-observation cluster quality diagnostics (max posterior, entropy, log-joint margin, log density)
  • Automatic recursive summarization for arbitrarily large datasets
  • Cost estimation for LLM explanation calls
  • Outcome-based cluster characterization (binary and continuous outcomes)
  • User-supplied per-observation covariates in the association table
  • Synthesis narrative connecting cluster descriptions to outcome patterns
  • Blind labeling: cluster descriptions generated without knowledge of outcome vector

Release Notes

1.0.3 — Fixed broken package installation (1.0.2 wheel was published without Python source files).

1.0.2 — Fixed __version__ mismatch; aligned __init__.py with pyproject.toml.

1.0.1 — Switched fraud dataset download from Kaggle to Zenodo (no credentials required).

1.0.0 — Initial release.

Citation

Sharabiani, M.T.A., Mahani, A.S., Bottle, A. et al. (2025). GenAI exceeds clinical experts in predicting acute kidney injury following paediatric cardiopulmonary bypass. Scientific Reports, 15, 20847. https://doi.org/10.1038/s41598-025-04651-8
