Fast spaCy-based column classifier with optional LLM refinement.
tabular-column-classifier
Classify tabular columns with spaCy’s fast model, optionally assisted by an LLM. The library keeps a predictable output format so you can plug it straight into data-quality pipelines or catalog tooling.
Features
- Uses the lightweight en_core_web_sm spaCy model for fast entity detection on column samples.
- Optional LLM refinement layer with configurable host, headers, and model (tested with Ollama).
- Works with single columns or batches of DataFrames and keeps the output {classification, probabilities} contract.
- Word-count safeguard avoids labelling free-text columns as entities.
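The word-count safeguard can be pictured with a small sketch. This is not the library's code: the helper name `looks_like_free_text`, the whitespace tokenisation, and the strict "greater than" comparison are assumptions for illustration. The default threshold mirrors the `word_threshold=10` value listed under API highlights.

```python
from statistics import mean


def looks_like_free_text(values, word_threshold=10):
    """Return True when the average cell is too wordy to be a named entity.

    Hypothetical helper: the real library may tokenise or threshold differently.
    """
    # Count whitespace-separated words per non-empty cell.
    counts = [len(str(v).split()) for v in values if str(v).strip()]
    if not counts:
        return False
    return mean(counts) > word_threshold


directors = ["Christopher Nolan", "The Wachowskis", "Christopher Nolan"]
reviews = [
    "A mind-bending heist thriller that rewards repeat viewings "
    "and careful attention to every detail"
]

print(looks_like_free_text(directors))  # False (two words per cell on average)
print(looks_like_free_text(reviews))    # True
```

Short names stay eligible for entity labels, while sentence-length cells are screened out before entity detection would be trusted.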
Installation
python3 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install tabular-column-classifier
python -m spacy download en_core_web_sm
Optional extras for LLM support inside the same virtual environment:
python -m pip install "tabular-column-classifier[llm]" # quotes required on zsh
Working from a local checkout (before the project is on PyPI)? Install with:
python -m pip install ".[llm]" # quotes required on zsh
Tip: Using python -m pip ensures the package installs into the interpreter of your active virtual environment.
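To confirm which interpreter the install will target, a short check from inside Python can help. The sketch relies only on the standard library; the helper name `in_virtualenv` is illustrative.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # In a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```

If the printed interpreter path is not inside .venv, activate the environment again before installing.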
Quick start
import pandas as pd
from column_classifier import ColumnClassifier
movies = pd.DataFrame(
    {
        "title": ["Inception", "The Matrix", "Interstellar"],
        "director": ["Christopher Nolan", "The Wachowskis", "Christopher Nolan"],
        "release_year": [2010, 1999, 2014],
    }
)
companies = pd.DataFrame(
    {
        "company": ["Google", "Microsoft", "Apple"],
        "hq": ["California", "Washington", "California"],
        "founded": [1998, 1975, 1976],
    }
)
classifier = ColumnClassifier(sample_size=25)
table_result = classifier.classify_table(movies, table_name="movies")
print(table_result["columns"]["director"]["classification"])
# PERSON
more_results = classifier.classify_multiple_tables([movies, companies])
print(more_results[1]["columns"]["founded"]["classification"])
# NUMBER
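Because every table result shares the same shape, a flat column-to-label summary takes only a few lines. The sketch below works on a hand-written dictionary that mimics the structure used in the quick start (a `columns` mapping whose entries carry a `classification` key). The helper `summarize_table` and the sample values are illustrative, not part of the library or real output.

```python
def summarize_table(table_result):
    """Map each column name to its predicted label."""
    return {
        name: info["classification"]
        for name, info in table_result["columns"].items()
    }


# Hand-written stand-in for a classify_table() result (illustrative values only).
sample_result = {
    "columns": {
        "director": {
            "classification": "PERSON",
            "probabilities": {"PERSON": 0.67, "STRING": 0.33},
        },
        "release_year": {
            "classification": "NUMBER",
            "probabilities": {"NUMBER": 1.0},
        },
    }
}

print(summarize_table(sample_result))
# {'director': 'PERSON', 'release_year': 'NUMBER'}
```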
Optional LLM refinement
Add an LLM to overrule spaCy when it is confident enough. The host and headers are passed down to ollama.Client, so you can point to a gateway that requires authentication.
from column_classifier import ColumnClassifier
llm_config = {
    "enabled": True,
    "model": "gemma3",
    "host": "https://ollama.your-company.com",
    "headers": {"Authorization": "Basic QWxhZGRpbjpPcGVuU2VzYW1l"},  # Basic <base64>
    "max_samples": 8,  # optional: limit rows sent to the LLM
    "max_retries": 2,  # optional: retry if the model ignores the JSON contract
    "retry_delay": 0.5,  # seconds to wait between retries
}
classifier = ColumnClassifier(llm_config=llm_config, llm_weight=0.7)
table_result = classifier.classify_table(movies, table_name="movies")
print(table_result["columns"]["director"])
{'classification': 'PERSON',
'probabilities': {'PERSON': 0.67, 'STRING': 0.33},
'sources': {
'spacy': {'probabilities': {'PERSON': 0.67, 'STRING': 0.33},
'avg_word_count': 2.0},
'llm': {'source': 'llm',
'classification': 'PERSON',
'probabilities': {'PERSON': 0.67, 'STRING': 0.33},
'attempt': 1}
}}
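One way to picture how `llm_weight` could combine the two sources is a weighted average of their probability maps. This is an assumption made for illustration: the README does not spell out the library's merging rule, and `blend_probabilities` is a hypothetical helper.

```python
def blend_probabilities(spacy_probs, llm_probs, llm_weight=0.5):
    """Weighted average of two label->probability maps (hypothetical merging rule)."""
    labels = set(spacy_probs) | set(llm_probs)
    blended = {
        label: (1 - llm_weight) * spacy_probs.get(label, 0.0)
        + llm_weight * llm_probs.get(label, 0.0)
        for label in labels
    }
    total = sum(blended.values())
    # Renormalise so the result sums to 1 even if the inputs did not.
    return {label: p / total for label, p in blended.items()} if total else blended


spacy_probs = {"PERSON": 0.6, "STRING": 0.4}
llm_probs = {"PERSON": 1.0}
blended = blend_probabilities(spacy_probs, llm_probs, llm_weight=0.7)
best = max(blended, key=blended.get)
print(best, round(blended[best], 2))  # PERSON 0.88
```

Under this reading, a higher `llm_weight` lets the LLM's opinion dominate, and a weight of 0 reproduces the spaCy-only probabilities.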
API highlights
- ColumnClassifier(sample_size=50, classification_threshold=0.5, word_threshold=10, llm_config=None, llm_weight=0.5)
- classify_column(column_data: pd.Series, column_name: str = "column") -> dict
- classify_table(table: pd.DataFrame, table_name: str = "table") -> dict
- classify_multiple_tables(tables: list[pd.DataFrame]) -> list[dict]
Output schema
Every classified column yields:
{
"classification": "PERSON",
"probabilities": {"PERSON": 0.82, "STRING": 0.18},
"sources": {
"spacy": {
"probabilities": {"PERSON": 0.82, "STRING": 0.18},
"avg_word_count": 2.0
},
"llm": {
"source": "llm|heuristic|fallback",
"classification": "PERSON",
"probabilities": {"PERSON": 0.9},
"attempt": 1
}
}
}
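Downstream code can guard against malformed results with a small check of the contract above. The validator below is a sketch: `validate_column_result` is a hypothetical helper, and the rule that the winning label must appear in `probabilities` is an assumption drawn from the examples, not a documented guarantee.

```python
def validate_column_result(result):
    """Return a list of problems found in a column result (empty when it looks valid)."""
    problems = []
    label = result.get("classification")
    probs = result.get("probabilities")
    if not isinstance(label, str):
        problems.append("classification must be a string")
    if not isinstance(probs, dict):
        problems.append("probabilities must be a dict")
    else:
        if any(not 0.0 <= p <= 1.0 for p in probs.values()):
            problems.append("probabilities must lie in [0, 1]")
        # Assumption: the chosen label is one of the scored labels.
        if isinstance(label, str) and label not in probs:
            problems.append("classification missing from probabilities")
    return problems


good = {"classification": "PERSON", "probabilities": {"PERSON": 0.82, "STRING": 0.18}}
bad = {"classification": "PERSON", "probabilities": {"STRING": 1.2}}

print(validate_column_result(good))  # []
print(validate_column_result(bad))   # two problems: out-of-range value, missing label
```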
Tip: When targeting gateways that require HTTP basic authentication, encode your username:password pair with Base64 and pass it verbatim via headers={"Authorization": "Basic <encoded>"}. The headers dictionary is forwarded unchanged to ollama.Client.
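The header value can be built with the standard library alone. In this sketch the helper name `basic_auth_header` is illustrative, and the placeholder credentials are the ones whose encoding matches the header shown in the LLM configuration example.

```python
import base64


def basic_auth_header(username, password):
    """Build an HTTP Basic Authorization header from a username and password."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


print(basic_auth_header("Aladdin", "OpenSesame"))
# {'Authorization': 'Basic QWxhZGRpbjpPcGVuU2VzYW1l'}
```

The returned dictionary can be used directly as the "headers" entry of llm_config.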
If the LLM returns something that cannot be parsed, the classifier will retry with stronger formatting instructions and ultimately fall back to the spaCy-only prediction to keep results deterministic.
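The retry-then-fallback behaviour described above can be sketched as a small loop. Everything here is an assumption for illustration: `classify_with_retries`, the `ask_llm` callable, the prompt wording, and the reading of `max_retries` as extra attempts after the first are not taken from the library's code.

```python
import json
import time


def classify_with_retries(ask_llm, spacy_result, max_retries=2, retry_delay=0.5):
    """Try the LLM up to 1 + max_retries times, then fall back to the spaCy result."""
    prompt = "Classify this column. Reply with JSON."
    for attempt in range(1, max_retries + 2):
        raw = ask_llm(prompt)
        try:
            parsed = json.loads(raw)
            if isinstance(parsed, dict) and "classification" in parsed:
                return {"source": "llm", "attempt": attempt, **parsed}
        except json.JSONDecodeError:
            pass
        # Tighten the instructions before the next attempt.
        prompt = "Reply with ONLY a JSON object containing a 'classification' key."
        if attempt <= max_retries:
            time.sleep(retry_delay)
    return {"source": "fallback", **spacy_result}


# Fake model: first reply is chatty prose, second follows the JSON contract.
replies = iter(["Sure! It is a person.", '{"classification": "PERSON"}'])
result = classify_with_retries(
    lambda prompt: next(replies), {"classification": "STRING"}, retry_delay=0
)
print(result)  # {'source': 'llm', 'attempt': 2, 'classification': 'PERSON'}
```

When every attempt fails to parse, the function returns the spaCy prediction tagged with a "fallback" source, so callers always receive a usable result.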
Publishing to PyPI
This project ships with a setup.py configured for PyPI. Build and publish with:
python -m build
twine upload dist/*
Remember to bump the version and clean dist/ between releases.
License
Apache License 2.0.