Skip to main content

Fast spaCy-based column classifier with optional LLM refinement.

Project description

tabular-column-classifier

Classify tabular columns with spaCy’s fast model, optionally assisted by an LLM. The library keeps a predictable output format so you can plug it straight into data-quality pipelines or catalog tooling.

Features

  • Uses the lightweight en_core_web_sm spaCy model for fast entity detection on column samples.
  • Optional LLM refinement layer with configurable host, headers, and model (tested with Ollama).
  • Works with single columns or batches of DataFrames and keeps the output {classification, probabilities} contract.
  • Word-count safeguard avoids labelling free-text columns as entities.

Installation

python3 -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install tabular-column-classifier
python -m spacy download en_core_web_sm

Optional extras for LLM support inside the same virtual environment:

python -m pip install tabular-column-classifier[llm]

Working from a local checkout (before the project is on PyPI)? Install with:

python -m pip install ".[llm]"  # quotes required on zsh

Tip: Using python -m pip ensures the package installs into the interpreter of your active virtual environment.

Quick start

import pandas as pd
from column_classifier import ColumnClassifier

movies = pd.DataFrame(
    {
        "title": ["Inception", "The Matrix", "Interstellar"],
        "director": ["Christopher Nolan", "The Wachowskis", "Christopher Nolan"],
        "release_year": [2010, 1999, 2014],
    }
)

companies = pd.DataFrame(
    {
        "company": ["Google", "Microsoft", "Apple"],
        "hq": ["California", "Washington", "California"],
        "founded": [1998, 1975, 1976],
    }
)

classifier = ColumnClassifier(sample_size=25)
table_result = classifier.classify_table(movies, table_name="movies")
print(table_result["columns"]["director"]["classification"])
# PERSON

more_results = classifier.classify_multiple_tables([movies, companies])
print(more_results[1]["columns"]["founded"]["classification"])
# NUMBER

Optional LLM refinement

Add an LLM to overrule spaCy when it is confident enough. The host and headers are passed down to ollama.Client, so you can point to a gateway that requires authentication.

from column_classifier import ColumnClassifier

llm_config = {
    "enabled": True,
    "model": "gemma3",
    "host": "https://ollama.your-company.com",
    "headers": {"Authorization": "Basic QWxhZGRpbjpPcGVuU2VzYW1l"},  # Basic <base64>
    "max_samples": 8,  # optional: limit rows sent to the LLM
    "max_retries": 2,  # optional: retry if the model ignores the JSON contract
    "retry_delay": 0.5,  # seconds to wait between retries
}

classifier = ColumnClassifier(llm_config=llm_config, llm_weight=0.7)
table_result = classifier.classify_table(movies, table_name="movies")
print(table_result["columns"]["director"])
{'classification': 'PERSON',
 'probabilities': {'PERSON': 0.67, 'STRING': 0.33},
 'sources': {
     'spacy': {'probabilities': {'PERSON': 0.67, 'STRING': 0.33},
               'avg_word_count': 2.0},
     'llm': {'source': 'llm',
             'classification': 'PERSON',
             'probabilities': {'PERSON': 0.67, 'STRING': 0.33},
             'attempt': 1}
 }}

API highlights

  • ColumnClassifier(sample_size=50, classification_threshold=0.5, word_threshold=10, llm_config=None, llm_weight=0.5)
  • classify_column(column_data: pd.Series, column_name: str = "column") -> dict
  • classify_table(table: pd.DataFrame, table_name: str = "table") -> dict
  • classify_multiple_tables(tables: list[pd.DataFrame]) -> list[dict]

Output schema

Every classified column yields:

{
  "classification": "PERSON",
  "probabilities": {"PERSON": 0.82, "STRING": 0.18},
  "sources": {
    "spacy": {
      "probabilities": {"PERSON": 0.82, "STRING": 0.18},
      "avg_word_count": 2.0
    },
    "llm": {
      "source": "llm|heuristic|fallback",
      "classification": "PERSON",
      "probabilities": {"PERSON": 0.9},
      "attempt": 1
    }
  }
}

Tip: When targeting gateways that require HTTP basic authentication, encode your username:password pair with Base64 and pass it verbatim via headers={"Authorization": "Basic <encoded>"}. The headers dictionary is forwarded unchanged to ollama.Client.

If the LLM returns something that cannot be parsed, the classifier will retry with stronger formatting instructions and ultimately fall back to the spaCy-only prediction to keep results deterministic.

Publishing to PyPI

This project ships with a setup.py configured for PyPI. Build and publish with:

python -m build
twine upload dist/*

Remember to bump the version and clean dist/ between releases.

License

Apache License 2.0.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tabular_column_classifier-0.3.0.tar.gz (16.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

tabular_column_classifier-0.3.0-py3-none-any.whl (13.3 kB view details)

Uploaded Python 3

File details

Details for the file tabular_column_classifier-0.3.0.tar.gz.

File metadata

File hashes

Hashes for tabular_column_classifier-0.3.0.tar.gz
Algorithm Hash digest
SHA256 73bc6b4ac1e56692eb06247f5c4b95b9b35edf224110c5def91c13b69351a876
MD5 9c52a35a1d424b5f11615124236e8019
BLAKE2b-256 beed9adf32c1a344c716bf0ab740ba0a9bf92315bcc28f955248e281049dc083

See more details on using hashes here.

File details

Details for the file tabular_column_classifier-0.3.0-py3-none-any.whl.

File metadata

File hashes

Hashes for tabular_column_classifier-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 2e5f221a414d7a6fae052647e40892965ac1eb87130403bf315d7915a01b1f1b
MD5 937d90d119483a4088997ed1439780f5
BLAKE2b-256 3dc00b51b6ff10958b435a193e3d3dcb2ea918ed5d17fda80d7df5660e0a174b

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page