Vocabulary separability diagnostics for text classification

poljacc

Companion package for Oh and Yu (2026), "When Sparse Beats Dense: Vocabulary Separability and Model Selection in Political Text Analysis."

Author: Yongjai Yu (yongjai.yu@email.ucr.edu)

Installation

pip install poljacc

For development:

git clone https://github.com/YongjaiYu/poljacc.git
cd poljacc
pip install -e .

Quick Start

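from poljacc import diagnose, compare

# texts and labels are placeholders: supply your own documents and
# one class label per document. The values in the comments below are
# example outputs and will differ on other data.

# Diagnose vocabulary separability
result = diagnose(texts, labels)
print(result.jaccard)           # e.g. 0.860
print(result.recommendation)    # e.g. "High vocabulary overlap..."
result.report()                 # formatted summary
result.plot()                   # show heatmap

# Run TF-IDF baseline
baseline = compare(texts, labels)
print(baseline.f1)              # e.g. 0.724
print(baseline.classification_report)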

API

diagnose(texts, labels, top_k=5000)

Compute vocabulary separability between classes:

  • Jaccard similarity: pairwise overlap of top-k vocabularies (ranked by document frequency)
  • Centroid distance: Euclidean distance between TF-IDF class centroids
  • Recommendation: model selection guidance based on overlap level
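
The Jaccard measure in the list above can be illustrated with a minimal standard-library sketch. This is not poljacc's implementation: the helper names `top_k_vocabulary` and `pairwise_jaccard`, the regex tokenizer, and the alphabetical tie-breaking are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
import re


def top_k_vocabulary(docs, k):
    """Return the k terms that appear in the most documents (hypothetical helper)."""
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document: document frequency, not term frequency.
        # The tokenizer below is an assumption, not the package's tokenizer.
        doc_freq.update(set(re.findall(r"[a-z0-9']+", doc.lower())))
    # Sort by descending document frequency, then alphabetically to break ties.
    ranked = sorted(doc_freq, key=lambda term: (-doc_freq[term], term))
    return set(ranked[:k])


def pairwise_jaccard(texts, labels, top_k=5000):
    """Map each pair of classes to the Jaccard similarity of their top-k vocabularies."""
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    vocab = {label: top_k_vocabulary(docs, top_k) for label, docs in by_class.items()}
    scores = {}
    for a, b in combinations(sorted(vocab), 2):
        union = vocab[a] | vocab[b]
        # Jaccard similarity: size of the intersection over size of the union.
        scores[(a, b)] = len(vocab[a] & vocab[b]) / len(union) if union else 0.0
    return scores


# Toy data for illustration only.
texts = [
    "the senate passed the budget bill",
    "the house debated the budget bill",
    "the team won the final match",
    "the coach praised the team",
]
labels = ["politics", "politics", "sports", "sports"]
scores = pairwise_jaccard(texts, labels)
print(scores)
```

In this toy corpus the two classes share only the word "the" out of 13 distinct terms, so the single pairwise score is 1/13, which indicates very low overlap.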

Returns a DiagnosticResult with .jaccard, .jaccard_matrix, .centroid_distance, .centroid_matrix, .labels, .recommendation, .report(), and .plot().
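
The centroid distance listed among the diagnostics can be sketched with scikit-learn and numpy, both of which the package depends on. This is a rough illustration under assumptions, not the package's code: the function name `centroid_distance` is hypothetical, and the vectorizer uses scikit-learn defaults, which may differ from the settings poljacc uses.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def centroid_distance(texts, labels, class_a, class_b):
    """Euclidean distance between the mean TF-IDF vectors of two classes."""
    # Fit one vectorizer on all documents so both classes share a feature space.
    # Default TfidfVectorizer settings are an assumption made for this sketch.
    X = TfidfVectorizer().fit_transform(texts)
    labels = np.asarray(labels)
    rows_a = np.flatnonzero(labels == class_a)
    rows_b = np.flatnonzero(labels == class_b)
    # The row-wise mean of a sparse matrix may come back 2-D; flatten it to 1-D.
    centroid_a = np.asarray(X[rows_a].mean(axis=0)).ravel()
    centroid_b = np.asarray(X[rows_b].mean(axis=0)).ravel()
    return float(np.linalg.norm(centroid_a - centroid_b))


# Toy data for illustration only.
texts = [
    "the senate passed the budget bill",
    "the house debated the budget bill",
    "the team won the final match",
    "the coach praised the team",
]
labels = ["politics", "politics", "sports", "sports"]
distance = centroid_distance(texts, labels, "politics", "sports")
print(distance)
```

A larger distance means the average documents of the two classes sit further apart in TF-IDF space. A class compared with itself gives a distance of zero.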

compare(texts, labels, test_size=0.2, random_state=1017)

One-click TF-IDF + LogisticRegression baseline:

  • TF-IDF: unigrams + bigrams, max 50k features, sublinear TF
  • Logistic Regression: C tuned via 5-fold CV over {0.01, 0.1, 1, 10, 100}

Returns a ComparisonResult with .f1, .accuracy, .classification_report, .best_C.
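
A baseline with the settings listed above can be approximated directly in scikit-learn. The sketch below is an assumption-laden approximation, not the package's code: the function name `tfidf_baseline`, the stratified split, the macro-averaged F1, and `max_iter=1000` are choices made here for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline


def tfidf_baseline(texts, labels, test_size=0.2, random_state=1017):
    """Hypothetical stand-in for a TF-IDF + logistic regression baseline."""
    # Stratifying the split is an assumption made for this sketch.
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=random_state, stratify=labels
    )
    pipeline = Pipeline([
        # Unigrams + bigrams, at most 50k features, sublinear TF scaling.
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=50_000, sublinear_tf=True)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    # Tune C with 5-fold cross-validation over the documented grid.
    # Macro F1 as the selection metric is an assumption.
    search = GridSearchCV(pipeline, {"clf__C": [0.01, 0.1, 1, 10, 100]}, cv=5, scoring="f1_macro")
    search.fit(X_train, y_train)
    predictions = search.predict(X_test)
    return {
        "best_C": search.best_params_["clf__C"],
        "f1": f1_score(y_test, predictions, average="macro"),
    }


# Toy data for illustration only: two classes with disjoint vocabularies.
texts = [f"senate vote budget bill {i}" for i in range(20)]
texts += [f"team match goal coach {i}" for i in range(20)]
labels = ["politics"] * 20 + ["sports"] * 20
result = tfidf_baseline(texts, labels)
print(result)
```

Wrapping the vectorizer and classifier in one `Pipeline` means the TF-IDF vocabulary is refit inside each cross-validation fold, so the held-out fold does not leak into feature extraction.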

Recommendation Thresholds

Jaccard range      Interpretation      Recommendation
> 0.7              High overlap        TF-IDF recommended
0.4 < J <= 0.7     Moderate overlap    Consider both
<= 0.4             Low overlap         Neural models may outperform
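
The thresholds in the table above can be expressed as a small function. The name `recommend` and the returned strings are hypothetical. The package's own `.recommendation` text is worded differently, and this sketch assumes a score of exactly 0.4 counts as low overlap and exactly 0.7 as moderate.

```python
def recommend(jaccard):
    """Map a Jaccard similarity in [0, 1] to the guidance in the thresholds table."""
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("Jaccard similarity must lie in [0, 1]")
    if jaccard > 0.7:
        return "High overlap: TF-IDF recommended"
    if jaccard > 0.4:
        return "Moderate overlap: consider both"
    return "Low overlap: neural models may outperform"


print(recommend(0.86))
```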

Dependencies

numpy, scikit-learn, matplotlib, pandas, scipy
