poljacc
Vocabulary separability diagnostics for text classification.
Companion package for Oh and Yu (2026), "When Sparse Beats Dense: Vocabulary Separability and Model Selection in Political Text Analysis."
Author: Yongjai Yu (yongjai.yu@email.ucr.edu)
Installation
```bash
pip install poljacc
```
For development:
```bash
git clone https://github.com/YongjaiYu/poljacc.git
cd poljacc
pip install -e .
```
Quick Start
```python
from poljacc import diagnose, compare

# texts and labels are assumed to be parallel sequences: one class label per document.

# Diagnose vocabulary separability
result = diagnose(texts, labels)
print(result.jaccard)         # 0.860
print(result.recommendation)  # "High vocabulary overlap..."
result.report()               # formatted summary
result.plot()                 # show heatmap

# Run TF-IDF baseline
baseline = compare(texts, labels)
print(baseline.f1)            # 0.724
print(baseline.classification_report)
```
API
diagnose(texts, labels, top_k=5000)
Compute vocabulary separability between classes:
- Jaccard similarity: pairwise overlap of top-k vocabularies (ranked by document frequency)
- Centroid distance: Euclidean distance between TF-IDF class centroids
- Recommendation: model selection guidance based on overlap level
Returns a DiagnosticResult with .jaccard, .jaccard_matrix, .centroid_distance, .centroid_matrix, .labels, .recommendation, .report(), and .plot().
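The Jaccard part of this diagnostic can be illustrated with a minimal standard-library sketch. It is not poljacc's implementation: the helper names `top_k_vocabulary` and `pairwise_jaccard` are hypothetical, and whitespace tokenization with lowercasing is an assumption, since the tokenizer the package uses is not documented here. Centroid distance is not covered.

```python
from collections import Counter
from itertools import combinations


def top_k_vocabulary(docs, k):
    """Return the k terms with the highest document frequency in docs."""
    df = Counter()
    for doc in docs:
        # set() so each term counts once per document (document frequency).
        # Whitespace tokenization is an assumption made for this sketch.
        df.update(set(doc.lower().split()))
    # Sort by descending frequency, then alphabetically, so ties break deterministically.
    ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    return {term for term, _ in ranked[:k]}


def pairwise_jaccard(texts, labels, top_k=5000):
    """Jaccard similarity between the top-k vocabularies of each pair of classes."""
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    vocab = {label: top_k_vocabulary(docs, top_k) for label, docs in by_class.items()}
    scores = {}
    for a, b in combinations(sorted(vocab), 2):
        union = vocab[a] | vocab[b]
        scores[(a, b)] = len(vocab[a] & vocab[b]) / len(union) if union else 0.0
    return scores


# Toy data, invented for illustration only.
texts = [
    "cut taxes and reduce spending",
    "reduce taxes for small business",
    "expand health care and raise spending",
    "raise wages and expand health care",
]
labels = ["right", "right", "left", "left"]
scores = pairwise_jaccard(texts, labels)
print(scores)
```

In this toy corpus the two classes share only "and" and "spending" out of 13 distinct terms, so the single pairwise score is 2/13. A value near 1 would instead mean the classes draw on nearly the same vocabulary.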
compare(texts, labels, test_size=0.2, random_state=1017)
One-click TF-IDF + LogisticRegression baseline:
- TF-IDF: unigrams + bigrams, max 50k features, sublinear TF
- Logistic Regression: C tuned via 5-fold CV over {0.01, 0.1, 1, 10, 100}
Returns a ComparisonResult with .f1, .accuracy, .classification_report, .best_C.
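A baseline with these settings can be approximated directly in scikit-learn. The sketch below is an assumption-laden reconstruction, not the package's code: the function name `tfidf_baseline` is hypothetical, and the stratified split, macro-averaged F1, and `max_iter=1000` are choices made here that the description above does not specify.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline


def tfidf_baseline(texts, labels, test_size=0.2, random_state=1017):
    """Fit a TF-IDF + logistic regression pipeline and score it on a held-out split."""
    # Stratifying the split is an assumption of this sketch.
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=random_state, stratify=labels
    )
    pipeline = Pipeline([
        # Unigrams + bigrams, at most 50k features, sublinear term frequency.
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=50_000, sublinear_tf=True)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    # 5-fold cross-validation over the C grid; macro F1 as the selection metric is assumed.
    search = GridSearchCV(
        pipeline, {"clf__C": [0.01, 0.1, 1, 10, 100]}, cv=5, scoring="f1_macro"
    )
    search.fit(X_train, y_train)
    predictions = search.predict(X_test)
    return {
        "f1": f1_score(y_test, predictions, average="macro"),
        "accuracy": accuracy_score(y_test, predictions),
        "best_C": search.best_params_["clf__C"],
    }


# Toy data, invented for illustration only.
texts = [f"cut taxes reduce spending bill {i}" for i in range(20)] + [
    f"expand health care raise wages bill {i}" for i in range(20)
]
labels = ["right"] * 20 + ["left"] * 20
baseline = tfidf_baseline(texts, labels)
print(baseline)
```

Putting the vectorizer inside the pipeline means each cross-validation fold fits TF-IDF on its own training portion, which avoids leaking held-out vocabulary statistics into the choice of C.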
Recommendation Thresholds
| Jaccard Range | Interpretation | Recommendation |
|---|---|---|
| > 0.7 | High overlap | TF-IDF recommended |
| > 0.4 and <= 0.7 | Moderate overlap | Consider both |
| <= 0.4 | Low overlap | Neural models may outperform |
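The table can be read as a simple threshold function. The following sketch uses a hypothetical helper named `recommend`; it mirrors the rows above, treating 0.7 as moderate and 0.4 as low, and it returns the table's short labels rather than the longer strings that `result.recommendation` produces.

```python
def recommend(jaccard):
    """Map a Jaccard score to the (interpretation, recommendation) pair from the table."""
    if jaccard > 0.7:
        return "High overlap", "TF-IDF recommended"
    if jaccard > 0.4:
        return "Moderate overlap", "Consider both"
    return "Low overlap", "Neural models may outperform"


# The Quick Start example reports a Jaccard of 0.860, which falls in the top row.
interpretation, advice = recommend(0.860)
print(interpretation, "->", advice)
```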
Dependencies
numpy, scikit-learn, matplotlib, pandas, scipy
Download files
File details
Details for the file poljacc-0.1.1.tar.gz.
File metadata
- Download URL: poljacc-0.1.1.tar.gz
- Upload date:
- Size: 9.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bc9017e39cb93872a532888ed3152e730bf62affee712343a2d8cb00c7f85bcf |
| MD5 | b6411416ea1296de0173a614fc7718fb |
| BLAKE2b-256 | 1bd35c8d01dd24501dde995f29590673fb655f4ef94f0aa94b53a11e6a656e8d |
File details
Details for the file poljacc-0.1.1-py3-none-any.whl.
File metadata
- Download URL: poljacc-0.1.1-py3-none-any.whl
- Upload date:
- Size: 11.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 80da7967faf4fb637f8058be72911102a749a1ba69ee34e6981e29617ddccde5 |
| MD5 | 59a2ad0995c45028046c4c5f8d3294ef |
| BLAKE2b-256 | 503b6f8efda63d81f81b1f5d284cc1058ce950490f303d86b66f315fbc0acc3f |