
tfidf-zones


CLI tool that classifies terms in text documents into three zones based on TF-IDF and document frequency:

  • Too Common — high document frequency (df > 0.2N)
  • Goldilocks — high TF-IDF score within a moderate DF band (3 ≤ df ≤ 0.2N, tfidf ≥ Q95)
  • Too Rare — low document frequency (df < 3)
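
The bucketing above can be sketched in a few lines of Python. This is a hedged illustration, not the tool's actual API: the function name, argument names, and the precomputed `q95` cutoff are all placeholders.

```python
def classify_terms(df, tfidf, n_docs, q95):
    """Bucket terms into zones by document frequency and TF-IDF.

    df     -- term -> number of documents containing the term
    tfidf  -- term -> TF-IDF score
    n_docs -- total number of documents (N)
    q95    -- 95th-percentile TF-IDF cutoff
    """
    zones = {"too_common": [], "goldilocks": [], "too_rare": []}
    for term, d in df.items():
        if d > 0.2 * n_docs:
            zones["too_common"].append(term)
        elif d < 3:
            zones["too_rare"].append(term)
        elif tfidf[term] >= q95:
            zones["goldilocks"].append(term)
    return zones
```

Note that a term in the moderate DF band but below the Q95 cutoff lands in no zone: per the definitions above, Goldilocks requires both conditions.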

Useful for stylometric analysis, authorship attribution, and understanding term importance.

Install

poetry install

Usage

# Analyze a single file
poetry run tfidf-zones --file novel.txt --output results.csv

# Analyze with bigrams
poetry run tfidf-zones --file novel.txt --ngram 2 --output results.csv

# Analyze a directory of .txt files
poetry run tfidf-zones --dir ./texts/ --output results.csv

# Show top 25 terms per zone with custom chunk size
poetry run tfidf-zones --file novel.txt --top-k 25 --chunk-size 500 --output results.csv

Recipes

Find content-word bigrams across a corpus:

poetry run tfidf-zones \
  --dir ./texts/ --limit 100 --output results.csv \
  --no-chunk --wordnet --ngram 2 --no-ngram-stopwords

Combines --wordnet (keep only recognized English words), --ngram 2 (bigrams), and --no-ngram-stopwords (discard bigrams containing stop/function words like "of_the") to surface meaningful two-word terms.

Find content phrases (trigrams and above):

poetry run tfidf-zones \
  --dir ./texts/ --output results.csv \
  --no-chunk --wordnet --ngram 3 --no-ngram-stopwords

Increase --ngram to 3, 4, or 5 to find longer phrases. The stopword filter removes any n-gram where at least one token is a stop word or function word, so only content-rich phrases survive.
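
The filter described here can be sketched as follows. The stopword set is a small illustrative subset and the function names are hypothetical, not the tool's internals:

```python
STOPWORDS = {"of", "the", "a", "an", "in", "on", "and", "to", "is"}  # illustrative subset

def ngrams(tokens, n):
    """Generate n-grams joined with underscores, e.g. 'white_whale'."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def content_ngrams(tokens, n):
    """Keep only n-grams in which no token is a stop word."""
    return [g for g in ngrams(tokens, n)
            if not any(t in STOPWORDS for t in g.split("_"))]
```

For the token stream "the white whale of the sea", only "white_whale" survives at n = 2; every other bigram contains at least one stop word.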

Corpus analysis with post-processing filters:

poetry run tfidf-zones \
  --dir ./texts/ --output results.csv \
  --no-chunk --wordnet --min-df 2 --min-tf 2

Use --min-df and --min-tf to remove terms that appear in too few documents or have too few total occurrences, reducing noise from hapax legomena.
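
The effect of the two floors can be sketched as a single set comprehension over corpus statistics. The data structures here are illustrative (the tool's internals may differ); the defaults mirror the recipe above:

```python
def apply_frequency_filters(term_docs, term_counts, min_df=2, min_tf=2):
    """Drop terms whose document frequency (number of documents seen in)
    or term frequency (total occurrences) falls below the floor.

    term_docs   -- term -> set of document ids containing the term
    term_counts -- term -> total occurrences across the corpus
    """
    return {t for t in term_counts
            if len(term_docs[t]) >= min_df and term_counts[t] >= min_tf}
```

With min_df=2 and min_tf=2, a hapax legomenon (one occurrence in one document) fails both tests and is removed.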

Options

Flag                  Default  Description
--file                         Path to a single text file
--dir                          Path to a directory of .txt files
--scikit              off      Use the scikit-learn TF-IDF engine (optional; supports all the same flags)
--top-k               10       Number of terms per zone
--ngram               1        N-gram level (1–5, or 6 for skipgrams)
--chunk-size          2000     Tokens per chunk (min 100)
--limit               all      Randomly select N files from the directory (requires --dir)
--no-chunk            off      Treat each file as one document, with no chunking (requires --dir)
--wordnet             off      Only recognized English words participate in TF-IDF
--no-ngram-stopwords  off      Discard n-grams containing stop/function words (requires --ngram ≥ 2)
--min-df                       Remove terms with document frequency below this value
--min-tf                       Remove terms with term frequency below this value
--output                       Output CSV file path (required)

Either --file or --dir is required (not both).

How It Works

Text is tokenized, split into chunks, and scored with TF-IDF. Chunking a single document into sub-documents prevents IDF from collapsing to a constant. Terms are then bucketed into zones by their document-frequency percentile.
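
Why chunking matters: in a single unsplit document, every term occurs in exactly one of one documents, so IDF is the same constant for all terms and ranking degenerates to raw term frequency. Splitting the text into sub-documents restores variation in document frequency. A minimal sketch of the chunking step, assuming the 2000-token default from --chunk-size:

```python
def chunk(tokens, size=2000):
    """Split one token stream into fixed-size sub-documents
    so that document frequency can vary between terms."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]
```

A 4500-token file becomes three sub-documents of 2000, 2000, and 500 tokens; a term confined to one chunk now has df = 1 of 3 rather than 1 of 1.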

Engines

The default engine is a pure-Python implementation that requires no heavy dependencies and generally produces better results for zone analysis. It uses smooth IDF (log((1+N)/(1+DF)) + 1) with full control over tokenization, n-gram generation, and scoring.
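
That formula in code (it is the same smooth-IDF variant scikit-learn's TfidfVectorizer uses by default, which is why the two engines agree):

```python
import math

def smooth_idf(df, n_docs):
    """Smooth IDF: log((1 + N) / (1 + DF)) + 1.

    The +1 inside the ratio avoids division by zero and log(0);
    the trailing +1 keeps even corpus-wide terms at a positive weight.
    """
    return math.log((1 + n_docs) / (1 + df)) + 1
```

A term appearing in all 100 of 100 documents gets weight exactly 1.0 rather than 0; a term in only two of them gets roughly 4.5.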

A scikit-learn engine (--scikit) is also available for users familiar with the scikit-learn ecosystem or who want to compare results against its TfidfVectorizer. Both engines use the same IDF formula and produce comparable output. scikit-learn is an optional dependency — install it with:

pip install tfidf-zones[scikit]

or if using Poetry:

poetry install -E scikit

Download files

Source distribution: tfidf_zones-1.2.2.tar.gz (30.9 kB)

  • SHA256: 443bc00a0f4233cef15f69029ea8f438f065c0f5d9c20300ffd713f1c4e9ae0b
  • MD5: 0289e234087f3cdb9c86a6a00e8f4d91
  • BLAKE2b-256: bebf3350aff694c42c1a32f82cfdf916dec9f3e50eb5a2f90ded98147d6765c2

Built distribution: tfidf_zones-1.2.2-py3-none-any.whl (33.5 kB)

  • SHA256: ec0e8a9158c56f4f2d6e819d3922b868698f540f07fd948a8e899738a809d6e8
  • MD5: 2b7e84717ae4156342758b7bbec5afc4
  • BLAKE2b-256: d6fb0bffccc09ca1e06398f4f4181162bda14975d999ba162fca5fee39b12c0c

Both files uploaded via poetry/1.8.2, CPython/3.11.9, Darwin/24.6.0 (Trusted Publishing: no).
