
tfidf-zones


CLI tool that classifies terms in text documents into three zones based on TF-IDF and document frequency:

  • Too Common — high document frequency (df > 0.2N)
  • Goldilocks — high TF-IDF score within a moderate DF band (3 ≤ df ≤ 0.2N, tfidf ≥ Q95)
  • Too Rare — low document frequency (df < 3)
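
The bucketing above can be sketched in a few lines of Python. This is a hedged illustration, not the tool's actual API: the function name, argument names, and the precomputed `q95` cutoff are all placeholders.

```python
def classify_terms(df, tfidf, n_docs, q95):
    """Bucket terms into zones by document frequency and TF-IDF.

    df     -- term -> number of documents containing the term
    tfidf  -- term -> TF-IDF score
    n_docs -- total number of documents (N)
    q95    -- 95th-percentile TF-IDF cutoff
    """
    zones = {"too_common": [], "goldilocks": [], "too_rare": []}
    for term, d in df.items():
        if d > 0.2 * n_docs:
            zones["too_common"].append(term)
        elif d < 3:
            zones["too_rare"].append(term)
        elif tfidf[term] >= q95:
            zones["goldilocks"].append(term)
    return zones
```

Note that a term in the moderate DF band but below the Q95 cutoff lands in no zone: per the definitions above, Goldilocks requires both conditions.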

Useful for stylometric analysis, authorship attribution, and understanding term importance.

Install

poetry install

Usage

# Analyze a single file
poetry run tfidf-zones --file novel.txt --output results.csv

# Analyze with bigrams
poetry run tfidf-zones --file novel.txt --ngram 2 --output results.csv

# Analyze a directory of .txt files
poetry run tfidf-zones --dir ./texts/ --output results.csv

# Show top 25 terms per zone with custom chunk size
poetry run tfidf-zones --file novel.txt --top-k 25 --chunk-size 500 --output results.csv

Recipes

Find content-word bigrams across a corpus:

poetry run tfidf-zones \
  --dir ./texts/ --limit 100 --output results.csv \
  --no-chunk --wordnet --ngram 2 --no-ngram-stopwords

Combines --wordnet (keep only recognized English words), --ngram 2 (bigrams), and --no-ngram-stopwords (discard bigrams containing stop/function words like "of_the") to surface meaningful two-word terms.

Find content phrases (trigrams and above):

poetry run tfidf-zones \
  --dir ./texts/ --output results.csv \
  --no-chunk --wordnet --ngram 3 --no-ngram-stopwords

Increase --ngram to 3, 4, or 5 to find longer phrases. The stopword filter removes any n-gram where at least one token is a stop word or function word, so only content-rich phrases survive.
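
The filter described here can be sketched as follows. The stopword set is a small illustrative subset and the function names are hypothetical, not the tool's internals:

```python
STOPWORDS = {"of", "the", "a", "an", "in", "on", "and", "to", "is"}  # illustrative subset

def ngrams(tokens, n):
    """Generate n-grams joined with underscores, e.g. 'white_whale'."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def content_ngrams(tokens, n):
    """Keep only n-grams in which no token is a stop word."""
    return [g for g in ngrams(tokens, n)
            if not any(t in STOPWORDS for t in g.split("_"))]
```

For the token stream "the white whale of the sea", only "white_whale" survives at n = 2; every other bigram contains at least one stop word.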

Corpus analysis with post-processing filters:

poetry run tfidf-zones \
  --dir ./texts/ --output results.csv \
  --no-chunk --wordnet --min-df 2 --min-tf 2

Use --min-df and --min-tf to remove terms that appear in too few documents or have too few total occurrences, reducing noise from hapax legomena.
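
The effect of the two floors can be sketched as a single set comprehension over corpus statistics. The data structures here are illustrative (the tool's internals may differ); the defaults mirror the recipe above:

```python
def apply_frequency_filters(term_docs, term_counts, min_df=2, min_tf=2):
    """Drop terms whose document frequency (number of documents seen in)
    or term frequency (total occurrences) falls below the floor.

    term_docs   -- term -> set of document ids containing the term
    term_counts -- term -> total occurrences across the corpus
    """
    return {t for t in term_counts
            if len(term_docs[t]) >= min_df and term_counts[t] >= min_tf}
```

With min_df=2 and min_tf=2, a hapax legomenon (one occurrence in one document) fails both tests and is removed.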

Options

Flag                  Default  Description
--file                         Path to a single text file
--dir                          Path to a directory of .txt files
--scikit              off      Use the scikit-learn TF-IDF engine (optional; supports all the same flags)
--top-k               10       Number of terms per zone
--ngram               1        N-gram level (1–5, or 6 for skipgrams)
--chunk-size          2000     Tokens per chunk (min 100)
--limit               all      Randomly select N files from the directory (requires --dir)
--no-chunk            off      Treat each file as one document, with no chunking (requires --dir)
--wordnet             off      Only recognized English words participate in TF-IDF
--no-ngram-stopwords  off      Discard n-grams containing stop/function words (requires --ngram ≥ 2)
--min-df                       Remove terms with document frequency below this value
--min-tf                       Remove terms with term frequency below this value
--output                       Output CSV file path (required)

Either --file or --dir is required (not both).

How It Works

Text is tokenized, split into chunks, and scored with TF-IDF. Chunking a single document into sub-documents prevents IDF from collapsing to a constant. Terms are then bucketed into zones by their document-frequency percentile.
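
Why chunking matters: in a single unsplit document, every term occurs in exactly one of one documents, so IDF is the same constant for all terms and ranking degenerates to raw term frequency. Splitting the text into sub-documents restores variation in document frequency. A minimal sketch of the chunking step, assuming the 2000-token default from --chunk-size:

```python
def chunk(tokens, size=2000):
    """Split one token stream into fixed-size sub-documents
    so that document frequency can vary between terms."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]
```

A 4500-token file becomes three sub-documents of 2000, 2000, and 500 tokens; a term confined to one chunk now has df = 1 of 3 rather than 1 of 1.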

Engines

The default engine is a pure-Python implementation that requires no heavy dependencies and generally produces better results for zone analysis. It uses smooth IDF (log((1+N)/(1+DF)) + 1) with full control over tokenization, n-gram generation, and scoring.
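
That formula in code (it is the same smooth-IDF variant scikit-learn's TfidfVectorizer uses by default, which is why the two engines agree):

```python
import math

def smooth_idf(df, n_docs):
    """Smooth IDF: log((1 + N) / (1 + DF)) + 1.

    The +1 inside the ratio avoids division by zero and log(0);
    the trailing +1 keeps even corpus-wide terms at a positive weight.
    """
    return math.log((1 + n_docs) / (1 + df)) + 1
```

A term appearing in all 100 of 100 documents gets weight exactly 1.0 rather than 0; a term in only two of them gets roughly 4.5.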

A scikit-learn engine (--scikit) is also available for users familiar with the scikit-learn ecosystem or who want to compare results against its TfidfVectorizer. Both engines use the same IDF formula and produce comparable output. scikit-learn is an optional dependency — install it with:

pip install tfidf-zones[scikit]

or if using Poetry:

poetry install -E scikit

Download files

Source distribution: tfidf_zones-1.2.2.tar.gz (30.9 kB)

  • SHA256: 443bc00a0f4233cef15f69029ea8f438f065c0f5d9c20300ffd713f1c4e9ae0b
  • MD5: 0289e234087f3cdb9c86a6a00e8f4d91
  • BLAKE2b-256: bebf3350aff694c42c1a32f82cfdf916dec9f3e50eb5a2f90ded98147d6765c2

Built distribution: tfidf_zones-1.2.2-py3-none-any.whl (33.5 kB)

  • SHA256: ec0e8a9158c56f4f2d6e819d3922b868698f540f07fd948a8e899738a809d6e8
  • MD5: 2b7e84717ae4156342758b7bbec5afc4
  • BLAKE2b-256: d6fb0bffccc09ca1e06398f4f4181162bda14975d999ba162fca5fee39b12c0c

Both files uploaded via poetry/1.8.2, CPython/3.11.9, Darwin/24.6.0 (Trusted Publishing: no).
