Skip to main content

Pure-Python text analysis: readability, vocabulary richness, sentiment, n-grams

Project description

textstat-py

Text analysis for Python. Zero dependencies.

NLTK is 40MB and requires a corpus download just to tokenize a sentence. textblob pulls in NLTK. spaCy needs a 50MB model file before it'll tell you anything. For most text analysis tasks — readability scores, vocabulary stats, sentiment, writing quality signals — none of that weight is necessary.

pip install textstat-py

What it does

$ textstat essay.txt
=== Text Statistics: essay.txt ===
  Words            : 1243
  Sentences        : 67
  Reading time     : 6.2 min
  Flesch ease      : 58.4  (0=hard, 100=easy)
  FK grade level   : 11.2
  Grade consensus  : 10.8  (avg of 4 formulas)
  Lexical diversity: 0.71  (unique/total words)
  Sentiment        : neutral  (polarity=0.02)
  Passive voice    : 0.18  (fraction of sentences)
  Adverb density   : 0.031  (>0.05 may signal weak verbs)
  Top words        : data(18), model(14), training(11), loss(9), layer(7)

Compare two versions of the same document:

$ textstat --compare draft.txt final.txt
Metric                      A: draft.txt          B: final.txt          Delta
------------------------------------------------------------------------
Words                       1891                  1243                  -648
Reading time (min)          9.46                  6.21                  -3.25
Flesch ease                 44.1                  58.4                  +14.3
Grade level                 13.2                  10.8                  -2.4
Passive voice ratio         0.31                  0.18                  -0.13
Adverb density              0.071                 0.031                 -0.04

Install

pip install textstat-py

Python 3.8+. No dependencies. Single file.

CLI

textstat document.txt          # full report
textstat --json document.txt   # JSON output
textstat --wpm 250 document.txt  # custom reading speed
textstat --compare before.txt after.txt  # side-by-side diff
cat text.txt | textstat        # stdin

Python API

from textstat import analyze, flesch_reading_ease, grade_level_consensus

text = open("essay.txt").read()

# Quick scores
print(flesch_reading_ease(text))    # 58.4
print(grade_level_consensus(text))  # 10.8

# Full analysis dict
stats = analyze(text)
print(stats["passive_voice_ratio"])  # 0.18
print(stats["adverb_density"])       # 0.031
print(stats["top_words"])            # [("data", 18), ("model", 14), ...]

Functions

Readability

  • flesch_reading_ease(text) — 0–100, higher = easier
  • flesch_kincaid_grade(text) — US grade level
  • gunning_fog(text) — years of education needed
  • coleman_liau_index(text)
  • automated_readability_index(text)
  • smog_index(text)
  • grade_level_consensus(text) — mean across all grade formulas

Writing quality

  • passive_voice_ratio(text) — fraction of sentences with passive constructions
  • adverb_density(text) — fraction of words that are -ly adverbs (>0.05 is a signal)

Vocabulary

  • lexical_diversity(text) — type-token ratio
  • mattr(text, window=100) — moving-average TTR, stable for long texts
  • herdan_c(text), yule_k(text) — length-robust vocabulary richness
  • hapax_legomena_ratio(text) — fraction of words appearing exactly once
  • vocabulary_richness(text) — all of the above as a dict

Counts & structure

  • count_words(text), count_sentences(text), count_paragraphs(text)
  • reading_time(text, wpm=200)
  • sentence_stats(text) — min/max/mean/median sentence length
  • paragraph_stats(text) — word counts per paragraph

Sentiment

  • sentiment_polarity(text) — −1.0 to +1.0, lexicon-based, no model needed
  • sentiment_label(text) — "positive" / "neutral" / "negative"

N-grams

  • top_ngrams(text, n=2, k=10) — most frequent n-grams
  • ngram_diversity(text, n=2) — unique n-grams / total positions
  • ngram_stats(text) — bigrams + trigrams bundled

Misc

  • top_words(text, n=10) — most frequent non-stopword words
  • word_frequency_distribution(text) — total tokens, unique types, Zipf fit
  • text_density(text) — content words / total words

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

textstat_py-0.2.0.tar.gz (19.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

textstat_py-0.2.0-py3-none-any.whl (11.0 kB view details)

Uploaded Python 3

File details

Details for the file textstat_py-0.2.0.tar.gz.

File metadata

  • Download URL: textstat_py-0.2.0.tar.gz
  • Upload date:
  • Size: 19.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for textstat_py-0.2.0.tar.gz
Algorithm Hash digest
SHA256 b6235b3d06b561f44bd3097ffd77d1b7bcd01fb0dcf201bd283b80dec894e5df
MD5 98e8da3b46b0f068ec554a96cf210c90
BLAKE2b-256 499a607cc81bdd2d1b3c74f939862172e476892be6f81ad2ef11525e3756adc1

See more details on using hashes here.

File details

Details for the file textstat_py-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: textstat_py-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 11.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for textstat_py-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 de27a2732a75550aaeac0c6b2292ce1bd0ba9d6db14eab1732f63d6b60da6af1
MD5 8e522b280bc051566af9d909e04e83e4
BLAKE2b-256 b32235dcd41a80fcb52b3f1caeb85b28320a730ce910262ffa01a472211c517a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page