A Python library for readability and textual metrics analysis, supporting multiple languages.

SmoothText


Introduction

SmoothText is a Python library for computing readability scores and other text statistics in multiple languages.

The library is designed with accuracy as its primary goal.

Requirements

Python Version

Python 3.10 or higher.

External Dependencies

Library Version License Notes
NLTK >=3.9.1 Apache 2.0 Conditionally optional.
Stanza >=1.10.1 Apache 2.0 Conditionally optional.
CMUdict >=1.0.32 GPLv3+ Required if Stanza is the selected backend.
Unidecode >=1.3.8 GNU GPLv2 Required.
Pyphen >=0.17.0 GPL 2.0+/LGPL 2.1+/MPL 1.1 Required.
emoji >=2.14.1 BSD Required.

At least one of NLTK or Stanza must be installed; SmoothText uses it as its backend.

Features

Readability Analysis

SmoothText can calculate readability scores of text in the following languages, using the following formulas.

Method Description
compute_readability Computes the readability score of a text using a specified formula.

English

Method Formula Authors Notes
automated_readability_index Automated Readability Index Smith & Senter, 1967 -
flesch_reading_ease Flesch Reading Ease Flesch, 1948 -
flesch_kincaid_grade Flesch-Kincaid Grade Kincaid et al., 1975 -
flesch_kincaid_grade_simplified Flesch-Kincaid Grade Simplified Kincaid et al., 1975 Essentially the same as Flesch-Kincaid Grade, but the result differs slightly because the formula's constants are rounded.
gunning_fog_index Gunning Fog Index Gunning, 1952 -

Notes:

  • Although SmoothText supports both US English and GB English, formulas work best with US English.
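
To make the relationship between the Flesch formulas concrete, here is a minimal sketch that evaluates them from raw counts. It uses the constants as commonly published (Flesch, 1948; Kincaid et al., 1975) and is not taken from SmoothText's source; the function names mirror the method names in the table, but the count-based signatures and the sample counts are assumptions made for illustration.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease from raw counts (higher means easier)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade from raw counts (approximate US grade level)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def flesch_kincaid_grade_simplified(words: int, sentences: int, syllables: int) -> float:
    """Same shape as Flesch-Kincaid Grade, with constants rounded to 0.4, 12 and 16."""
    return 0.4 * (words / sentences) + 12 * (syllables / words) - 16


# Hypothetical counts for a short passage (not measured from a real text).
counts = {"words": 100, "sentences": 5, "syllables": 150}

fre = flesch_reading_ease(**counts)
fkg = flesch_kincaid_grade(**counts)
fkg_simple = flesch_kincaid_grade_simplified(**counts)
print(round(fre, 3), round(fkg, 2), round(fkg_simple, 2))
```

With these counts the full and simplified grade formulas land within a tenth of a grade of each other, which is the kind of small difference the rounded constants introduce.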

German

Method Formula Authors Notes
amstad Flesch Reading Ease Amstad, 1978 German adaptation of Flesch Reading Ease.
wiener_sachtextformel Wiener Sachtextformel Bamberger & Vanecek, 1984 German adaptation of Flesch-Kincaid Grade. All versions (1 through 4) are supported.

Russian

Method Formula Authors Notes
matskovskiy Matskovskiy Matskovskiy, 1976 Russian adaptation of Flesch Reading Ease.

Turkish

Method Formula Authors Notes
atesman Ateşman Ateşman, 1997 Turkish adaptation of Flesch Reading Ease.
bezirci_yilmaz Bezirci-Yılmaz Bezirci & Yılmaz, 2010 Turkish adaptation of Flesch-Kincaid Grade.
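
The German and Turkish adaptations of Flesch Reading Ease listed above keep the structure of the original formula and change only the constants. The sketch below evaluates the commonly published forms of the Amstad and Ateşman formulas from raw counts. It is an illustration, not SmoothText's implementation; the count-based signatures are assumptions.

```python
def amstad(words: int, sentences: int, syllables: int) -> float:
    """Amstad (1978): German adaptation of Flesch Reading Ease."""
    return 180 - (words / sentences) - 58.5 * (syllables / words)


def atesman(words: int, sentences: int, syllables: int) -> float:
    """Ateşman (1997): Turkish adaptation of Flesch Reading Ease."""
    return 198.825 - 40.175 * (syllables / words) - 2.610 * (words / sentences)


# Hypothetical counts: 100 words in 10 sentences.
german_score = amstad(words=100, sentences=10, syllables=200)
turkish_score = atesman(words=100, sentences=10, syllables=250)
print(round(german_score, 4), round(turkish_score, 4))
```

Both scores follow the Flesch convention that higher values indicate easier text, but the scales are calibrated per language, so scores from different formulas should not be compared directly.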

Sentencizing, Tokenization, and Syllabification

SmoothText can extract sentences, words, or syllables from texts.

Method Description
Sentence Level
sentencize Splits text into sentences using language-aware rules
count_sentences Returns the number of sentences found in the text
Word Level
tokenize Extracts word tokens from text; can group by sentences with the split_sentences flag
count_words Counts the number of alphanumeric words in a text
word_frequencies Returns a dictionary of word frequencies with optional lemmatization
Syllable Level
syllabify Splits words into syllables; can be applied to words, tokens, or sentences
count_syllables Counts syllables in words or text using language-specific rules
syllable_frequencies Returns a dictionary mapping syllable counts to frequency in the analyzed text
Character Level
count_consonants Counts the number of consonant characters in text
count_vowels Counts the number of vowel characters in text
Emoji Handling
demojize Converts emoji characters to their text descriptions with custom delimiters
remove_emojis Removes all emoji characters from text

Notes

  • count_syllables is likely to produce more accurate results in comparison to the syllabify method.
  • At the moment, lemmatization is only supported for English with Stanza as the backend. Other languages and backends ignore the lemmatization flag.

Language Sentencizing Tokenization Syllabification
English (NLTK, Stanza) (NLTK, Stanza) (CMU Dictionary, Pyphen)
German (NLTK, Stanza) (NLTK, Stanza) (Pyphen)
Russian (NLTK, Stanza) (NLTK, Stanza) (Pyphen)
Turkish (NLTK, Stanza) (NLTK, Stanza) (Custom formula)

Pyphen does not always produce accurate results, so custom syllabification formulas or dictionaries are preferred whenever they are available.
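
Turkish shows why a custom rule can beat a hyphenation dictionary: every Turkish syllable contains exactly one vowel, so counting vowels counts syllables. The sketch below applies that rule. It is an assumption about what a custom formula could look like, not SmoothText's actual code, and the helper name `count_syllables_tr` is hypothetical.

```python
# Turkish vowels in both cases, plus the circumflex vowels found in some loanwords.
TURKISH_VOWELS = frozenset("aeıioöuüâîûAEIİOÖUÜÂÎÛ")


def count_syllables_tr(word: str) -> int:
    """Count syllables in a Turkish word by counting its vowels (hypothetical helper)."""
    return sum(1 for ch in word if ch in TURKISH_VOWELS)


for sample in ("merhaba", "kitap", "öğretmen", "İstanbul"):
    print(sample, count_syllables_tr(sample))
```

The same shortcut does not carry over to English, where vowel letters and syllables do not line up one to one, which is why a pronunciation dictionary is used there.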

Reading Time

SmoothText can calculate how long a text would take to read. The reading time is based on the average reading speed of an adult.

Method Description
reading_aloud_time Calculates the time needed to read a text aloud.
reading_time Calculates the reading time of a text.
silent_reading_time Calculates the silent reading time.
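
The idea behind a reading-time estimate is a simple division of word count by reading speed. The sketch below shows that arithmetic. The words-per-minute values are illustrative assumptions, not the speeds SmoothText uses, and `estimate_reading_time` is a hypothetical helper rather than part of the library.

```python
def estimate_reading_time(word_count: int, words_per_minute: float) -> float:
    """Return the estimated reading time in seconds (hypothetical helper)."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    return word_count / words_per_minute * 60.0


# Assumed average adult speeds, chosen only for illustration.
ASSUMED_SILENT_WPM = 240.0
ASSUMED_ALOUD_WPM = 180.0

silent_seconds = estimate_reading_time(480, ASSUMED_SILENT_WPM)
aloud_seconds = estimate_reading_time(480, ASSUMED_ALOUD_WPM)
print(f"silent: {silent_seconds:.0f} s, aloud: {aloud_seconds:.0f} s")
```

Reading aloud is slower than silent reading, so the same text yields a longer estimate for the aloud case.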

Installation

You can install SmoothText via pip.

pip install smoothtext

Usage

Importing and Initializing the Library

SmoothText comes with four submodules: Backend, Language, ReadabilityFormula and SmoothText.

from smoothtext import Backend, Language, ReadabilityFormula, SmoothText

Instancing

SmoothText is not designed to be used through static methods. An instance must be created to access its analysis methods.

When creating an instance, the language and the backend to be used with it can be specified.

The following will create a new SmoothText instance configured to be used with the English language (by default, the United States variant) using NLTK as the backend.

st = SmoothText('en', 'nltk')

Once an instance is created, its backend cannot be changed, but its working language can be changed at any time.

st.language = 'tr'  # Now configured to work with Turkish.
st.language = 'en-gb'  # Switching back to English, but to the United Kingdom variant.

Readying the Backends

When an instance is created, it first attempts to import the backend and download the required language data. To avoid doing this at instantiation time, prepare the required packages in advance with the static SmoothText.prepare() method.

SmoothText.prepare('nltk', 'en,tr')  # Preparing NLTK to be used with English and Turkish

Computing Readability Scores

Each language has its own set of readability formulas. When computing the readability score of a text, one of the formulas supported for its language must be used. SmoothText offers three ways to perform this calculation.

text: str = 'Forrest Gump is a 1994 American comedy-drama film directed by Robert Zemeckis.'  # https://en.wikipedia.org/wiki/Forrest_Gump

# Generic computation method
st.compute_readability(text, ReadabilityFormula.Flesch_Reading_Ease)

# Using instance as a callable for generic computation
st(text, ReadabilityFormula.Flesch_Reading_Ease)

# Specific formula method
st.flesch_reading_ease(text)

Tokenizing and Calculating Text Statistics

SmoothText is designed to work with sentences, words/tokens, and syllables.

Other Features

Refer to the documentation for a complete list of available methods.

Inconsistencies

Backend Related Inconsistencies

  • NLTK and Stanza have different tokenization rules. This may cause differences in the number of tokens/sentences between the two backends.
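
To see how tokenization rules alone change the counts, compare two deliberately naive rules on the same text. Neither rule is what NLTK or Stanza does; both are stand-ins written with the standard library to illustrate the effect, and the function names are hypothetical.

```python
import re

text = "Dr. Smith didn't arrive. He called at 5 p.m. instead."


def tokens_whitespace(s: str) -> list[str]:
    """Rule A: split on whitespace, keeping punctuation attached to words."""
    return s.split()


def tokens_word_chars(s: str) -> list[str]:
    """Rule B: keep only runs of word characters, splitting at apostrophes and periods."""
    return re.findall(r"\w+", s)


def sentences_naive(s: str) -> list[str]:
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", s)


print(len(tokens_whitespace(text)), len(tokens_word_chars(text)))
print(len(sentences_naive(text)))
```

The two token rules disagree on the word count, and the naive sentence splitter is fooled by the abbreviations. Because readability formulas divide by these counts, a score computed with one backend is best compared only with scores from the same backend.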

Language Related Inconsistencies

  • The syllabification of a word may differ between variants of the same language. For example, Pyphen splits "hello" into two syllables with its American English dictionary but leaves it whole with its British English one. See the code snippet below.
    • To avoid this as much as possible, CMUdict is the default syllabification method for English. When it is not available, Pyphen is used as a fallback.
from pyphen import Pyphen

us = Pyphen(lang="en_US")
print(us.inserted("hello"))
# Output: 'hel-lo'

gb = Pyphen(lang="en_GB")
print(gb.inserted("hello"))
# Output: 'hello'

Documentation

See here for API documentation.

License

SmoothText has an MIT license. See LICENSE.
