readability·PyPI

Measure the readability of a given text using surface characteristics

Development Status
- 4 - Beta
Environment
- Console
- Web Environment
Intended Audience
- Science/Research
License
- OSI Approved :: Apache Software License
Operating System
- POSIX
Programming Language
Topic
- Text Processing :: Linguistic

Project description

An implementation of traditional readability measures based on simple surface characteristics. These measures are basically linear regressions based on the number of words, syllables, and sentences.
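As a rough illustration of what "linear regression on surface counts" means, the sketch below computes two of the classic scores from raw counts. The coefficients are the commonly published ones for Flesch Reading Ease and the Flesch-Kincaid grade level. The function names are hypothetical and not part of this package's API, and the package's own implementation may differ in detail.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for illustration: 100 words, 5 sentences, 150 syllables.
print(round(flesch_reading_ease(100, 5, 150), 2))
print(round(flesch_kincaid_grade(100, 5, 150), 2))
```

Both scores depend only on two ratios, words per sentence and syllables per word, which is why the quality of sentence splitting and syllable counting matters so much.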

The functionality is modeled after the UNIX style(1) command. Unlike the implementation in GNU diction, this version supports UTF-8 encoded text, but it expects text that is already sentence-segmented and word-tokenized. Syllabification and word type recognition are based on simple heuristics and provide only a rough measure. The supported languages are English, German, and Dutch. Adding support for a new language requires adding heuristics for syllabification and word type recognition; see langdata.py.

NB: all readability formulas were developed for English, so the scales of the outcomes are only meaningful for English texts. The Dale-Chall measure uses the original word list for English; for Dutch and German, it uses lists of frequent words that were not specifically selected for recognizability by school children.

For syntactic complexity measures, see udstyle

Installation

$ pip install https://github.com/andreasvc/readability/tarball/master

Usage

The following preprocessing is expected:

Tokens (words or punctuation) separated by space
One sentence per line; no line breaks within sentences
Paragraphs separated by one empty line

The quality of preprocessing affects the validity of the results.
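To make the expected layout concrete, here is a minimal sketch of how text in this format can be read back into paragraphs, sentences, and tokens. The helper name parse_preprocessed is hypothetical and is not part of the package; it only mirrors the three rules listed above.

```python
def parse_preprocessed(text):
    """Return a list of paragraphs; each paragraph is a list of
    sentences, and each sentence is a list of tokens."""
    paragraphs = []
    current = []
    for line in text.splitlines():
        tokens = line.split()  # tokens are separated by spaces
        if tokens:
            current.append(tokens)  # one sentence per line
        elif current:
            paragraphs.append(current)  # an empty line closes a paragraph
            current = []
    if current:
        paragraphs.append(current)
    return paragraphs


sample = (
    "This is an example sentence .\n"
    "Note that tokens will be separated by spaces and sentences by newlines .\n"
    "\n"
    "This is the second paragraph .\n"
)
paragraphs = parse_preprocessed(sample)
num_sentences = sum(len(p) for p in paragraphs)
num_tokens = sum(len(s) for p in paragraphs for s in p)
print(len(paragraphs), num_sentences, num_tokens)  # 2 3 25
```

The token count here includes punctuation tokens; a readability measure would normally exclude them when counting words.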

From Python; tokenization using syntok:

>>> import readability
>>> import syntok.segmenter as segmenter
>>> text = """
This is an example sentence. Note that tokens will be separated by spaces
and sentences by newlines.

This is the second paragraph."""
>>> tokenized = '\n\n'.join(
     '\n'.join(' '.join(token.value for token in sentence)
        for sentence in paragraph)
     for paragraph in segmenter.analyze(text))
>>> print(tokenized)
This is an example sentence .
Note that tokens will be separated by spaces and sentences by newlines .

This is the second paragraph .
>>> results = readability.getmeasures(tokenized, lang='en')
>>> print(results['readability grades']['FleschReadingEase'])
68.64621212121216

Command line usage:

$ readability --help
Simple readability measures.

Usage: readability [--lang=<x>] [FILE]
or: readability [--lang=<x>] --csv FILES...

By default, input is read from standard input.
Text should be encoded with UTF-8,
one sentence per line, tokens space-separated.

Options:
  -L, --lang=<x>   Set language (available: de, nl, en).
  --csv            Produce a table in comma separated value format on
                   standard output given one or more filenames.
  --tokenizer=<x>  Specify a tokenizer including options that will be given
                   each text on stdin and should return tokenized output on
                   stdout. Not applicable when reading from stdin.

Recommended tokenizers:

For English and German, I recommend “tokenizer”, cf. http://moin.delph-in.net/WeSearch/DocumentParsing
For Dutch, I recommend the tokenizer that is part of the Alpino parser: http://www.let.rug.nl/vannoord/alp/Alpino/.
ucto is a general multilingual tokenizer: http://ilk.uvt.nl/ucto
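The --tokenizer option is described above as a command that receives each text on stdin and returns tokenized output on stdout. A minimal sketch of that contract, using only the Python standard library, might look as follows. The helper run_tokenizer and the toy stand-in command are assumptions for illustration; this is not the package's own code, and it assumes a POSIX shell syntax for the command string.

```python
import shlex
import subprocess
import sys


def run_tokenizer(command, text):
    """Pipe text through an external command and return its stdout."""
    proc = subprocess.run(
        shlex.split(command),
        input=text.encode("utf8"),
        stdout=subprocess.PIPE,
        check=True,  # raise if the tokenizer exits with an error
    )
    return proc.stdout.decode("utf8")


# Stand-in "tokenizer" so the sketch runs without extra software:
# a one-line Python script that puts a space before each full stop.
toy = shlex.join([
    sys.executable, "-c",
    "import sys; sys.stdout.write(sys.stdin.read().replace('.', ' .'))",
])
print(run_tokenizer(toy, "This is a test."))
```

In practice the stand-in would be replaced by the command line of a real tokenizer, including its options.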

Example using ucto:

$ ucto -L en -n -s "''" "CONRAD, Joseph - Lord Jim.txt" | readability
[...]
readability grades:
    Kincaid:                          5.44
    ARI:                              6.39
    Coleman-Liau:                     6.91
    FleschReadingEase:               85.17
    GunningFogIndex:                  9.86
    LIX:                             31.98
    SMOGIndex:                        9.39
    RIX:                              2.56
    DaleChallIndex:                   8.02
sentence info:
    characters_per_word:              4.17
    syll_per_word:                    1.24
    words_per_sentence:              16.35
    sentences_per_paragraph:         11.5
    type_token_ratio:                 0.09
    characters:                  551335
    syllables:                   164205
    words:                       132211
    wordtypes:                    12071
    sentences:                     8087
    paragraphs:                     703
    long_words:                   20670
    complex_words:                10990
    complex_words_dc:             29908
word usage:
    tobeverb:                      3907
    auxverb:                       1630
    conjunction:                   4398
    pronoun:                      18092
    preposition:                  19290
    nominalization:                1167
sentence beginnings:
    pronoun:                       2578
    interrogative:                  217
    article:                        629
    subordination:                  120
    conjunction:                    236
    preposition:                    397
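The grades in this output can be checked by hand from the counts under "sentence info". The sketch below applies the commonly published formulas for several of the measures. It assumes that "characters" counts the characters in words and that "complex_words_dc" is the number of words not on the Dale-Chall list; the package's actual code may differ, but with these counts the rounded values agree with the grades printed above.

```python
# Counts copied from the "sentence info" block of the sample output.
characters = 551335
words = 132211
sentences = 8087
long_words = 20670
complex_words = 10990
complex_words_dc = 29908  # assumed: words not on the Dale-Chall list

words_per_sentence = words / sentences
chars_per_word = characters / words

ari = 4.71 * chars_per_word + 0.5 * words_per_sentence - 21.43
coleman_liau = (0.0588 * 100 * chars_per_word
                - 0.296 * 100 * sentences / words - 15.8)
gunning_fog = 0.4 * (words_per_sentence + 100 * complex_words / words)
lix = words_per_sentence + 100 * long_words / words
rix = long_words / sentences

# Dale-Chall adds a constant when more than 5% of the words are off the list.
pct_dc = 100 * complex_words_dc / words
dale_chall = 0.1579 * pct_dc + 0.0496 * words_per_sentence
if pct_dc > 5:
    dale_chall += 3.6365

for name, value in [("ARI", ari), ("Coleman-Liau", coleman_liau),
                    ("GunningFogIndex", gunning_fog), ("LIX", lix),
                    ("RIX", rix), ("DaleChallIndex", dale_chall)]:
    print(f"{name}: {value:.2f}")
```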

The --csv option collects readability measures for a number of texts in a table. To tokenize documents on the fly when using this option, use the --tokenizer option. Example with the “tokenizer” tool:

$ readability --csv --tokenizer='tokenizer -L en-u8 -P -S -E "" -N' */*.txt >readabilitymeasures.csv
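A similar table can be built from Python. The sketch below flattens nested results of the shape used in the Python example earlier (category, then measure, as in results['readability grades']['FleschReadingEase']) into one CSV row per file. The helper measures_to_csv, the column naming, and the sample values are all made up for illustration; the column layout produced by the --csv option itself may differ.

```python
import csv
import io


def measures_to_csv(per_file):
    """Turn {filename: {category: {measure: value}}} into CSV text.

    Assumes every file has the same categories and measures.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = None
    for filename, results in per_file.items():
        flat = {
            f"{category}/{measure}": value
            for category, measures in results.items()
            for measure, value in measures.items()
        }
        if header is None:
            header = list(flat)  # column order follows the first file
            writer.writerow(["filename"] + header)
        writer.writerow([filename] + [flat[key] for key in header])
    return out.getvalue()


# Made-up values, for illustration only.
fake = {
    "a.txt": {"readability grades": {"Kincaid": 5.4, "ARI": 6.4}},
    "b.txt": {"readability grades": {"Kincaid": 8.1, "ARI": 9.0}},
}
print(measures_to_csv(fake), end="")
```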

References

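The following readability metrics are included: Kincaid, ARI, Coleman-Liau, Flesch Reading Ease, Gunning Fog Index, LIX, SMOG Index, RIX, and Dale-Chall Index.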

For better readability measures, consider the following:

Collins-Thompson & Callan (2004). A language modeling approach to predicting reading difficulty. In Proc. of HLT/NAACL, pp. 193-200. http://aclweb.org/anthology/N04-1025.pdf
Schwarm & Ostendorf (2005). Reading level assessment using SVM and statistical language models. Proc. of ACL, pp. 523-530. http://www.aclweb.org/anthology/P05-1065.pdf
The Lexile framework for reading. http://www.lexile.com
Coh-Metrix. http://cohmetrix.memphis.edu/
Stylene: http://www.clips.ua.ac.be/category/projects/stylene
T-Scan: http://languagelink.let.uu.nl/tscan

Acknowledgments

The code is based on https://github.com/mmautner/readability, which in turn was based on https://github.com/nltk/nltk_contrib/tree/master/nltk_contrib/readability

Release history

0.3.2: Jan 14, 2025 (this version)
0.3.1: Jan 13, 2019
0.3: Jul 21, 2018
0.2: Aug 11, 2015
0.1: Apr 13, 2014

Download files

Source Distribution

readability-0.3.2.tar.gz (36.1 kB)

Uploaded Jan 14, 2025 Source

File details

Details for the file readability-0.3.2.tar.gz.

File metadata

Download URL: readability-0.3.2.tar.gz
Upload date: Jan 14, 2025
Size: 36.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.10.12

File hashes

Hashes for readability-0.3.2.tar.gz
Algorithm	Hash digest
SHA256	`5aace888855cb3ef1b7dd059e41bbc6ac1f7daba321b2d24062ca75fdf6e576d`
MD5	`8ef62ec7fd25de17669f6f4802756a09`
BLAKE2b-256	`6d1093cf95f579e43042b45d43c832aff7c55490ab1b38c6ce46f2f35245281f`
