Multi-Dimensional Readability (MDR) score and 42 linguistic features for English text

Project description

mdr-readability

A Python package for computing the MDR (Multi-Dimensional Readability) score and 42 linguistic features from English text.

MDR is a regression-based readability index that combines lexical, syntactic, and semantic features to predict text difficulty. It achieves R² = 0.9249 on the calibration corpus.
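For readers unfamiliar with the metric, R² (the coefficient of determination) is the share of variance in the target that a regression explains. The sketch below shows the standard definition only. The function name and the sample numbers are illustrative assumptions and are not taken from the package or its calibration corpus.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: how far predictions are from the targets.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: how far targets are from their own mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all actual values are equal")
    return 1 - ss_res / ss_tot


# Made-up difficulty levels and predictions, for illustration only.
levels = [1.0, 2.0, 3.0, 4.0]
predictions = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(levels, predictions), 4))
```

A value close to 1 means the predictions track the target levels closely; a value of 0 means they do no better than always predicting the mean.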


Installation

1. Install the package

pip install mdr-readability

Or, for development (editable install):

git clone https://github.com/jacktanhua/MDR.git
cd MDR
pip install -e .

2. Install the spaCy model

python -m spacy download en_core_web_lg

3. Install NLTK data (first time only)

import nltk
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("averaged_perceptron_tagger")

4. Place the vocabulary data files

Two external data files are required:

File                                               Description
The New Dale-Chall Familiar Words List_2950.txt    Dale-Chall familiar word list
cefrj-vocabulary-profile-1.5.csv                   CEFR vocabulary profile (must have headword and CEFR columns)

Optional (for norm scores):

File                           Description
MDR_level_features_norm.csv    Level-specific norm values (must have a Code column plus L1 to L12 columns)
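Because both CSV files must carry specific columns, it can help to check the headers before running anything. This is a minimal sketch using only the standard library. The helper name missing_columns is hypothetical, and the spelling of the level columns ("L1" through "L12") is an assumption about the norm file.

```python
import csv


def missing_columns(csv_path, required):
    """Return the required column names that are absent from the CSV header."""
    # utf-8-sig reads files with or without a byte-order mark.
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        header = next(csv.reader(handle), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Column names taken from the tables above; "L1".."L12" is an assumed spelling.
CEFR_COLUMNS = ["headword", "CEFR"]
NORM_COLUMNS = ["Code"] + [f"L{i}" for i in range(1, 13)]

# Example (paths are placeholders):
# print(missing_columns("data/cefrj-vocabulary-profile-1.5.csv", CEFR_COLUMNS))
# print(missing_columns("data/MDR_level_features_norm.csv", NORM_COLUMNS))
```

An empty list means every required column was found.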

By default the package looks for these files in <package_root>/data/.
You can point it to any directory:

import mdr_readability
mdr_readability.set_data_dir("/path/to/your/data")
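Before pointing the package at a directory, you may want to confirm that the files are actually there. The sketch below is one way to do that with pathlib. The helper missing_data_files is hypothetical and not part of mdr_readability; only the file names come from the tables above.

```python
from pathlib import Path

REQUIRED_FILES = [
    "The New Dale-Chall Familiar Words List_2950.txt",
    "cefrj-vocabulary-profile-1.5.csv",
]
OPTIONAL_FILES = ["MDR_level_features_norm.csv"]


def missing_data_files(data_dir, names=REQUIRED_FILES):
    """Return the file names from `names` that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in names if not (root / name).is_file()]


# Typical use (the path is a placeholder):
# if not missing_data_files("/path/to/your/data"):
#     mdr_readability.set_data_dir("/path/to/your/data")
```

Passing OPTIONAL_FILES as the second argument checks for the norm file in the same way.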

Quick Start

import mdr_readability

# Point to your data directory (skip if files are in the default location)
mdr_readability.set_data_dir("D:/MDR/data")

text = "The cat sat on the mat. It was a very small cat."

# --- Option 1: MDR score + 12 classic indices ---
df = mdr_readability.compute_mdr(text)
print(df)

# --- Option 2: MDR score + all 42 raw features + norm values ---
df_norm = mdr_readability.compute_mdr_with_norm(text)
print(df_norm.T)  # Transposing makes it easier to read

# --- Option 3: Step by step ---
features_df = mdr_readability.calculate_features(text)
features_df = mdr_readability.calculate_mdr_readability(features_df)
print(features_df["MDR"].iloc[0])

# --- Classic readability only ---
scores = mdr_readability.calculate_classic_readability(text)
labels = [
    "Flesch Reading Ease", "Flesch Kincaid Grade", "Gunning Fog",
    "SMOG Index", "Automated Readability", "Coleman Liau",
    "Linsear Write", "Dale Chall", "Spache", "Rix", "Lix", "Text Standard"
]
for label, score in zip(labels, scores):
    print(f"{label}: {score}")

API Reference

mdr_readability.set_data_dir(path)

Set the directory from which vocabulary data files are loaded.
Call once before any computation when your data files are not in <package_root>/data/.


mdr_readability.calculate_features(text) → pd.DataFrame

Extract all 42 linguistic features from text.
Returns a single-row DataFrame with columns listed in mdr_readability.COLUMN_NAMES.

Feature categories:

Category                        Features (count)
Syllable                        7
Word length & characters        4
Lexical difficulty              6
Type-token ratio / frequency    3
Sentence length                 2
Dependency / syntax             7
Passive voice                   2
Semantic                        7
Referencing / conjunction       4
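As a quick sanity check, the category counts listed above add up to the 42 features the package reports. The dictionary below simply restates that table; its name is illustrative.

```python
FEATURE_CATEGORY_COUNTS = {
    "Syllable": 7,
    "Word length & characters": 4,
    "Lexical difficulty": 6,
    "Type-token ratio / frequency": 3,
    "Sentence length": 2,
    "Dependency / syntax": 7,
    "Passive voice": 2,
    "Semantic": 7,
    "Referencing / conjunction": 4,
}

total = sum(FEATURE_CATEGORY_COUNTS.values())
print(total)  # 42
```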

mdr_readability.calculate_mdr_readability(df) → pd.DataFrame

Apply the MDR linear regression formula to a features DataFrame.
Returns a copy of df with an extra "MDR" column (rounded to 4 decimal places).
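To make "linear regression formula" concrete: a score of this kind is an intercept plus a weighted sum of feature values. The sketch below illustrates that general form only. The weights, the intercept, and the use of feature codes as keys are hypothetical; the package's fitted coefficients are not listed in this document.

```python
def linear_score(features, coefficients, intercept):
    """Intercept plus the weighted sum of feature values, rounded to 4 decimal places."""
    missing = [name for name in coefficients if name not in features]
    if missing:
        raise KeyError(f"missing feature values: {missing}")
    total = intercept + sum(
        weight * features[name] for name, weight in coefficients.items()
    )
    return round(total, 4)


# Hypothetical weights and intercept, NOT the fitted MDR coefficients.
EXAMPLE_COEFFICIENTS = {"MSL": 0.25, "DWR": 4.0}
EXAMPLE_INTERCEPT = 1.5

# Features without a coefficient (here "MDD") are simply ignored.
row = {"MSL": 12.0, "DWR": 0.125, "MDD": 2.1}
print(linear_score(row, EXAMPLE_COEFFICIENTS, EXAMPLE_INTERCEPT))  # 5.0
```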


mdr_readability.calculate_classic_readability(text) → list

Return a list of 12 classic readability scores in this order:

[Flesch Reading Ease, Flesch Kincaid Grade, Gunning Fog, SMOG Index,
 Automated Readability Index, Coleman Liau Index, Linsear Write Formula,
 Dale Chall Readability Score, Spache Readability, RIX, LIX, Text Standard]
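Since the function returns a bare list, the order above is the only thing tying each number to its index. A small wrapper such as the following sketch can turn the list into a dictionary and catch a length mismatch early. The helper label_scores is hypothetical; the labels are copied from the list above.

```python
CLASSIC_LABELS = [
    "Flesch Reading Ease", "Flesch Kincaid Grade", "Gunning Fog", "SMOG Index",
    "Automated Readability Index", "Coleman Liau Index", "Linsear Write Formula",
    "Dale Chall Readability Score", "Spache Readability", "RIX", "LIX",
    "Text Standard",
]


def label_scores(scores, labels=CLASSIC_LABELS):
    """Pair each score with its label, refusing lists of the wrong length."""
    if len(scores) != len(labels):
        raise ValueError(f"expected {len(labels)} scores, got {len(scores)}")
    return dict(zip(labels, scores))


# Typical use:
# scores = mdr_readability.calculate_classic_readability(text)
# by_name = label_scores(scores)
# print(by_name["Gunning Fog"])
```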

mdr_readability.compute_mdr(text) → pd.DataFrame

One-step convenience function.
Returns a single-row DataFrame with MDR plus all 12 classic indices.


mdr_readability.compute_mdr_with_norm(text) → pd.DataFrame

One-step convenience function with norm comparison.
Returns a single-row DataFrame with:

  • MDR_value
  • All 42 raw feature values (<feature_name>)
  • Corresponding norm values (<feature_name>_norm), which are None if the norm file is absent
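The <feature_name> / <feature_name>_norm naming makes it straightforward to line up each raw value with its norm. The sketch below works on a plain dictionary, for example one obtained with df_norm.iloc[0].to_dict(). The helper compare_with_norm and the sample values are illustrative, and treating NaN as "no norm" is an assumption about how missing norms may appear after pandas has processed them.

```python
def _is_missing(value):
    """True for None and for NaN (NaN is the only value not equal to itself)."""
    return value is None or value != value


def compare_with_norm(row):
    """Map feature name -> (raw, norm, raw - norm) for features that have a norm."""
    result = {}
    for name, norm in row.items():
        if not name.endswith("_norm") or _is_missing(norm):
            continue
        feature = name[: -len("_norm")]
        raw = row.get(feature)
        if not _is_missing(raw):
            result[feature] = (raw, norm, raw - norm)
    return result


# Invented values, using feature codes as stand-in column names.
sample_row = {"MDR_value": 3.2, "MSL": 10.0, "MSL_norm": 8.0, "DWR": 0.2, "DWR_norm": None}
print(compare_with_norm(sample_row))  # {'MSL': (10.0, 8.0, 2.0)}
```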

Feature Codes

Each feature has a short code used in the norm CSV:

Code    Feature name
MWLS    avg_syllables_per_word_spacy
WO2S    words_over_2_syllables
W2SR    words_over_2_syllables_ratio
W2SE    words_over_2_syllables_entropy
W2SS    words_over_2_syllables_per_30_sentences
OSW1    one_syllable_words_per_150
OSW2    one_syllable_words_per_100
MWLL    Mean Word Length Refined
MLW     average_letters_per_100_words
MSW     Mean Sentence per Word
WLE     Word Length Entropy
DWR     Difficult Words Ratio
DWE     difficult_words_entropy
WLEC    word_level_entropy_CERF
MLF     Mean Lexical Frequency
LR      Lexical Richness Entropy
STTR    Standard Type Token Ratio
WZE     Word Zipf Entropy
MSL     Mean Sentence Length
SLE     Sentence Length Entropy
MDD     Mean Dependency Distance
DDE     Dependency Distance Entropy
DTE     Dependency Distribution Entropy
SED     Syntax Entropy Dependency
SEP     Syntax Entropy POS
SEC     syntax_entropy_component
PSR     Passive Sentence Ratio
PDE     Passive Dependency Entropy
TE      Topic Entropy
SE      Semantic Entropy
SR      Semantic Richness
SAN     Semantic Accuracy Noun
SAV     Semantic Accuracy Verb
SANV    Semantic Accuracy Noun_Verb
SACW    Semantic Accuracy Content Words
SC      Semantic Clarity
DSE     Descriptive Style Entropy
POE     POS Entropy
RE_I    Referencing Entropy I
RE_II   Referencing Entropy II
RE_III  Referencing Entropy III
CE      Conjunction Entropy I
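If you want to inspect the norm CSV yourself, the codes above can serve as lookup keys. The following sketch reads such a file into a nested dictionary with the standard csv module. It assumes one row per code and one numeric column per level; the helper load_norms and the sample numbers are invented for illustration and are not real norms.

```python
import csv
import io


def load_norms(handle):
    """Read norm rows into {code: {level: value}} from an open CSV file."""
    norms = {}
    for row in csv.DictReader(handle):
        code = row.pop("Code")
        # Skip empty cells and any surplus unnamed fields.
        norms[code] = {
            level: float(value)
            for level, value in row.items()
            if level and value
        }
    return norms


# Tiny invented sample with only two level columns.
sample = io.StringIO("Code,L1,L2\nMSL,8.5,10.0\nDWR,0.05,0.08\n")
norms = load_norms(sample)
print(norms["MSL"]["L2"])  # 10.0
```

For a real file, pass open("MDR_level_features_norm.csv", newline="", encoding="utf-8-sig") instead of the in-memory sample.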

Dependencies

  • spacy >= 3.0 + en_core_web_lg model
  • spacy-syllables
  • nltk (punkt, wordnet, averaged_perceptron_tagger)
  • textstat
  • wordfreq
  • numpy, pandas
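A quick way to see which of these packages are importable in the current environment is shown below. This is a sketch using importlib only; the helper missing_modules is hypothetical. It does not check package versions, the en_core_web_lg model, or the NLTK data sets.

```python
from importlib.util import find_spec

# Import names for the packages listed above
# (the spacy-syllables distribution is imported as spacy_syllables).
DEPENDENCY_MODULES = [
    "spacy", "spacy_syllables", "nltk", "textstat", "wordfreq", "numpy", "pandas",
]


def missing_modules(names=DEPENDENCY_MODULES):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


print(missing_modules())
```

An empty list means all seven modules can be located.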

License

MIT

Download files

Source Distribution

mdr_readability-1.0.0.tar.gz (16.0 kB view details)

Uploaded Source

Built Distribution

mdr_readability-1.0.0-py3-none-any.whl (15.1 kB view details)

Uploaded Python 3

File details

Details for the file mdr_readability-1.0.0.tar.gz.

File metadata

  • Download URL: mdr_readability-1.0.0.tar.gz
  • Upload date:
  • Size: 16.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.7

File hashes

Hashes for mdr_readability-1.0.0.tar.gz
Algorithm Hash digest
SHA256 e764b843345c40b2ea62f6fba45d08982383bae0153011b8f8e1c1fe4bcd8226
MD5 017e666895e652e1d717768ab7e7d3ab
BLAKE2b-256 0a2668a5c1ae5374795dcff554f8fabc974ba9e2d2f58fbb1d6a758b2fdd1c23
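To check a downloaded archive against the SHA256 digest listed above, you can hash the file locally with hashlib. This is a minimal sketch; the helper sha256_of is illustrative, and the file path in the usage comment is a placeholder for wherever you saved the download.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Digest copied from the table above for mdr_readability-1.0.0.tar.gz.
EXPECTED_SDIST_SHA256 = (
    "e764b843345c40b2ea62f6fba45d08982383bae0153011b8f8e1c1fe4bcd8226"
)

# Typical use:
# assert sha256_of("mdr_readability-1.0.0.tar.gz") == EXPECTED_SDIST_SHA256
```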

File details

Details for the file mdr_readability-1.0.0-py3-none-any.whl.

File hashes

Hashes for mdr_readability-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 7fc209cf08b8749b38e1ae14c396c3fb602b268946db19384885e7d4a06d48c9
MD5 8356f7c5c3c967bf47aeafc57f8d9e35
BLAKE2b-256 3637ea5251bb064e98998085df92dc5afc20614486d07f96c87dcc961bd6680b
