Multi-Dimensional Readability (MDR) score and 42 linguistic features for English text
mdr-readability
A Python package for computing the MDR (Multi-Dimensional Readability) score and 42 linguistic features from English text.
MDR is a regression-based readability index that combines lexical, syntactic, and semantic features to predict text difficulty. It achieves R² = 0.9249 on the calibration corpus.
Installation
1. Install the package
pip install mdr-readability
Or, for development (editable install):
git clone https://github.com/jacktanhua/MDR.git
cd MDR
pip install -e .
2. Install the spaCy model
python -m spacy download en_core_web_lg
3. Install NLTK data (first time only)
import nltk
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("averaged_perceptron_tagger")
4. Place the vocabulary data files
Two external data files are required:
| File | Description |
|---|---|
| The New Dale-Chall Familiar Words List_2950.txt | Dale-Chall familiar word list |
| cefrj-vocabulary-profile-1.5.csv | CEFR vocabulary profile (must have headword and CEFR columns) |
Optional (for norm scores):
| File | Description |
|---|---|
| MDR_level_features_norm.csv | Level-specific norm values (must have a Code column plus columns L1–L12) |
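The column requirements in the tables above can be checked before the package ever reads the files. The following is a minimal sketch using only the standard library; `missing_columns` and the two constant names are hypothetical helpers for illustration and are not part of `mdr_readability`. It assumes the CSV files are UTF-8 encoded with the column names on the first row.

```python
import csv
from pathlib import Path

# Required columns as stated in the tables above.
CEFR_COLUMNS = {"headword", "CEFR"}
NORM_COLUMNS = {"Code"} | {f"L{i}" for i in range(1, 13)}  # Code + L1..L12


def missing_columns(csv_path, required):
    """Return the required column names absent from the CSV header, sorted."""
    # "utf-8-sig" reads plain UTF-8 and also tolerates a leading BOM.
    with Path(csv_path).open(newline="", encoding="utf-8-sig") as handle:
        header = next(csv.reader(handle), [])
    present = {name.strip() for name in header}
    return sorted(required - present)
```

An empty list from `missing_columns(path, CEFR_COLUMNS)` means the header satisfies the stated requirement; any names returned are the columns to add or rename.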
By default the package looks for these files in <package_root>/data/.
You can point it to any directory:
import mdr_readability
mdr_readability.set_data_dir("/path/to/your/data")
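Before calling `set_data_dir`, it can help to confirm that the directory actually contains the files listed above. This is a small sketch under that assumption; `missing_data_files` is a hypothetical helper, not a function provided by the package, and the file names are copied from the tables in this section.

```python
from pathlib import Path

# File names as listed in this README.
REQUIRED_FILES = (
    "The New Dale-Chall Familiar Words List_2950.txt",
    "cefrj-vocabulary-profile-1.5.csv",
)
OPTIONAL_FILES = ("MDR_level_features_norm.csv",)


def missing_data_files(data_dir, names=REQUIRED_FILES):
    """Return the file names from `names` that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in names if not (root / name).is_file()]
```

If `missing_data_files("/path/to/your/data")` returns an empty list, the two required files are in place; passing `names=OPTIONAL_FILES` performs the same check for the norm file.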
Quick Start
import mdr_readability
# Point to your data directory (skip if files are in the default location)
mdr_readability.set_data_dir("D:/MDR/data")
text = "The cat sat on the mat. It was a very small cat."
# --- Option 1: MDR score + 12 classic indices ---
df = mdr_readability.compute_mdr(text)
print(df)
# --- Option 2: MDR score + all 42 raw features + norm values ---
df_norm = mdr_readability.compute_mdr_with_norm(text)
print(df_norm.T) # Transposing makes it easier to read
# --- Option 3: Step by step ---
features_df = mdr_readability.calculate_features(text)
features_df = mdr_readability.calculate_mdr_readability(features_df)
print(features_df["MDR"].iloc[0])
# --- Classic readability only ---
scores = mdr_readability.calculate_classic_readability(text)
labels = [
    "Flesch Reading Ease", "Flesch Kincaid Grade", "Gunning Fog",
    "SMOG Index", "Automated Readability", "Coleman Liau",
    "Linsear Write", "Dale Chall", "Spache", "Rix", "Lix", "Text Standard"
]
for label, score in zip(labels, scores):
    print(f"{label}: {score}")
API Reference
mdr_readability.set_data_dir(path)
Set the directory from which vocabulary data files are loaded.
Call once before any computation when your data files are not in <package_root>/data/.
mdr_readability.calculate_features(text) → pd.DataFrame
Extract all 42 linguistic features from text.
Returns a single-row DataFrame with columns listed in mdr_readability.COLUMN_NAMES.
Feature categories:
| Category | Features (count) |
|---|---|
| Syllable | 7 |
| Word length & characters | 4 |
| Lexical difficulty | 6 |
| Type-token ratio / frequency | 3 |
| Sentence length | 2 |
| Dependency / syntax | 7 |
| Passive voice | 2 |
| Semantic | 7 |
| Referencing / conjunction | 4 |
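Many of the 42 features are entropies (for example Word Length Entropy and Sentence Length Entropy in the Feature Codes table). This README does not state how the package computes them, so the sketch below is only an assumption: it shows Shannon entropy in bits over an empirical distribution, which is the usual reading of the term. The function name `shannon_entropy` is hypothetical, and the package's own definitions may differ in base, normalization, or tokenization.

```python
import math
from collections import Counter


def shannon_entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Word lengths of the Quick Start sentence, with punctuation stripped by hand.
words = "The cat sat on the mat It was a very small cat".split()
word_length_entropy = shannon_entropy(len(w) for w in words)
```

A text whose words all have the same length gives an entropy of zero under this definition, and the value grows as the lengths become more varied.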
mdr_readability.calculate_mdr_readability(df) → pd.DataFrame
Apply the MDR linear regression formula to a features DataFrame.
Returns a copy of df with an extra "MDR" column (rounded to 4 d.p.).
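The shape of a linear regression formula can be sketched as an intercept plus a weighted sum of features, rounded to four decimal places. The weights and intercept below are hypothetical placeholders chosen for illustration only; the real MDR coefficients live inside the package and are not reproduced here. Feature codes are used as keys purely for brevity.

```python
# Hypothetical coefficients for illustration only; the real MDR weights and
# intercept are defined inside the package and are not reproduced here.
EXAMPLE_INTERCEPT = 1.0
EXAMPLE_WEIGHTS = {"MSL": 0.05, "DWR": 2.0, "MDD": 0.3}


def linear_score(features, weights=EXAMPLE_WEIGHTS, intercept=EXAMPLE_INTERCEPT):
    """Return intercept + sum(weight * feature), rounded to 4 decimal places."""
    # A missing feature raises KeyError rather than silently counting as zero.
    total = intercept + sum(w * features[name] for name, w in weights.items())
    return round(total, 4)
```

Raising on a missing feature is a deliberate choice in this sketch: a silently skipped term would still produce a plausible-looking score.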
mdr_readability.calculate_classic_readability(text) → list
Return a list of 12 classic readability scores in this order:
[Flesch Reading Ease, Flesch Kincaid Grade, Gunning Fog, SMOG Index,
Automated Readability Index, Coleman Liau Index, Linsear Write Formula,
Dale Chall Readability Score, Spache Readability, RIX, LIX, Text Standard]
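Because the return value is a plain list, position is the only link between a score and its meaning. A minimal sketch of pairing the list with the labels documented above follows; `label_classic_scores` is a hypothetical helper, not part of the package, and the label strings are copied from this section.

```python
# Labels in the documented return order.
CLASSIC_LABELS = (
    "Flesch Reading Ease", "Flesch Kincaid Grade", "Gunning Fog", "SMOG Index",
    "Automated Readability Index", "Coleman Liau Index", "Linsear Write Formula",
    "Dale Chall Readability Score", "Spache Readability", "RIX", "LIX",
    "Text Standard",
)


def label_classic_scores(scores):
    """Pair a 12-item score list with its labels, failing on a length mismatch."""
    if len(scores) != len(CLASSIC_LABELS):
        raise ValueError(f"expected {len(CLASSIC_LABELS)} scores, got {len(scores)}")
    return dict(zip(CLASSIC_LABELS, scores))
```

Checking the length first avoids the silent truncation that a bare `zip` would apply if the list were ever shorter or longer than expected.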
mdr_readability.compute_mdr(text) → pd.DataFrame
One-step convenience function.
Returns a single-row DataFrame with MDR plus all 12 classic indices.
mdr_readability.compute_mdr_with_norm(text) → pd.DataFrame
One-step convenience function with norm comparison.
Returns a single-row DataFrame with:
- MDR value
- All 42 raw feature values (<feature_name>)
- Corresponding norm values (<feature_name>_norm); None if the norm file is absent
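One way to use the paired columns is to compute how far each raw feature sits from its norm. The sketch below works on a plain dict standing in for the single result row; `norm_deviations` is a hypothetical helper, and the keys in the usage example are made up for illustration. It assumes missing norms appear as None, as described above; if pandas has converted them to NaN, an additional `math.isnan` check would be needed.

```python
def norm_deviations(row):
    """Map each feature to (value - norm), skipping features without a norm."""
    deviations = {}
    for key, norm in row.items():
        if not key.endswith("_norm") or norm is None:
            continue
        feature = key[: -len("_norm")]
        if feature in row:
            deviations[feature] = row[feature] - norm
    return deviations
```

With a real result, the row could be obtained with something like `df_norm.iloc[0].to_dict()`, assuming the DataFrame layout described in this section.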
Feature Codes
Each feature has a short code used in the norm CSV:
| Code | Feature name |
|---|---|
| MWLS | avg_syllables_per_word_spacy |
| WO2S | words_over_2_syllables |
| W2SR | words_over_2_syllables_ratio |
| W2SE | words_over_2_syllables_entropy |
| W2SS | words_over_2_syllables_per_30_sentences |
| OSW1 | one_syllable_words_per_150 |
| OSW2 | one_syllable_words_per_100 |
| MWLL | Mean Word Length Refined |
| MLW | average_letters_per_100_words |
| MSW | Mean Sentence per Word |
| WLE | Word Length Entropy |
| DWR | Difficult Words Ratio |
| DWE | difficult_words_entropy |
| WLEC | word_level_entropy_CERF |
| MLF | Mean Lexical Frequency |
| LR | Lexical Richness Entropy |
| STTR | Standard Type Token Ratio |
| WZE | Word Zipf Entropy |
| MSL | Mean Sentence Length |
| SLE | Sentence Length Entropy |
| MDD | Mean Dependency Distance |
| DDE | Dependency Distance Entropy |
| DTE | Dependency Distribution Entropy |
| SED | Syntax Entropy Dependency |
| SEP | Syntax Entropy POS |
| SEC | syntax_entropy_component |
| PSR | Passive Sentence Ratio |
| PDE | Passive Dependency Entropy |
| TE | Topic Entropy |
| SE | Semantic Entropy |
| SR | Semantic Richness |
| SAN | Semantic Accuracy Noun |
| SAV | Semantic Accuracy Verb |
| SANV | Semantic Accuracy Noun_Verb |
| SACW | Semantic Accuracy Content Words |
| SC | Semantic Clarity |
| DSE | Descriptive Style Entropy |
| POE | POS Entropy |
| RE_I | Referencing Entropy I |
| RE_II | Referencing Entropy II |
| RE_III | Referencing Entropy III |
| CE | Conjunction Entropy I |
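When reading the norm CSV, a lookup from code to feature name makes the rows easier to interpret. The sketch below copies only a handful of rows from the table above; `CODE_TO_FEATURE` and `describe_codes` are hypothetical names for illustration, and the mapping would need to be extended with the remaining codes for real use.

```python
# A few rows copied from the table above; extend with the remaining codes as needed.
CODE_TO_FEATURE = {
    "MWLS": "avg_syllables_per_word_spacy",
    "DWR": "Difficult Words Ratio",
    "MSL": "Mean Sentence Length",
    "MDD": "Mean Dependency Distance",
    "STTR": "Standard Type Token Ratio",
}


def describe_codes(codes, table=CODE_TO_FEATURE):
    """Return {code: feature name}, keeping unknown codes visible as None."""
    return {code: table.get(code) for code in codes}
```

Returning None for an unknown code, rather than dropping it, keeps typos in a norm file visible.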
Dependencies
- spacy >= 3.0 + en_core_web_lg model
- spacy-syllables
- nltk (punkt, wordnet, averaged_perceptron_tagger)
- textstat
- wordfreq
- numpy, pandas
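A quick way to see which of these dependencies are importable in the current environment is to probe for each module without importing it. This is a sketch using the standard library only; `missing_modules` is a hypothetical helper, and the import names are assumptions, since a distribution's import name can differ from its pip name. It does not check the NLTK data packages, which are downloaded separately.

```python
from importlib.util import find_spec

# Import names for the dependencies listed above. These are assumptions:
# the import name of a distribution can differ from its pip name
# (for example spacy-syllables is imported as spacy_syllables).
IMPORT_NAMES = (
    "spacy", "en_core_web_lg", "spacy_syllables", "nltk",
    "textstat", "wordfreq", "numpy", "pandas",
)


def missing_modules(names=IMPORT_NAMES):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]
```

An empty list from `missing_modules()` suggests the Python packages and the spaCy model are installed; any names returned point to the installation step that still needs to be run.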
License
MIT