Skip to main content

First practical Python implementation with Devanagari-specific features for Hindi text readability

Project description

hindi-readability 📖

First practical Python implementation with Devanagari-specific features for Hindi text readability.

PyPI version Python License: MIT Tests HuggingFace

Zero external dependencies. Pure Python 3.8+. Works out of the box.


Why does this exist?

English has Flesch-Kincaid, Gunning Fog, and ARI readability formulas built into MS Word since 1992.

Hindi has nothing.

India has 24.8 crore school students, 886 million internet users consuming Hindi content, and 14.7 lakh schools — all producing Hindi text with no way to automatically measure whether it is easy or hard to read.

This package fills that gap with three original formulas designed from scratch for Devanagari script.


Installation

pip install hindi-readability

Quick Start

from hindi_readability import ReadabilityScorer

rs = ReadabilityScorer()

result = rs.score("यह एक सरल वाक्य है।")
print(result["hrs"])           # 75.2  — Hindi Readability Score (0–100)
print(result["label"])         # Easy
print(result["grade_label"])   # Class 6–8
print(result["cbse_level"])    # Madhyamik
print(result["hci"])           # 0.35  — Complexity Index (0–1)

Real Examples

from hindi_readability import ReadabilityScorer
rs = ReadabilityScorer()

# Easy — children's level
easy = "यह एक बच्चा है। वह खेलता है। घर अच्छा है। माँ पानी लाई।"
r = rs.score(easy)
# hrs = 75.2  label = 'Easy'  grade_label = 'Class 6–8'

# Standard newspaper Hindi
medium = "भारत में शिक्षा का स्तर तेजी से बदल रहा है। सरकार नई नीतियां बना रही है।"
r = rs.score(medium)
# hrs = 55.8  label = 'Standard'  grade_label = 'Class 9–10'

# Expert constitutional Hindi
hard = "संविधान की प्रस्तावना में भारत को एक संप्रभु, समाजवादी, धर्मनिरपेक्ष, लोकतांत्रिक गणराज्य घोषित किया गया है।"
r = rs.score(hard)
# hrs = 0.0  label = 'Expert'  grade_label = 'College+'

Full API Reference

rs.score(text) → dict

result = rs.score("यह सरल पाठ है।")
# {
#   "hrs": 75.2, "label": "Easy", "description": "Suitable for Class 3–5 students",
#   "grade": 7, "grade_label": "Class 6–8", "cbse_level": "Madhyamik",
#   "hci": 0.35, "syllables_per_word": 1.77, "conjunct_density": 15.4,
#   "raw": { "words": 13, "sentences": 4, "syllables": 23, "matras": 11,
#            "conjuncts": 2, "viramas": 2, "consonants": 18, ... }
# }

rs.compare(texts) → list sorted easiest first

ranked = rs.compare(["कठिन संवैधानिक पाठ।", "बच्चे खेलते हैं।", "भारत की नीति।"])
for r in ranked:
    print(f"HRS={r['hrs']:5.1f}  {r['label']:10}  {r['text'][:40]}")
# HRS= 91.2  Very easy   बच्चे खेलते हैं।
# HRS= 55.8  Standard    भारत की नीति।
# HRS=  0.0  Expert      कठिन संवैधानिक पाठ।

rs.batch_score(texts) → list in original order

results = rs.batch_score([text1, text2, text3])

rs.is_appropriate_for_grade(text, grade) → bool

rs.is_appropriate_for_grade("यह सरल पाठ है।", grade=5)       # True
rs.is_appropriate_for_grade("संवैधानिक प्रावधान।", grade=5)  # False

rs.simplify_suggestions(text) → list of Hindi suggestions

suggestions = rs.simplify_suggestions("संवैधानिक प्रावधानों के अनुसार...")
# ["संयुक्त अक्षरों वाले शब्द कम करें — तत्सम शब्दों की जगह तद्भव शब्द लिखें।",
#  "वाक्य छोटे करें — एक वाक्य में 10–12 से अधिक शब्द न रखें।"]

Low-level functions

from hindi_readability import (
    hindi_readability_score,  # float 0–100
    hindi_grade_level,        # dict {grade, grade_label, cbse_level}
    hindi_complexity_index,   # float 0–1
    analyse,                  # dict of raw Devanagari script counts
    syllables_per_word,       # float
    conjunct_density,         # conjuncts per 100 words
)

HRS Score Interpretation

Score Label Suitable for
90–100 Very easy Class 1–2 (Prathmik)
70–89 Easy Class 3–5 (Prathmik Uttara)
50–69 Standard Class 6–8 (Madhyamik)
30–49 Difficult Class 9–10 (Uccha Madhyamik)
10–29 Very hard Class 11–12 (Uccha Vidyalay)
0–9 Expert College+ (Snatak)

How the Formulas Work

Why English formulas fail on Hindi

English readability tools count syllables and word length. Hindi requires three features English simply does not have:

Matras (मात्राएँ) — Vowel signs attached to consonants (ि ी ु ू ा े ै ो ौ). Long matras indicate heavier syllables. English formulas cannot detect these.

Conjunct consonants (संयुक्त अक्षर) — Two consonants fused by a virama (्), for example क्ष, त्र, ज्ञ, प्र. These appear mainly in Sanskrit-origin vocabulary which is significantly harder for younger readers. This is the single biggest marker of Hindi difficulty and has no equivalent in English.

Devanagari syllable rules — Every Hindi consonant carries an implicit /a/ vowel unless killed by a virama. Standard English syllable-counting rules are completely blind to this.

HRS Formula

HRS = 206.0
      − (60.0 × avg syllables per word)
      − (1.8  × avg words per sentence)
      − (70.0 × conjunct density)
      − (8.0  × matra complexity)

HGL Formula

HGL = 17.2 − (HRS × 0.14)   →   CBSE Class 1 to College+

HCI Formula

HCI = 0.40×syllable_score + 0.20×sentence_score + 0.25×conjunct_score + 0.15×matra_score

Validation Results

Validated on a 49-sentence corpus (NCERT Class 1-12, Constitution of India, legal texts, Hindi news).

Metric Result Meaning
Pearson r 0.81 Strong correlation with human judgment
Spearman rho 0.75 Consistent rank ordering
Mean Absolute Error 1.67 grades Less than 2 school grades off on average
Accuracy within 2 grades 73.5% 36 / 49 sentences correctly classified

Run validation yourself:

python validation/validate.py

Open Research Directions

This package provides a baseline. The following are open problems suitable for M.Tech dissertation or research paper:

  • Corpus validation : calibrate formula weights against human-graded Hindi texts (teacher-labeled data)
  • Domain calibration : news vs. textbooks vs. legal vs. social media have different norms
  • Hinglish (code-mixed) : no readability tool handles Hindi-English mixed text yet
  • Extension : Bengali, Marathi, Gujarati use the same Devanagari script family
  • ML-based approach : fine-tune IndicBERT for readability regression and compare against this baseline

Running the Tests

git clone https://github.com/Erprabhat8423/hindi-readability.git
cd hindi-readability
python tests/test_all.py
# Tests: 38/38 passed ✓

Project Structure

hindi-readability/
├── hindi_readability/
│   ├── __init__.py      # Public exports
│   ├── script.py        # Devanagari Unicode analyser
│   ├── formulas.py      # HRS, HGL, HCI implementations
│   └── scorer.py        # ReadabilityScorer public API
├── tests/
│   └── test_all.py      # 38 tests
├── pyproject.toml
└── README.md

Changelog

v0.3.0

  • Corpus-calibrated grade formula on 49-sentence human-graded dataset
  • Statistical proof: Pearson r=0.81, Spearman rho=0.75, MAE=1.67 grades
  • Added data/validation_dataset.csv and validation/validate.py
  • Fixed claim language: first practical implementation, not first ever conceptually
  • Python 3.8 build fix in pyproject.toml

v0.2.0

  • Improved README with full API docs, real examples, formula explanations
  • Added HRS score table with CBSE level names in English
  • Added open research directions section for dissertation reference

v0.1.0

  • Initial release — HRS, HGL, HCI formulas
  • ReadabilityScorer with 5 public methods
  • 38 tests passing, zero external dependencies

Citation

@software{hindi_readability,
  author    = {Prabhat Chaudhary},
  title     = {hindi-readability: The First Python Package for Hindi Text Readability},
  year      = {2026},
  version   = {0.3.0},
  publisher = {PyPI},
  url       = {https://pypi.org/project/hindi-readability/}
}

License

MIT — free for academic and commercial use.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hindi_readability-0.3.0.tar.gz (18.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

hindi_readability-0.3.0-py3-none-any.whl (14.8 kB view details)

Uploaded Python 3

File details

Details for the file hindi_readability-0.3.0.tar.gz.

File metadata

  • Download URL: hindi_readability-0.3.0.tar.gz
  • Upload date:
  • Size: 18.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.8.10

File hashes

Hashes for hindi_readability-0.3.0.tar.gz
Algorithm Hash digest
SHA256 4ec1b17fb5c08b88ac89ce5235849dbd0c0451982151661bee40a2bcdf788b92
MD5 2cfbd1f6c6860460f489e4002353437b
BLAKE2b-256 cd8df65500ec1633a66653f512dd9c9b04d9e8c3f85d1cd62804bbb04373bbde

See more details on using hashes here.

File details

Details for the file hindi_readability-0.3.0-py3-none-any.whl.

File metadata

File hashes

Hashes for hindi_readability-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 f26f7b7daee7a012659098a3eb10bbb9f20478a87089ea3bc5d948ec2d70e68e
MD5 0733b9aaec1daf531074dd3df2b9c108
BLAKE2b-256 2d41499793aa7eb6b9dd397deb8385f906529d451659c4f8c5299624f692083d

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page