
QuranicTools: A Python NLP Library for Quranic NLP

Part of Speech Tagging | Dependency Parsing | Lemmatizer | Multilingual Search | Quranic Extractions | Revelation Order | Surah Graph Analysis | Translations | Hadiths

Quranic NLP

Quranic NLP is a computational toolbox to conduct various syntactic and semantic analyses of Quranic verses. The aim is to put together all available resources contributing to a better understanding/analysis of the Quran for everyone.

Contents:

Installation
Pipeline
Input Formats
Verse Information
Translations
Similar Verses
Multiple Matches
Word-level Analysis
JSON Output
Surah-Level Graph Analysis
Token Pattern Queries
Cross-Verse Corpus Search
Hadiths
Visualization
Contributors
Contributing

Installation

Step 1 — Install the package

pip install quranic-nlp

Step 2 — Download the data

The library requires data files (~97 MB) that are downloaded separately from GitHub Releases. From the command line:

quranic_data

Or from Python:

from quranic_nlp.data_requirements import download_data
download_data()

Data is downloaded once and stored inside the package directory automatically.

Development Setup

To set up a local development environment:

git clone https://github.com/language-ml/quranic-nlp.git
cd quranic-nlp
python -m venv .venv
source .venv/bin/activate   # On Windows: .venv\Scripts\activate
pip install -e .
quranic_data

Pipeline

Available pipeline components:

Key	Description
`dep`	Dependency parsing
`pos`	Part-of-speech tagging
`root`	Root extraction
`lem`	Lemmatization

from quranic_nlp import language, utils, constant

pips = 'dep,pos,root,lem'

# Basic pipeline: no hadith fetching (default)
nlp = language.Pipeline(pips, translation_lang='fa#1')

# With hadith fetching enabled (makes one HTTP request per verse — use for single-verse lookups)
nlp_with_hadiths = language.Pipeline(pips, translation_lang='fa#1', hadiths=True)

To see all available translation languages and translators:

utils.print_all_translations()

Input Formats

Four ways to reference a verse or surah:

# 1. surah_number#ayah_number — single Doc (no internet required)
doc = nlp('1#1')

# 2. surah_name#ayah_number — single Doc (requires internet)
doc = nlp('حمد#1')

# 3. surah name or index with surah=True — SurahDoc (all verses of that surah)
surah = nlp('فاتحه', surah=True)   # by Arabic name
surah = nlp(1,       surah=True)   # by integer index
surah = nlp('1',     surah=True)   # by string index

# 4. Free Arabic text — list[Doc] of all matching verses (requires internet)
docs = nlp('رب العالمین')
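
The four input forms above can be told apart by shape alone. As a rough illustration, the sketch below classifies an input before it would be handed to the pipeline. `classify_input` is a hypothetical helper, not part of quranic_nlp, and the library's own parsing rules may differ.

```python
def classify_input(value, surah=False):
    """Classify a pipeline input into one of the four documented forms.

    Hypothetical helper for illustration only; not part of quranic_nlp.
    """
    text = str(value).strip()
    if surah:
        # nlp(1, surah=True), nlp('1', surah=True), nlp('<name>', surah=True)
        return ('surah', int(text) if text.isdigit() else text)
    if '#' in text:
        left, _, right = text.partition('#')
        if right.isdigit():
            if left.isdigit():
                return ('number_ref', (int(left), int(right)))
            return ('name_ref', (left, int(right)))
    # Anything else is treated as free text to search for
    return ('free_text', text)


print(classify_input('1#1'))
print(classify_input(1, surah=True))
```

Separating the offline `number_ref` case from the two forms that need internet access makes it easy to warn or fail early when no connection is available.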

Verse Information

doc = nlp('1#1')

print(doc._.text)              # بِسْمِ اللَّهِ الرَّحْمَـٰنِ الرَّحِيمِ  (full diacritics)
print(doc._.simple_text)       # بسم الله الرحمن الرحیم  (no diacritics)
print(doc._.surah)             # فاتحه
print(doc._.ayah)              # 1
print(doc._.revelation_order)  # 5

Note: str(doc) returns the morphologically segmented tokens (e.g. بِ سْمِ اللَّهِ ...), not the original verse text. Use doc._.text for the full verse text with diacritics, or doc._.simple_text for text without diacritics.
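
To make the difference between vocalized and plain text concrete, here is a minimal sketch of diacritic stripping using only the standard library. It only approximates what `doc._.simple_text` returns: the library may apply further normalization (the sample output above uses the Persian yeh, for instance), and `strip_diacritics` is a hypothetical name, not a quranic_nlp function.

```python
import unicodedata

TATWEEL = '\u0640'  # Arabic elongation character (kashida)


def strip_diacritics(text):
    """Drop combining marks (harakat, shadda, dagger alef) and tatweel."""
    return ''.join(
        ch for ch in text
        if unicodedata.category(ch) != 'Mn' and ch != TATWEEL
    )


vocalized = '\u0628\u0650\u0633\u0652\u0645\u0650'   # بِسْمِ
print(strip_diacritics(vocalized))                   # بسم
```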

Translations

Pass '<lang>#<index>' for a single translator (returns a string):

nlp_en = language.Pipeline(pips, 'en#16')   # Yusuf Ali
doc = nlp_en('1#1')
print(doc._.translations)
# In the name of Allah, the Beneficent, the Merciful.

Pass '<lang>' (no index) for all translators (returns a dict keyed by translator name):

nlp_fa = language.Pipeline(pips, 'fa')
doc = nlp_fa('1#2')
print(doc._.translations)
# {
#   'ansarian': 'همه ستایش ها، ویژه خدا، مالک و مربّی جهانیان است.',
#   'ayati':    'ستايش خدا را كه پروردگار جهانيان است.',
#   'bahrampour': 'ستايش خداى را كه پروردگار جهانيان است',
#   ...   # 12 Persian translators total
# }
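
Because `doc._.translations` is a string in one configuration and a dict in the other, downstream code may want a single shape. The sketch below normalizes both cases to a dict. `translations_as_dict` and the `'selected'` placeholder key are assumptions for illustration, not part of the library.

```python
def translations_as_dict(value, single_name='selected'):
    """Return translations as {translator: text} whatever the input shape.

    Hypothetical helper; `single_name` labels the lone translator when the
    pipeline was created with '<lang>#<index>'.
    """
    if value is None:
        return {}
    if isinstance(value, str):
        return {single_name: value}
    return dict(value)


single = 'In the name of Allah, the Beneficent, the Merciful.'
for name, text in translations_as_dict(single).items():
    print(f'{name}: {text}')
```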

Similar Verses

doc._.sim_ayahs returns a list of (ref, score) tuples sorted by descending similarity score:

doc = nlp('1#2')
for ref, score in doc._.sim_ayahs[:5]:
    print(f'{ref:10s}  score={score:.4f}')

37#182      score=1.0000
6#45        score=0.5199
40#65       score=0.4620
10#10       score=0.3862
39#75       score=0.3793
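
A common follow-up is to keep only the strong matches and split each reference into numbers. The sketch below does that on the sample output above, assuming each `ref` is a `'surah#ayah'` string as the printed values suggest. `filter_similar` is a hypothetical helper.

```python
def filter_similar(pairs, min_score=0.4, drop_exact=False):
    """Keep (surah, ayah, score) triples at or above min_score.

    Assumes each ref is a 'surah#ayah' string; hypothetical helper.
    """
    kept = []
    for ref, score in pairs:
        if score < min_score:
            continue
        if drop_exact and score >= 1.0:
            continue   # skip verses with identical wording
        surah, ayah = (int(part) for part in ref.split('#'))
        kept.append((surah, ayah, score))
    return kept


# Values copied from the sample output above
sample = [('37#182', 1.0), ('6#45', 0.5199), ('40#65', 0.4620),
          ('10#10', 0.3862), ('39#75', 0.3793)]
print(filter_similar(sample, drop_exact=True))
# [(6, 45, 0.5199), (40, 65, 0.462)]
```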

Multiple Matches

When free Arabic text matches multiple verses, nlp(text) returns a list of docs:

docs = nlp('رب العالمین')
print(f'Found {len(docs)} matching verses')
for doc in docs[:3]:
    print(doc._.surah, doc._.ayah, '—', doc._.text)

فاتحه 2 — الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ
مائده 28 — لَئِن بَسَطتَ إِلَيَّ يَدَكَ...
انعام 45 — فَقُطِعَ دَابِرُ الْقَوْمِ...

You can also call search_all explicitly with a max_results cap:

docs = language.search_all(nlp, 'رب العالمین', max_results=5)

Word-level Analysis

doc = nlp('1#1')
word = doc[2]  # third word: اللَّهِ

print(word)                            # اللَّهِ
print(word.pos_)                       # NOUN
print(constant.POS_UNI_FA[word.pos_]) # اسم
print(word.lemma_)                     # ٱللَّه
print(word._.root)                     # اله
print(word.dep_)                       # نعت
print(word._.dep_arc)                  # LTR  (Left-to-Right arc)
print(word.head)                       # رَّحِیمِ

Print a table of all words:

print(f"{'Word':<20} {'POS':<8} {'Lemma':<15} {'Root':<10} {'Dep'}")
print('-' * 65)
for token in doc:
    print(f'{str(token):<20} {token.pos_:<8} {token.lemma_:<15} {str(token._.root):<10} {token.dep_}')

JSON Output

import json

result = language.to_json(pips, doc)
print(json.dumps(result, ensure_ascii=False, indent=2))

[
  {"id": 1, "text": "بِ",      "root": "",    "lemma": "",       "pos": "INTJ", "rel": "مجرور",      "arc": "LTR", "head": "سْمِ"},
  {"id": 2, "text": "سْمِ",   "root": "سمو", "lemma": "ٱسْم",  "pos": "NOUN", "rel": "مضاف الیه", "arc": "LTR", "head": "اللَّهِ"},
  {"id": 3, "text": "اللَّهِ","root": "اله", "lemma": "ٱللَّه","pos": "NOUN", "rel": "نعت",        "arc": "LTR", "head": "رَّحِیمِ"},
  ...
]
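
Since the output is a plain list of dicts, it can be summarized with the standard library alone. The sketch below counts POS tags and groups surface forms by root, using the three rows shown above as input. `summarize` is a hypothetical helper, not a library function.

```python
from collections import Counter, defaultdict

# Three rows copied from the sample output above
records = [
    {"id": 1, "text": "بِ", "root": "", "lemma": "", "pos": "INTJ",
     "rel": "مجرور", "arc": "LTR", "head": "سْمِ"},
    {"id": 2, "text": "سْمِ", "root": "سمو", "lemma": "ٱسْم", "pos": "NOUN",
     "rel": "مضاف الیه", "arc": "LTR", "head": "اللَّهِ"},
    {"id": 3, "text": "اللَّهِ", "root": "اله", "lemma": "ٱللَّه", "pos": "NOUN",
     "rel": "نعت", "arc": "LTR", "head": "رَّحِیمِ"},
]


def summarize(rows):
    """Count POS tags and collect surface forms per non-empty root."""
    pos_counts = Counter(row["pos"] for row in rows)
    by_root = defaultdict(list)
    for row in rows:
        if row["root"]:               # some tokens carry an empty root
            by_root[row["root"]].append(row["text"])
    return pos_counts, dict(by_root)


pos_counts, by_root = summarize(records)
print(dict(pos_counts))
print(by_root)
```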

Surah-Level Graph Analysis

Pass surah=True to get a SurahDoc — an object containing all verse docs for the surah and tools for graph-based analysis.

from quranic_nlp import language, graph

nlp = language.Pipeline('pos,root,lem', 'fa#1')

# Get all verses of a surah as a SurahDoc (surah=True required)
surah = nlp('فاتحه', surah=True)   # by Arabic name
# surah = nlp(1,   surah=True)     # by integer index
# surah = nlp('1', surah=True)     # by string index

print(f'{surah.surah}: {len(surah)} verses')

# Iterate over verse docs
for doc in surah:
    print(doc._.ayah, doc._.text)

# Build a verse-similarity graph (TF-IDF over surface + lemma + root)
G = surah.build_graph(rep='tfidf')

# Or with a sentence-embedding model (any model with .encode())
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer('CAMeL-Lab/bert-base-arabic-camelbert-ca')
# G = surah.build_graph(rep='embedding', model=model, threshold=0.3)

# Find the most central verse
doc, scores = surah.central_verse(method='pagerank')
print(f'Most central: Ayah {doc._.ayah}')
print(doc._.text)
print(scores)

# All centrality methods
for method in ['pagerank', 'degree', 'betweenness', 'eigenvector', 'mst']:
    doc, _ = surah.central_verse(method=method)
    print(f'{method:12s} → Ayah {doc._.ayah}')

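# Maximum Spanning Tree
T = surah.mst()
# nx.info() was removed in NetworkX 3.0; report the size directly instead
print(f'MST: {T.number_of_nodes()} nodes, {T.number_of_edges()} edges')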

# Access the underlying NetworkX graph directly
G = surah.graph
print(f'Nodes: {G.number_of_nodes()}, Edges: {G.number_of_edges()}')
for u, v, data in G.edges(data=True):
    print(f'  Ayah {u+1} ↔ Ayah {v+1}: similarity = {data["weight"]:.3f}')

You can also use the lower-level graph module directly with any list of docs:

from quranic_nlp import language, graph

nlp = language.Pipeline('pos,root,lem')
docs = language.surah_docs(nlp, 'فاتحه')   # or surah_docs(nlp, 1)

G = graph.build_graph(docs, rep='tfidf')
T = graph.mst(G)
doc, scores = graph.central_verse(G, docs, method='pagerank')
print(doc._.surah, doc._.ayah, doc._.text)
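
To show what a TF-IDF verse-similarity graph amounts to, here is a standard-library sketch. It is not the library's implementation: it uses surface tokens only (the library is described as combining surface, lemma, and root), stores edges in a plain dict instead of a NetworkX graph, and all function names are hypothetical. The input is the first four verses of Al-Fatiha without diacritics.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """docs: list of token lists -> list of {term: weight} dicts."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: c * (math.log(n / df[t]) + 1.0) for t, c in Counter(doc).items()}
        for doc in docs
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def similarity_edges(docs, threshold=0.0):
    """Return {(i, j): score} for verse pairs scoring above the threshold."""
    vecs = tfidf_vectors(docs)
    edges = {}
    for i, j in combinations(range(len(docs)), 2):
        score = cosine(vecs[i], vecs[j])
        if score > threshold:
            edges[(i, j)] = score
    return edges


RAHMAN, RAHIM = 'الرحمن', 'الرحيم'
verses = [
    ['بسم', 'الله', RAHMAN, RAHIM],        # 1:1
    ['الحمد', 'لله', 'رب', 'العالمين'],    # 1:2
    [RAHMAN, RAHIM],                        # 1:3
    ['مالك', 'يوم', 'الدين'],              # 1:4
]
edges = similarity_edges(verses)
for (i, j), score in edges.items():
    print(f'Ayah {i + 1} <-> Ayah {j + 1}: {score:.3f}')
```

Only verses 1 and 3 share tokens here, so they form the single edge. Centrality measures such as PageRank then operate on these edge weights.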

Token Pattern Queries

quranic_nlp.query provides spaCy-style token-pattern matching across Quranic verses. Patterns filter on any combination of ROOT, LEMMA, POS, DEP, TEXT, and ARC, with proximity constraints and quantifiers.

Pattern syntax

Key	Description
`TEXT`	Exact surface form (with diacritics)
`LOWER`	Lowercase surface form
`LEMMA`	Canonical lemma
`POS`	Universal POS tag (`'NOUN'`, `'VERB'`, `'ADJ'`, …)
`DEP`	Dependency relation label
`ROOT`	Triliteral Arabic root (e.g. `'رحم'`, `'علم'`)
`ARC`	Dependency arc direction (`'LTR'` / `'RTL'`)
`OP`	Quantifier: `'?'` (0-1), `'*'` (0+), `'+'` (1+), `'!'` (must not match)
`SKIP`	Max tokens to skip before this element — enables proximity matching

Attribute values can be a string (exact match), a list (any-of), or a dict {"IN": [...]} / {"NOT_IN": [...]} / {"REGEX": "..."}.
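
The three value forms can be read as one predicate over a token attribute. The sketch below shows one plausible interpretation. `value_matches` is a hypothetical function, and details such as whether `REGEX` uses search or full-match semantics are assumptions rather than documented behavior.

```python
import re


def value_matches(spec, actual):
    """Check one token attribute against a pattern value.

    spec may be a string (exact), a list (any-of), or a dict with
    IN / NOT_IN / REGEX keys. Hypothetical sketch, not the library's code.
    """
    if isinstance(spec, str):
        return actual == spec
    if isinstance(spec, (list, tuple, set)):
        return actual in spec
    if isinstance(spec, dict):
        if 'IN' in spec and actual not in spec['IN']:
            return False
        if 'NOT_IN' in spec and actual in spec['NOT_IN']:
            return False
        # Assumption: REGEX matches anywhere in the value (re.search)
        if 'REGEX' in spec and re.search(spec['REGEX'], actual or '') is None:
            return False
        return True
    raise TypeError(f'unsupported pattern value: {spec!r}')


print(value_matches('NOUN', 'NOUN'))                     # True
print(value_matches({'NOT_IN': ['VERB']}, 'VERB'))       # False
```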

`VerseMatcher` — full pattern control

from quranic_nlp import language, query

nlp = language.Pipeline('pos,root,lem,dep')
matcher = query.VerseMatcher(nlp)

# Verses containing a NOUN with root رحم
matcher.add('MERCY_NOUN', [[{'ROOT': 'رحم', 'POS': 'NOUN'}]])

# رحم root within 5 tokens of lemma الله  (SKIP for proximity)
matcher.add('MERCY_NEAR_ALLAH', [[
    {'ROOT': 'رحم'},
    {'LEMMA': 'الله', 'SKIP': 5},
]])

# VERB followed within 3 tokens by a NOUN
matcher.add('VERB_THEN_NOUN', [[
    {'POS': 'VERB'},
    {'POS': 'NOUN', 'SKIP': 3},
]])

# Two alternatives under one key
matcher.add('FORGIVENESS', [
    [{'ROOT': 'غفر'}],
    [{'ROOT': 'عفو'}],
])

# Search a single surah — yields (doc, [(key, start, end), ...])
for doc, matches in matcher.search(surah=2):
    for key, start, end in matches:
        print(key, doc._.ayah, doc[start:end])

# Search pre-computed docs (fastest — pipeline already ran)
docs = language.surah_docs(nlp, 'بقره')
for doc, matches in matcher.search(docs=docs):
    for key, start, end in matches:
        print(key, doc._.surah, doc._.ayah, doc[start:end])
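
The `SKIP` key is easiest to understand on a toy example. The sketch below implements a simplified matcher over plain dicts, assuming `SKIP` means "at most this many tokens may lie between the previous element and this one". It takes the first match greedily, without backtracking or quantifiers, and is not the library's matcher.

```python
def token_matches(element, token):
    """True if every attribute in the element (except SKIP) equals the token's."""
    return all(token.get(k) == v for k, v in element.items() if k != 'SKIP')


def find_matches(pattern, tokens):
    """Return (start, end) spans; each element may skip up to SKIP tokens."""
    spans = []
    for start, token in enumerate(tokens):
        if not token_matches(pattern[0], token):
            continue
        pos = start
        for element in pattern[1:]:
            # Candidate positions: next token plus up to SKIP further ones
            stop = min(pos + 2 + element.get('SKIP', 0), len(tokens))
            pos = next((i for i in range(pos + 1, stop)
                        if token_matches(element, tokens[i])), None)
            if pos is None:
                break
        else:
            spans.append((start, pos + 1))
    return spans


tokens = [{'POS': 'VERB'}, {'POS': 'PART'}, {'POS': 'PART'}, {'POS': 'NOUN'}]
print(find_matches([{'POS': 'VERB'}, {'POS': 'NOUN', 'SKIP': 3}], tokens))  # [(0, 4)]
print(find_matches([{'POS': 'VERB'}, {'POS': 'NOUN', 'SKIP': 1}], tokens))  # []
```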

Convenience functions

# All verses where root رحم appears as a NOUN
results = query.find_by_root(nlp, 'رحم', pos='NOUN', surah=1)

# All verses containing lemma الله
results = query.find_by_lemma(nlp, 'الله', surah=2)

# All verses with at least one VERB
results = query.find_by_pos(nlp, 'VERB', surah=1)

# رحم within 5 tokens of الله  (either direction)
results = query.find_near(nlp,
    {'ROOT': 'رحم'}, {'LEMMA': 'الله'}, max_dist=5, surah=1)
for doc, s1, e1, s2, e2 in results:
    print(doc._.ayah, doc[s1:e1], '…', doc[s2:e2])

# Verses containing BOTH رحم root AND علم root  (AND mode)
results = query.find_verses(nlp,
    [{'ROOT': 'رحم'}, {'ROOT': 'علم'}], mode='AND')

# Verses containing رحم OR غفر root  (OR mode)
results = query.find_verses(nlp,
    [{'ROOT': 'رحم'}, {'ROOT': 'غفر'}], mode='OR')

# KWIC concordance — keyword in context
rows = query.concordance(nlp, {'ROOT': 'رحم'}, context=3, surah=1)
for row in rows:
    left  = ' '.join(t.text for t in row['left'])
    right = ' '.join(t.text for t in row['right'])
    print(f"{row['surah']}:{row['ayah']}  {left} [{row['match'].text}] {right}")
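
For readers unfamiliar with KWIC, the idea is a sliding window around each hit. The sketch below builds the same left / match / right layout over a plain list of strings. `kwic` is a hypothetical stand-in and does not reproduce the `surah` and `ayah` fields that `query.concordance` returns.

```python
def kwic(tokens, predicate, context=3):
    """Return keyword-in-context rows for every token satisfying predicate."""
    rows = []
    for i, tok in enumerate(tokens):
        if predicate(tok):
            rows.append({
                'left': tokens[max(0, i - context):i],
                'match': tok,
                'right': tokens[i + 1:i + 1 + context],
            })
    return rows


words = 'a b c KEY d e f g KEY'.split()
for row in kwic(words, lambda t: t == 'KEY', context=2):
    print(' '.join(row['left']), f"[{row['match']}]", ' '.join(row['right']))
```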

Cross-Verse Corpus Search

quranic_nlp.corpus provides a high-speed cross-verse pattern matcher that treats the entire Quran as one flat sequence of roughly 128K tokens. Patterns can freely span verse and surah boundaries. Each lookup step is O(log N), using prebuilt inverted NumPy indexes.

TAG notation uses the Quranic Treebank scheme: N noun · V verb · P preposition · PN proper noun · PRON pronoun · CONJ conjunction · DET determiner · ADJ adjective · NEG negation …
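
The logarithmic lookup described above is what an inverted index with sorted position lists gives you. The sketch below shows the idea with plain lists and `bisect` instead of NumPy arrays. It searches forward only, all names are hypothetical, and it is not the `CorpusIndex` implementation.

```python
from bisect import bisect_left
from collections import defaultdict


def build_index(values):
    """Map each value to the sorted list of flat positions where it occurs."""
    index = defaultdict(list)
    for pos, value in enumerate(values):
        index[value].append(pos)   # appended in increasing order, so sorted
    return dict(index)


def first_within(postings, after, max_dist):
    """First position p with after < p <= after + max_dist, else None."""
    i = bisect_left(postings, after + 1)   # binary search: O(log N)
    if i < len(postings) and postings[i] <= after + max_dist:
        return postings[i]
    return None


def near_pairs(index, a, b, max_dist):
    """Pairs (p, q): value a at p, followed by value b at q within max_dist."""
    pairs = []
    for p in index.get(a, []):
        q = first_within(index.get(b, []), p, max_dist)
        if q is not None:
            pairs.append((p, q))
    return pairs


# Toy flat corpus of root labels; 'x' stands for any other token
index = build_index(['A', 'x', 'x', 'B', 'A', 'B'])
print(near_pairs(index, 'A', 'B', max_dist=2))   # [(4, 5)]
print(near_pairs(index, 'A', 'B', max_dist=3))   # [(0, 3), (4, 5)]
```

Because positions are indexes into one flat sequence, verse boundaries play no role in the search, which is why matches can span them.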

Build / load the index

from quranic_nlp.corpus import CorpusIndex

# First time (~1–2 s): build from morphologhy.csv and save to disk
idx = CorpusIndex.build(save=True)

# Subsequent calls: load from cache in ~0.04 s
idx = CorpusIndex.load()
print(idx)
# → CorpusIndex(N=128,219)

Single-condition search

# All occurrences of root رحم in the Quran
matches = idx.find_root('رحم', max_results=5)
for m in matches:
    print(m)
# → CorpusMatch(key='ROOT:رحم', refs=[1:1], text='رَّحْمَٰنِ')
# → CorpusMatch(key='ROOT:رحم', refs=[1:1], text='رَّحِيمِ')
# → CorpusMatch(key='ROOT:رحم', refs=[1:3], text='رَّحْمَٰنِ')
# → CorpusMatch(key='ROOT:رحم', refs=[1:3], text='رَّحِيمِ')
# → CorpusMatch(key='ROOT:رحم', refs=[2:37], text='رَّحِيمُ')

# Noun occurrences only
matches = idx.find_root('رحم', tag='N', max_results=3)

# By lemma
matches = idx.find_lemma('ٱللَّه', max_results=5)

Proximity search with SKIP (cross-verse)

# Root رحم anywhere within 5 tokens of root علم — crosses verse boundaries
matches = idx.find_root_near_root('رحم', 'علم', max_dist=5, max_results=5)
for m in matches:
    print(m)
    for t in m.tokens:
        print(f'  {t.soure}:{t.ayeh} tok={t.tok_i}  {t.text!r:20}  root={t.root!r}  tag={t.tag}')
# → CorpusMatch(key='ROOT:رحم+ROOT:علم', refs=[5:39, 5:40], text='رَّحِيمٌ تَعْلَمْ')
#     5:39 tok=18  'رَّحِيمٌ'          root='رحم'  tag=ADJ
#     5:40 tok=2   'تَعْلَمْ'          root='علم'  tag=V
# → CorpusMatch(key='ROOT:رحم+ROOT:علم', refs=[55:1, 55:2], text='رَّحْمَٰنُ عَلَّمَ')
#     55:1 tok=0   'رَّحْمَٰنُ'        root='رحم'  tag=N
#     55:2 tok=0   'عَلَّمَ'           root='علم'  tag=V

Surah 55 (Al-Rahman) opens with الرَّحْمَٰنُ عَلَّمَ: the رحم token sits in verse 1 and the علم token starts verse 2, so the match crosses a verse boundary and is still found.

Complex multi-element patterns

# Noun صبر followed within 3 tokens by a verb (cross-verse OK)
matches = idx.search([
    {'TAG': 'N', 'ROOT': 'صبر'},
    {'TAG': 'V', 'SKIP': 3},
], max_results=5)
for m in matches:
    print(m)
# → CorpusMatch(key='match', refs=[2:153, 2:154], text='صَّٰبِرِينَ تَقُولُ')
# → CorpusMatch(key='match', refs=[2:155, 2:156], text='صَّٰبِرِينَ أَصَٰبَتْ')
# → CorpusMatch(key='match', refs=[2:249, 2:250], text='صَّٰبِرِينَ بَرَزُ')
# → CorpusMatch(key='match', refs=[2:250],         text='صَبْرًا ثَبِّتْ')
# → CorpusMatch(key='match', refs=[3:142, 3:143],  text='صَّٰبِرِينَ كُن')

# Optional DET between root علم and a noun (OP='?')
matches = idx.search([
    {'ROOT': 'علم'},
    {'TAG': 'DET', 'OP': '?'},
    {'TAG': 'N'},
], max_results=5)
for m in matches:
    print(m)
# → CorpusMatch(key='match', refs=[2:33],        text='أَعْلَمُ غَيْبَ')
# → CorpusMatch(key='match', refs=[2:60],        text='عَلِمَ كُلُّ')
# → CorpusMatch(key='match', refs=[2:127, 2:128],text='عَلِيمُ رَبَّ')
# → CorpusMatch(key='match', refs=[2:220],       text='يَعْلَمُ ٱلْ مُفْسِدَ')

# Any-of roots (IN syntax)
matches = idx.search([{'ROOT': {'IN': ['رحم', 'علم', 'صبر']}}], max_results=5)

# Cross-verse: رحم ending one verse, علم starting the next (SKIP=1)
matches = idx.search([
    {'ROOT': 'رحم'},
    {'ROOT': 'علم', 'SKIP': 1},
])
for m in matches:
    if len(m.refs) > 1:
        print(m)   # crosses a verse boundary
# → CorpusMatch(key='match', refs=[55:1, 55:2], text='رَّحْمَٰنُ عَلَّمَ')

Inspecting matches

m = matches[0]
print(m.refs)    # → [(55, 1), (55, 2)]
print(m.text)    # → 'رَّحْمَٰنُ عَلَّمَ'
print(m.start, m.end)   # flat corpus positions

for t in m.tokens:
    print(t.soure, t.ayeh, t.tok_i, t.text, t.simple, t.lemma, t.root, t.tag)
# → 55 1 0 رَّحْمَٰنُ  الرحمان  رَّحْمَٰن  رحم  N
# → 55 2 0 عَلَّمَ    علم      عَلَّم     علم  V
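
Flat corpus positions can be mapped back to verse references with a table of verse start offsets. The sketch below illustrates that mapping on toy token counts. It assumes spans are half-open `[start, end)`, which the source does not state, and `make_locator` and `span_refs` are hypothetical names.

```python
from bisect import bisect_right


def make_locator(verse_lengths):
    """verse_lengths: list of ((surah, ayah), n_tokens) in corpus order."""
    starts, refs, total = [], [], 0
    for ref, n_tokens in verse_lengths:
        starts.append(total)
        refs.append(ref)
        total += n_tokens

    def locate(position):
        """Return ((surah, ayah), token_index_in_verse) for a flat position."""
        if not 0 <= position < total:
            raise IndexError(position)
        i = bisect_right(starts, position) - 1
        return refs[i], position - starts[i]

    return locate


def span_refs(locate, start, end):
    """Distinct verse refs touched by the half-open span [start, end)."""
    seen = []
    for pos in range(start, end):
        ref, _ = locate(pos)
        if ref not in seen:
            seen.append(ref)
    return seen


# Toy token counts, not real corpus statistics
locate = make_locator([((1, 1), 4), ((1, 2), 4), ((1, 3), 2)])
print(locate(4))                  # ((1, 2), 0)
print(span_refs(locate, 3, 5))    # [(1, 1), (1, 2)]
```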

Hadiths

Hadith fetching is disabled by default (it makes one HTTP request per verse, which is slow for surah-level processing). Enable it explicitly with hadiths=True:

# Create a pipeline with hadith fetching enabled
nlp_h = language.Pipeline(pips, translation_lang='fa#1', hadiths=True)
doc = nlp_h('1#1')

hadiths = doc._.hadiths
if hadiths:
    print(f'Found {len(hadiths)} hadith(s)')
    print(hadiths[0])
else:
    print('No hadiths found or API unavailable.')

When hadiths=False (the default), doc._.hadiths is None.
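
If the same verses are looked up repeatedly, memoizing the per-verse request avoids paying the HTTP cost more than once. The sketch below shows the pattern with `functools.lru_cache`. `fetch_hadiths` is a stand-in that simulates a slow call and returns placeholder values, and nothing here reflects how quranic_nlp fetches or caches hadiths internally.

```python
from functools import lru_cache

calls = []   # records every simulated network request


def fetch_hadiths(ref):
    """Stand-in for a slow network call; records each invocation."""
    calls.append(ref)
    return (f'placeholder for {ref}',)   # dummy payload, not real data


@lru_cache(maxsize=None)
def cached_hadiths(ref):
    """Memoized wrapper: each distinct ref is fetched only once."""
    return fetch_hadiths(ref)


for ref in ['1#1', '1#2', '1#1']:
    cached_hadiths(ref)
print(len(calls))   # 2
```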

Visualization

Render the dependency parse tree using spaCy's displacy:

from spacy import displacy

options = {'compact': True, 'bg': '#09a3d5', 'color': 'white', 'font': 'Arial'}
displacy.render(doc, style='dep', options=options, jupyter=True)

Contributors

Seyyed Mohammad Aref Jahanmir
Alireza Sahebi
Doratossadat Dastgheyb
Erfan Mohammadi
Mahdi Ahmadi
Ehsaneddin Asgari

📧 Contact: asgari [dot] berkeley [dot] edu

Contributing

We warmly welcome contributions from the community! Whether you are a researcher, developer, linguist, or simply passionate about the Quran and NLP, there are many ways to get involved:

Area	How to Help
New features	New pipeline components, morphological analyses, or language support
Data quality	Corrections to POS tags, dependency parses, lemmas, or roots
Translations	Add or improve Quranic translations for underrepresented languages
Testing	Help increase test coverage
Bug reports	Open an issue if something doesn't work as expected
Documentation	Clearer examples, tutorials, or API docs

To contribute, fork the repository, make your changes, and open a pull request. For larger changes, please open an issue first to discuss your idea.

We believe open collaboration leads to better tools for everyone. Every contribution, big or small, is valued and appreciated.

Release history

Version	Released
1.3.9 (this version)	Mar 10, 2026
1.3.8	Mar 10, 2026
1.3.7	Mar 10, 2026
1.3.6	Mar 10, 2026
1.3.5	Mar 10, 2026
1.3.4	Mar 10, 2026
1.3.3	Mar 10, 2026
1.3.2	Mar 10, 2026
1.3.1	Mar 10, 2026
1.3.0	Mar 9, 2026
1.2.9	Mar 9, 2026
1.2.8	Mar 9, 2026
1.2.7	Mar 9, 2026
1.2.6	Mar 9, 2026
1.2.5	Mar 9, 2026
1.2.4	Mar 9, 2026
1.2.3	Mar 9, 2026
1.2.2	Mar 9, 2026
1.2.1	Mar 9, 2026
1.1.8	Jun 10, 2023
1.1.7	Jun 8, 2023
1.1.6	Mar 18, 2023
1.1.5	Mar 17, 2023
1.1.4	Mar 16, 2023
1.1.3	Mar 16, 2023
1.1.2	Feb 23, 2023
1.1	Feb 19, 2023
1.0	Feb 19, 2023

Download files

File	Type	Size	Uploaded
quranic_nlp-1.3.9.tar.gz	Source distribution	82.7 kB	Mar 10, 2026
quranic_nlp-1.3.9-py3-none-any.whl	Built distribution (Python 3)	41.8 kB	Mar 10, 2026

Both files were uploaded via twine/6.2.0 CPython/3.13.5, without Trusted Publishing.

Hashes for quranic_nlp-1.3.9.tar.gz

Algorithm	Hash digest
SHA256	`84c206a798b50b5d2624971c4e91ff78222405711c283134e4c7a7c6e3c68be4`
MD5	`116f8bcd0d0ca3116873b39f83cca0f5`
BLAKE2b-256	`a5cc6ea5060feecdea5289584f352c94c38365ee668702b9ac7d385a7467b9d4`

Hashes for quranic_nlp-1.3.9-py3-none-any.whl

Algorithm	Hash digest
SHA256	`5efdbc187790d13adedf5da4e9098233a44912c2751552a21c0229b5a0fba663`
MD5	`c2a6c23c1497bd0510cf3dfe17ba785c`
BLAKE2b-256	`d80534f3fdc351db3c988f81e493c895f36322286f6456194aa7ea4bb183316d`
