quranic-nlp

Project description

QuranicTools: A Python NLP Library for Quranic NLP

Part of Speech Tagging | Dependency Parsing | Lemmatizer | Multilingual Search | Quranic Extractions | Revelation Order | Embeddings (coming soon) | Translations

Quranic NLP

Quranic NLP is a computational toolbox for syntactic and semantic analysis of Quranic verses. It aims to bring together the available resources that contribute to a better understanding and analysis of the Quran, and to make them accessible to everyone.

Installation

Step 1 — Install the package

pip install quranic-nlp

Step 2 — Download the data

The library requires data files (~97 MB) that are downloaded separately from GitHub Releases. After installing the package, run the following command in a terminal:

quranic_data

Or from Python:

from quranic_nlp.data_requirements import download_data
download_data()

Data is downloaded once and stored inside the package directory automatically.

Development Setup

To set up a local development environment:

git clone https://github.com/language-ml/hadith-quranic_nlp.git
cd hadith-quranic_nlp
python -m venv .venv
source .venv/bin/activate   # On Windows: .venv\Scripts\activate
pip install -e .
quranic_data

Pipeline

Available pipeline components:

Key    Description
dep    Dependency parsing
pos    Part-of-speech tagging
root   Root extraction
lem    Lemmatization

from quranic_nlp import language, utils, constant

pips = 'dep,pos,root,lem'
nlp = language.Pipeline(pips, translation_lang='fa#1')

To see all available translation languages and translators:

utils.print_all_translations()

Input Formats

Three ways to reference a verse:

# 1. surah_number#ayah_number (no internet required)
doc = nlp('1#1')

# 2. surah_name#ayah_number (requires internet)
doc = nlp('حمد#1')

# 3. Free Arabic text — returns a list of all matching docs (requires internet)
docs = nlp('رب العالمین')
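
The three formats can be told apart by the `#` separator and by whether the part before it is numeric. The sketch below uses a hypothetical helper, `classify_reference`, which is not part of quranic_nlp; it only illustrates the distinction and makes no claim about how the library parses its input.

```python
def classify_reference(query: str):
    """Classify a verse query into one of the three input formats.

    Hypothetical helper for illustration only; not part of quranic_nlp.
    Returns a (kind, surah_or_text, ayah) tuple.
    """
    query = query.strip()
    if "#" in query:
        surah, _, ayah = query.partition("#")
        surah, ayah = surah.strip(), ayah.strip()
        if ayah.isdecimal():
            if surah.isdecimal():
                # Format 1: surah_number#ayah_number
                return ("number", int(surah), int(ayah))
            if surah:
                # Format 2: surah_name#ayah_number
                return ("name", surah, int(ayah))
    # Format 3: anything else is treated as free text
    return ("text", query, None)


for q in ["1#1", "حمد#1", "رب العالمین"]:
    print(classify_reference(q))
```

Only the first format is documented as working offline, so a check like this could be used to warn before issuing a query that needs network access.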

Verse Information

doc = nlp('1#1')

print(doc)                   # بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِیمِ
print(doc._.text)            # بِسْمِ اللَّهِ الرَّحْمَـٰنِ الرَّحِيمِ  (full diacritics)
print(doc._.surah)           # فاتحه
print(doc._.ayah)            # 1
print(doc._.revelation_order)  # 5

Translations

Pass '<lang>#<index>' for a single translator (returns a string):

nlp_en = language.Pipeline(pips, 'en#16')   # Yusuf Ali
doc = nlp_en('1#1')
print(doc._.translations)
# In the name of Allah, the Beneficent, the Merciful.

Pass '<lang>' (no index) for all translators (returns a dict keyed by translator name):

nlp_fa = language.Pipeline(pips, 'fa')
doc = nlp_fa('1#2')
print(doc._.translations)
# {
#   'ansarian': 'همه ستایش ها، ویژه خدا، مالک و مربّی جهانیان است.',
#   'ayati':    'ستايش خدا را كه پروردگار جهانيان است.',
#   'bahrampour': 'ستايش خداى را كه پروردگار جهانيان است',
#   ...   # 12 Persian translators total
# }
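
Because doc._.translations is a string in one case and a dict in the other, code that handles both may want to normalize the value first. The following is a minimal sketch; `parse_translation_spec` and `as_translation_dict` are hypothetical helpers written for this example and are not part of quranic_nlp.

```python
def parse_translation_spec(spec: str):
    """Split '<lang>#<index>' into (lang, index); index is None when omitted."""
    lang, sep, index = spec.partition("#")
    return (lang, int(index)) if sep else (lang, None)


def as_translation_dict(value, label="selected"):
    """Return translations as a dict, whether a str or a dict was given.

    A single translation (str) is wrapped under the placeholder key `label`.
    """
    if isinstance(value, str):
        return {label: value}
    return dict(value)


print(parse_translation_spec("en#16"))  # ('en', 16)
print(parse_translation_spec("fa"))     # ('fa', None)

single = "In the name of Allah, the Beneficent, the Merciful."
for name, text in as_translation_dict(single).items():
    print(name, "->", text)
```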

Similar Verses

doc._.sim_ayahs returns a list of (ref, score) tuples, sorted from most to least similar:

doc = nlp('1#2')
for ref, score in doc._.sim_ayahs[:5]:
    print(f'{ref:10s}  score={score:.4f}')

Output:

37#182      score=1.0000
6#45        score=0.5199
40#65       score=0.4620
10#10       score=0.3862
39#75       score=0.3793
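
A common follow-up is to keep only the matches above a score cutoff and to split each reference into numbers. The sketch below works on the sample values shown above, assuming each ref is a 'surah#ayah' string; `filter_similar` and the 0.4 cutoff are illustrative choices, not part of the library.

```python
# Sample (ref, score) pairs copied from the output above.
sim_ayahs = [
    ("37#182", 1.0000),
    ("6#45", 0.5199),
    ("40#65", 0.4620),
    ("10#10", 0.3862),
    ("39#75", 0.3793),
]


def filter_similar(pairs, min_score=0.4):
    """Keep (surah, ayah, score) triples whose score is at least min_score.

    Hypothetical helper; assumes every ref has the form 'surah#ayah'.
    """
    kept = []
    for ref, score in pairs:
        if score >= min_score:
            surah, ayah = (int(part) for part in ref.split("#"))
            kept.append((surah, ayah, score))
    return kept


for surah, ayah, score in filter_similar(sim_ayahs):
    print(f"{surah}:{ayah}  score={score:.4f}")
```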

Multiple Matches

When free Arabic text matches multiple verses, nlp(text) returns a list of docs:

docs = nlp('رب العالمین')
print(f'Found {len(docs)} matching verses')
for doc in docs[:3]:
    print(doc._.surah, doc._.ayah, '—', doc._.text)

Output:

فاتحه 2 — الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ
مائده 28 — لَئِن بَسَطتَ إِلَيَّ يَدَكَ...
انعام 45 — فَقُطِعَ دَابِرُ الْقَوْمِ...

You can also call search_all explicitly with a max_results cap:

docs = language.search_all(nlp, 'رب العالمین', max_results=5)

Word-level Analysis

doc = nlp('1#1')
word = doc[2]  # third word: اللَّهِ

print(word)                            # اللَّهِ
print(word.pos_)                       # NOUN
print(constant.POS_UNI_FA[word.pos_]) # اسم
print(word.lemma_)                     # ٱللَّه
print(word._.root)                     # اله
print(word.dep_)                       # نعت
print(word._.dep_arc)                  # LTR  (Left-to-Right arc)
print(word.head)                       # رَّحِیمِ

Print a table of all words:

print(f"{'Word':<20} {'POS':<8} {'Lemma':<15} {'Root':<10} {'Dep'}")
print('-' * 65)
for token in doc:
    print(f'{str(token):<20} {token.pos_:<8} {token.lemma_:<15} {str(token._.root):<10} {token.dep_}')

JSON Output

import json

result = language.to_json(pips, doc)
print(json.dumps(result, ensure_ascii=False, indent=2))

Output:

[
  {"id": 1, "text": "بِ",      "root": "",    "lemma": "",       "pos": "INTJ", "rel": "مجرور",      "arc": "LTR", "head": "سْمِ"},
  {"id": 2, "text": "سْمِ",   "root": "سمو", "lemma": "ٱسْم",  "pos": "NOUN", "rel": "مضاف الیه", "arc": "LTR", "head": "اللَّهِ"},
  {"id": 3, "text": "اللَّهِ","root": "اله", "lemma": "ٱللَّه","pos": "NOUN", "rel": "نعت",        "arc": "LTR", "head": "رَّحِیمِ"},
  ...
]
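
Since the result is a list of plain dicts, it can be post-processed with the standard library alone. The sketch below hard-codes the three records shown above (the elided remainder is left out) and derives a POS count, the non-empty roots, and a word-to-head lookup; the variable names are illustrative.

```python
from collections import Counter

# The first three records from the sample output; the rest is elided there.
records = [
    {"id": 1, "text": "بِ", "root": "", "lemma": "", "pos": "INTJ",
     "rel": "مجرور", "arc": "LTR", "head": "سْمِ"},
    {"id": 2, "text": "سْمِ", "root": "سمو", "lemma": "ٱسْم", "pos": "NOUN",
     "rel": "مضاف الیه", "arc": "LTR", "head": "اللَّهِ"},
    {"id": 3, "text": "اللَّهِ", "root": "اله", "lemma": "ٱللَّه", "pos": "NOUN",
     "rel": "نعت", "arc": "LTR", "head": "رَّحِیمِ"},
]

# How often each POS tag occurs.
pos_counts = Counter(record["pos"] for record in records)

# Roots, skipping tokens whose root field is empty.
roots = [record["root"] for record in records if record["root"]]

# Map each token's surface form to the surface form of its head.
head_of = {record["text"]: record["head"] for record in records}

print(dict(pos_counts))
print(roots)
for word, head in head_of.items():
    print(word, "->", head)
```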

Surah-Level Graph Analysis

nlp('فاتحه') (or nlp(1)) returns a SurahDoc — an object containing all verse docs for the surah and tools for graph-based analysis.

from quranic_nlp import language, graph

nlp = language.Pipeline('pos,root,lem', 'fa#1')

# Get all verses of a surah as a SurahDoc
surah = nlp('فاتحه')          # by Arabic name
# surah = nlp(1)              # by integer index
# surah = nlp('1')            # by string index

print(f'{surah.surah}: {len(surah)} verses')

# Iterate over verse docs
for doc in surah:
    print(doc._.ayah, doc._.text)

# Build a verse-similarity graph (TF-IDF over surface + lemma + root)
G = surah.build_graph(rep='tfidf')

# Or with a sentence-embedding model (any model with .encode())
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer('CAMeL-Lab/bert-base-arabic-camelbert-ca')
# G = surah.build_graph(rep='embedding', model=model, threshold=0.3)

# Find the most central verse
doc, scores = surah.central_verse(method='pagerank')
print(f'Most central: Ayah {doc._.ayah}')
print(doc._.text)
print(scores)

# All centrality methods
for method in ['pagerank', 'degree', 'betweenness', 'eigenvector', 'mst']:
    doc, _ = surah.central_verse(method=method)
    print(f'{method:12s} → Ayah {doc._.ayah}')

# Maximum Spanning Tree
T = surah.mst()
print(T)  # short summary with node and edge counts; nx.info() was removed in NetworkX 3.0

# Access the underlying NetworkX graph directly
G = surah.graph
print(f'Nodes: {G.number_of_nodes()}, Edges: {G.number_of_edges()}')
for u, v, data in G.edges(data=True):
    print(f'  Ayah {u+1} ↔ Ayah {v+1}: similarity = {data["weight"]:.3f}')
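
To make the idea of a TF-IDF verse-similarity graph concrete, here is a minimal standard-library sketch. It is not the library's implementation: the tokenization (whitespace), the smoothed IDF formula, the dict-of-edges graph format, and the toy documents are all assumptions made for this example, and the library additionally combines surface forms, lemmas, and roots.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(token_lists):
    """Return one sparse {token: weight} TF-IDF vector per document."""
    n_docs = len(token_lists)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in token_lists for tok in set(toks))
    vectors = []
    for toks in token_lists:
        tf = Counter(toks)
        vectors.append({
            tok: (count / len(toks)) * (math.log((1 + n_docs) / (1 + df[tok])) + 1)
            for tok, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b[tok] for tok, weight in a.items() if tok in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_similarity_edges(token_lists, threshold=0.0):
    """Return {(i, j): similarity} for pairs scoring above the threshold."""
    vectors = tfidf_vectors(token_lists)
    edges = {}
    for i, j in combinations(range(len(vectors)), 2):
        weight = cosine(vectors[i], vectors[j])
        if weight > threshold:
            edges[(i, j)] = weight
    return edges


# Toy documents standing in for verses (placeholder data, not Quranic text).
verses = ["a b c", "a b d", "e f g"]
edges = build_similarity_edges([verse.split() for verse in verses])
for (i, j), weight in edges.items():
    print(f"doc {i} <-> doc {j}: similarity = {weight:.3f}")
```

Raising the threshold prunes weak edges, which is the same role the threshold argument plays in the embedding example above.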

You can also use the lower-level graph module directly with any list of docs:

from quranic_nlp import language, graph

nlp = language.Pipeline('pos,root,lem')
docs = language.surah_docs(nlp, 'فاتحه')   # or surah_docs(nlp, 1)

G = graph.build_graph(docs, rep='tfidf')
T = graph.mst(G)
doc, scores = graph.central_verse(G, docs, method='pagerank')
print(doc._.surah, doc._.ayah, doc._.text)
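
For readers unfamiliar with the graph terms used here, the sketch below shows a maximum spanning tree (Kruskal's algorithm, heaviest edges first) and a weighted-degree centrality on a small hand-made graph. The edge weights are invented for illustration, and how the library defines its own 'mst' and 'degree' methods is not stated here, so this should be read as a generic example rather than a description of quranic_nlp internals.

```python
def maximum_spanning_tree(n_nodes, edges):
    """Kruskal's algorithm on {(u, v): weight}, taking heaviest edges first."""
    parent = list(range(n_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = {}
    ordered = sorted(edges.items(), key=lambda item: item[1], reverse=True)
    for (u, v), weight in ordered:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # adding this edge does not create a cycle
            parent[root_u] = root_v
            tree[(u, v)] = weight
    return tree


def weighted_degree(n_nodes, edges):
    """Sum of incident edge weights for every node."""
    totals = [0.0] * n_nodes
    for (u, v), weight in edges.items():
        totals[u] += weight
        totals[v] += weight
    return totals


# Invented similarity weights between four nodes (placeholder data).
edges = {(0, 1): 0.9, (0, 2): 0.2, (1, 2): 0.6, (2, 3): 0.4}

tree = maximum_spanning_tree(4, edges)
scores = weighted_degree(4, edges)
central = max(range(4), key=scores.__getitem__)

print("Tree edges:", sorted(tree))
print("Most central node:", central)
```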

Hadiths

hadiths = doc._.hadiths
if hadiths:
    print(f'Found {len(hadiths)} hadith(s)')
    print(hadiths[0])
else:
    print('No hadiths found or API unavailable.')

Visualization

Render the dependency parse tree using spaCy's displacy:

from spacy import displacy

options = {'compact': True, 'bg': '#09a3d5', 'color': 'white', 'font': 'Arial'}
displacy.render(doc, style='dep', options=options, jupyter=True)
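
Outside a notebook, displacy.render can be called with jupyter=False, in which case it returns the markup as a string instead of displaying it. The sketch below shows one way to write that string to disk; `save_markup` is a hypothetical helper, and a placeholder SVG string stands in for the real render output so that the example runs without spaCy or the data files.

```python
import tempfile
from pathlib import Path


def save_markup(markup: str, path) -> Path:
    """Write rendered markup to `path` as UTF-8, creating parent folders."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(markup, encoding="utf-8")
    return path


# In a real session the markup would come from spaCy, for example:
#   svg = displacy.render(doc, style='dep', options=options, jupyter=False)
svg = '<svg xmlns="http://www.w3.org/2000/svg"></svg>'  # placeholder markup

target = Path(tempfile.gettempdir()) / "quranic_nlp_demo" / "parse.svg"
out = save_markup(svg, target)
print(out)
```

Writing with an explicit UTF-8 encoding matters here because the rendered tree contains Arabic text.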

Contributors

  • Seyyed Mohammad Aref Jahanmir
  • Alireza Sahebi
  • Doratossadat Dastgheyb
  • Erfan Mohammadi
  • Mahdi Ahmadi
  • Ehsaneddin Asgari

📧 Contact: asgari [at] berkeley [dot] edu

Contributing

We warmly welcome contributions from the community! Whether you are a researcher, developer, linguist, or simply passionate about the Quran and NLP, there are many ways to get involved:

Area            How to Help
New features    New pipeline components, morphological analyses, or language support
Data quality    Corrections to POS tags, dependency parses, lemmas, or roots
Translations    Add or improve Quranic translations for underrepresented languages
Testing         Help increase test coverage
Bug reports     Open an issue if something doesn't work as expected
Documentation   Clearer examples, tutorials, or API docs

To contribute, fork the repository, make your changes, and open a pull request. For larger changes, please open an issue first to discuss your idea.

We believe open collaboration leads to better tools for everyone. Every contribution, big or small, is valued and appreciated.

Download files

Source Distribution

quranic_nlp-1.3.6.tar.gz (60.9 kB)

Built Distribution

quranic_nlp-1.3.6-py3-none-any.whl (25.9 kB)
