Hazm - Persian NLP Toolkit
Hazm is a Python library for natural language processing on Persian text. It offers tools for analyzing, processing, and understanding Persian text: you can use Hazm to normalize text, tokenize sentences and words, lemmatize words, assign part-of-speech tags, identify dependency relations, create word and sentence embeddings, and read popular Persian corpora.
Features
- Normalization: Converts text to a standard form (diacritic removal, ZWNJ correction, etc.).
- Tokenization: Splits text into sentences and words.
- Lemmatization: Reduces words to their base forms.
- POS tagging: Assigns a part of speech to each word.
- Dependency parsing: Identifies the syntactic relations between words.
- Embedding: Creates vector representations of words and sentences.
- Hugging Face Integration: Automatically download and cache pretrained models from the Hub.
- Persian corpora reading: Easily read popular Persian corpora with ready-made scripts.
Installation
To install the latest version of Hazm (requires Python 3.12+), run:
pip install hazm
To use the pretrained models from Hugging Face, ensure you have the huggingface-hub package:
pip install huggingface-hub
Pretrained Models
Hazm supports automatic downloading of pretrained models. You can find all available models (POS Tagger, Chunker, Embeddings, etc.) on our official Hugging Face page:
👉 Roshan Research on Hugging Face: https://huggingface.co/roshan-research
When using Hazm, simply provide the repo_id and model_filename as shown in the examples below, and the library will handle the rest.
Usage
from hazm import *
# ===============================
# Stemming
# ===============================
stemmer = Stemmer()
stem = stemmer.stem('کتاب‌ها')
print(stem) # کتاب
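# unlike the lemmatizer below, the stemmer strips common suffixes
# heuristically and does not consult a lexicon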
# ===============================
# Normalizing
# ===============================
normalizer = Normalizer()
normalized_text = normalizer.normalize('من کتاب های زیــــادی دارم .')
print(normalized_text) # من کتاب‌های زیادی دارم.
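# the normalizer removed the kashida elongation in "زیــــادی", fixed the
# space before the period, and joined "کتاب های" with a zero-width non-joiner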
# ===============================
# Lemmatizing
# ===============================
lemmatizer = Lemmatizer()
lem = lemmatizer.lemmatize('می‌نویسیم')
print(lem) # نوشت#نویس
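# verb lemmas are returned as "past_stem#present_stem":
# نوشت is the past stem and نویس the present stem of "to write"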
# ===============================
# Sentence tokenizing
# ===============================
sentence_tokenizer = SentenceTokenizer()
sent_tokens = sentence_tokenizer.tokenize('ما کتاب می‌خوانیم. یادگیری خوب است.')
print(sent_tokens) # ['ما کتاب می\u200cخوانیم.', 'یادگیری خوب است.']
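# \u200c is the zero-width non-joiner (ZWNJ) that links the verb
# prefix "می" to its stem in Persian orthography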
# ===============================
# Word tokenizing
# ===============================
word_tokenizer = WordTokenizer()
word_tokens = word_tokenizer.tokenize('ما کتاب می‌خوانیم')
print(word_tokens) # ['ما', 'کتاب', 'می\u200cخوانیم']
# ===============================
# Part of speech tagging
# ===============================
tagger = POSTagger(repo_id="roshan-research/hazm-postagger", model_filename="pos_tagger.model")
tagged_words = tagger.tag(word_tokens)
print(tagged_words) # [('ما', 'PRON'), ('کتاب', 'NOUN'), ('می\u200cخوانیم', 'VERB')]
# ===============================
# Chunking
# ===============================
chunker = Chunker(repo_id="roshan-research/hazm-chunker", model_filename="chunker.model")
chunked_tree = tree2brackets(chunker.parse(tagged_words))
print(chunked_tree) # [ما NP] [کتاب NP] [می‌خوانیم VP]
# ===============================
# Word embedding
# ===============================
word_embedding = WordEmbedding.load(repo_id='roshan-research/hazm-word-embedding', model_filename='fasttext_skipgram_300.bin', model_type='fasttext')
odd_word = word_embedding.doesnt_match(['کتاب', 'دفتر', 'قلم', 'پنجره'])
print(odd_word) # پنجره
# ===============================
# Sentence embedding
# ===============================
sent_embedding = SentEmbedding.load(repo_id='roshan-research/hazm-sent-embedding', model_filename='sent2vec-naab.model')
sentence_similarity = sent_embedding.similarity('او شیر می‌خورد', 'شیر غذا می‌خورد')
print(sentence_similarity) # 0.4643607437610626
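# similarity score between the two sentence vectors (higher means more similar)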
# ===============================
# Dependency parsing
# ===============================
parser = DependencyParser(tagger=tagger, lemmatizer=lemmatizer, repo_id="roshan-research/hazm-dependency-parser", model_filename="langModel.mco")
dependency_graph = parser.parse(word_tokens)
print(dependency_graph)
"""
{0: {'address': 0,
'ctag': 'TOP',
'deps': defaultdict(<class 'list'>, {'root': [3]}),
'feats': None,
'head': None,
'lemma': None,
'rel': None,
'tag': 'TOP',
'word': None},
1: {'address': 1,
'ctag': 'PRON',
'deps': defaultdict(<class 'list'>, {}),
'feats': '_',
'head': 3,
'lemma': 'ما',
'rel': 'SBJ',
'tag': 'PRON',
'word': 'ما'},
2: {'address': 2,
'ctag': 'NOUN',
'deps': defaultdict(<class 'list'>, {}),
'feats': '_',
'head': 3,
'lemma': 'کتاب',
'rel': 'OBJ',
'tag': 'NOUN',
'word': 'کتاب'},
3: {'address': 3,
'ctag': 'VERB',
'deps': defaultdict(<class 'list'>, {'SBJ': [1], 'OBJ': [2]}),
'feats': '_',
'head': 0,
'lemma': 'خواند#خوان',
'rel': 'root',
'tag': 'VERB',
'word': 'می\u200cخوانیم'}})
"""
Documentation
Visit https://roshan-ai.ir/hazm to view the full documentation.
Evaluation
| Module name | Accuracy |
|---|---|
| DependencyParser | 85.6% |
| POSTagger | 98.8% |
| Chunker | 93.4% |
| Lemmatizer | 89.9% |
| Module name | Metric | Value |
|---|---|---|
| SpacyPOSTagger | Precision | 0.99250 |
| | Recall | 0.99249 |
| | F1-Score | 0.99249 |
| EZ Detection in SpacyPOSTagger | Precision | 0.99301 |
| | Recall | 0.99297 |
| | F1-Score | 0.99298 |
| SpacyChunker | Accuracy | 96.53% |
| | F-Measure | 95.00% |
| | Recall | 95.17% |
| | Precision | 94.83% |
| SpacyDependencyParser | TOK Accuracy | 99.06 |
| | UAS | 92.30 |
| | LAS | 89.15 |
| | SENT Precision | 98.84 |
| | SENT Recall | 99.38 |
| | SENT F-Measure | 99.11 |
Code contributors