
Hazm - Persian NLP Toolkit


Accuracy:

  • DependencyParser: 85.6%
  • POSTagger: 98.8%
  • Chunker: 93.4%
  • Lemmatizer: 89.9%

Introduction

Hazm is a Python library for performing natural language processing tasks on Persian text. It offers various features for analyzing, processing, and understanding Persian text. You can use Hazm to normalize text, tokenize sentences and words, lemmatize words, assign part-of-speech tags, identify dependency relations, create word and sentence embeddings, and read popular Persian corpora.

Features

  • Normalization: Converts text to a standard form, e.g. by removing diacritics and correcting spacing.
  • Tokenization: Splits text into sentences and words.
  • Lemmatization: Reduces words to their base forms.
  • POS tagging: Assigns a part of speech to each word.
  • Dependency parsing: Identifies the syntactic relations between words.
  • Embedding: Creates vector representations of words and sentences.
  • Persian corpora reading: Easily read popular Persian corpora with ready-made scripts and minimal code.

Installation

To install the latest version of Hazm, run the following command in your terminal:

pip install hazm

Alternatively, you can install the latest development version from GitHub (it may be unstable and buggy):

pip install git+https://github.com/roshan-research/hazm.git
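To sanity-check the installation, here is a minimal sketch that only verifies the package can be located on the import path; it makes no assumptions beyond the package name:

```python
import importlib.util

# find_spec returns None when the package is not installed
spec = importlib.util.find_spec("hazm")
if spec is None:
    print("hazm is not installed; run `pip install hazm` first")
else:
    print("hazm found at", spec.origin)
```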

Pretrained Models

Finally, if you want to use our pretrained models, you can download them from the links below:

  • WordEmbedding (~ 5 GB)
  • SentEmbedding (~ 1 GB)
  • POSTagger (~ 18 MB)
  • UniversalDependencyParser (~ 15 MB)
  • DependencyParser (~ 13 MB)
  • Chunker (~ 4 MB)

Usage

>>> from hazm import *

>>> normalizer = Normalizer()
>>> normalizer.normalize('اصلاح نويسه ها و استفاده از نیم‌فاصله پردازش را آسان مي كند')
'اصلاح نویسه‌ها و استفاده از نیم‌فاصله پردازش را آسان می‌کند'

>>> sent_tokenize('ما هم برای وصل کردن آمدیم! ولی برای پردازش، جدا بهتر نیست؟')
['ما هم برای وصل کردن آمدیم!', 'ولی برای پردازش، جدا بهتر نیست؟']
>>> word_tokenize('ولی برای پردازش، جدا بهتر نیست؟')
['ولی', 'برای', 'پردازش', '،', 'جدا', 'بهتر', 'نیست', '؟']

>>> stemmer = Stemmer()
>>> stemmer.stem('کتاب‌ها')
'کتاب'
>>> lemmatizer = Lemmatizer()
>>> lemmatizer.lemmatize('می‌روم')
'رفت#رو'

For verbs, the lemma is returned as a past-stem#present-stem pair: here رفت (past stem) and رو (present stem).

>>> tagger = POSTagger(model='pos_tagger.model')
>>> tagger.tag(word_tokenize('ما بسیار کتاب می‌خوانیم'))
[('ما', 'PRO'), ('بسیار', 'ADV'), ('کتاب', 'N'), ('می‌خوانیم', 'V')]

>>> chunker = Chunker(model='chunker.model')
>>> tagged = tagger.tag(word_tokenize('کتاب خواندن را دوست داریم'))
>>> tree2brackets(chunker.parse(tagged))
'[کتاب خواندن NP] [را POSTP] [دوست داریم VP]'

>>> word_embedding = WordEmbedding(model_type='fasttext', model_path='word2vec.bin')
>>> word_embedding.doesnt_match(['سلام', 'درود', 'خداحافظ', 'پنجره'])
'پنجره'
>>> word_embedding.doesnt_match(['ساعت', 'پلنگ', 'شیر'])
'ساعت'

>>> parser = DependencyParser(tagger=tagger, lemmatizer=lemmatizer)
>>> parser.parse(word_tokenize('زنگ‌ها برای که به صدا درمی‌آید؟'))
<DependencyGraph with 8 nodes>

Documentation

Visit https://roshan-ai.ir/hazm/docs to view the full documentation.

Hazm in other languages

Disclaimer: These ports are not developed or maintained by Roshan, and they may not offer the same functionality or quality as the original Hazm.

  • JHazm: A Java port of Hazm
  • NHazm: A C# port of Hazm

Contribution

We welcome and appreciate any contributions to this repo, such as bug reports, feature requests, code improvements, and documentation updates. Please follow the contribution guidelines when contributing: open an issue, fork the repo, write your code, create a pull request, and wait for review and feedback. Thank you for your interest and support!

Thanks

Code contributors


Others

  • Thanks to the Virastyar project for providing the Persian word list.


Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hazm-0.9.2.tar.gz (338.5 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

hazm-0.9.2-py3-none-any.whl (352.8 kB)

Uploaded Python 3

File details

Details for the file hazm-0.9.2.tar.gz.

File metadata

  • Download URL: hazm-0.9.2.tar.gz
  • Upload date:
  • Size: 338.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.1 CPython/3.10.6 Linux/5.15.0-1041-azure

File hashes

Hashes for hazm-0.9.2.tar.gz
Algorithm Hash digest
SHA256 a5c9c1bab6c042eecab58a473ee237b62fd674e985023610e2a12acae859f56e
MD5 2fa9a598cfbbc7103e3e0214e9b0e45d
BLAKE2b-256 e41d0d2ee71aacd7bcf074b755a704ad05f1ad9c414e96e605596d6c1514e7cd
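
As a sketch using only the standard library, you can check a downloaded archive against the published SHA256 digest above (the filename and digest are copied from this listing; adapt the path to where you saved the file):

```python
import hashlib
import os

def sha256_of(path, chunk_size=1 << 16):
    """Stream a file through SHA-256 and return its hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Published digest for hazm-0.9.2.tar.gz (from the table above)
EXPECTED = "a5c9c1bab6c042eecab58a473ee237b62fd674e985023610e2a12acae859f56e"

if os.path.exists("hazm-0.9.2.tar.gz"):
    assert sha256_of("hazm-0.9.2.tar.gz") == EXPECTED, "checksum mismatch!"
```

Streaming in chunks keeps memory use constant, which matters for large downloads such as the ~5 GB WordEmbedding model.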

See the PyPI documentation for more details on using hashes.

File details

Details for the file hazm-0.9.2-py3-none-any.whl.

File metadata

  • Download URL: hazm-0.9.2-py3-none-any.whl
  • Upload date:
  • Size: 352.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.1 CPython/3.10.6 Linux/5.15.0-1041-azure

File hashes

Hashes for hazm-0.9.2-py3-none-any.whl
Algorithm Hash digest
SHA256 89b51b80a8940e00d410813763f93600c8fbbb16c97d5bf70e3de67754c2b7c5
MD5 7c79229e08e0317a30cdd8e6d8e64aa1
BLAKE2b-256 4fa8909a6596229fd7fe1155633027a237310ebb61d82b6327b3fa8a5941947a

See the PyPI documentation for more details on using hashes.
