Persian NLP Toolkit

Hazm

Python library for digesting Persian text.

  • Text cleaning
  • Sentence and word tokenizer
  • Word lemmatizer
  • POS tagger
  • Shallow parser
  • Dependency parser
  • Interfaces for Persian corpora
  • NLTK compatible
  • Python 3.8, 3.9, 3.10 and 3.11 support

Documentation

Visit https://roshan-ai.ir/hazm/docs to view the full documentation.

Modules accuracy

Module name            Accuracy   Pre-trained model
Lemmatizer             89.9%      -
Chunker                93.4%      download available
POSTagger              97.2%      download available
POSTagger (Universal)  98.8%      download available
DependencyParser       97.1%      download available

Installation

The latest stable version of Hazm can be installed through pip:

pip install hazm

To test or use Hazm with the latest updates, install it from the master branch instead:

pip install https://github.com/roshan-research/hazm/archive/master.zip --upgrade
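
After either install command, it can help to confirm which version ended up in the current environment. The following is a minimal sketch that uses only the Python standard library; the helper name `installed_version` is illustrative and is not part of Hazm.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "hazm") -> Optional[str]:
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        # The package is not installed in this environment.
        return None


if __name__ == "__main__":
    found = installed_version()
    print(f"hazm {found}" if found else "hazm is not installed")
```

Returning `None` rather than raising keeps the check usable in setup scripts that want to decide whether to run `pip install` at all.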

Usage

>>> from hazm import *

>>> normalizer = Normalizer()
>>> normalizer.normalize('اصلاح نويسه ها و استفاده از نیم‌فاصله پردازش را آسان مي كند')
'اصلاح نویسه‌ها و استفاده از نیم‌فاصله پردازش را آسان می‌کند'

>>> sent_tokenize('ما هم برای وصل کردن آمدیم! ولی برای پردازش، جدا بهتر نیست؟')
['ما هم برای وصل کردن آمدیم!', 'ولی برای پردازش، جدا بهتر نیست؟']
>>> word_tokenize('ولی برای پردازش، جدا بهتر نیست؟')
['ولی', 'برای', 'پردازش', '،', 'جدا', 'بهتر', 'نیست', '؟']

>>> stemmer = Stemmer()
>>> stemmer.stem('کتاب‌ها')
'کتاب'
>>> lemmatizer = Lemmatizer()
>>> lemmatizer.lemmatize('می‌روم')
'رفت#رو'

>>> tagger = POSTagger(model='resources/pos_tagger.model')
>>> tagger.tag(word_tokenize('ما بسیار کتاب می‌خوانیم'))
[('ما', 'PRO'), ('بسیار', 'ADV'), ('کتاب', 'N'), ('می‌خوانیم', 'V')]

>>> chunker = Chunker(model='resources/chunker.model')
>>> tagged = tagger.tag(word_tokenize('کتاب خواندن را دوست داریم'))
>>> tree2brackets(chunker.parse(tagged))
'[کتاب خواندن NP] [را POSTP] [دوست داریم VP]'

>>> parser = DependencyParser(tagger=tagger, lemmatizer=lemmatizer)
>>> parser.parse(word_tokenize('زنگ‌ها برای که به صدا درمی‌آید؟'))
<DependencyGraph with 8 nodes>
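
The results in the session above are ordinary Python strings and tuples, so they can be post-processed without further Hazm calls. The sketch below assumes two things taken from the sample output: that the lemmatizer returns verb lemmas as two stems joined by `#` (as in 'رفت#رو'), and that `tree2brackets` emits groups of the form `[words LABEL]`. The helper names `split_verb_lemma` and `parse_brackets` are hypothetical and not part of Hazm; the input strings are copied from the session, so the block runs without Hazm or any model file.

```python
import re
from typing import List, Optional, Tuple


def split_verb_lemma(lemma: str) -> Tuple[str, Optional[str]]:
    """Split a lemma of the form 'stem1#stem2' into its two stems.

    A lemma without '#' is returned unchanged, with None as the second element.
    """
    if "#" in lemma:
        first, second = lemma.split("#", 1)
        return first, second
    return lemma, None


# One bracketed chunk: '[', the words, a space, a label without spaces, ']'.
_CHUNK = re.compile(r"\[([^\[\]]+) (\S+)\]")


def parse_brackets(text: str) -> List[Tuple[str, str]]:
    """Turn '[w1 w2 LABEL] [w3 LABEL]' into [(phrase, label), ...].

    Only bracketed chunks are returned; tokens outside brackets are ignored.
    """
    return [(m.group(1), m.group(2)) for m in _CHUNK.finditer(text)]


if __name__ == "__main__":
    print(split_verb_lemma("رفت#رو"))
    print(parse_brackets("[کتاب خواندن NP] [را POSTP] [دوست داریم VP]"))
```

Keeping these helpers separate from the Hazm calls means they can be unit-tested on saved output, which is convenient when the pre-trained model files are not available.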

Extensions

Note: These are not official versions of Hazm. They may not be up to date with its functionality and are not supported by Roshan.

  • JHazm: A Java port of Hazm
  • NHazm: A C# port of Hazm

Contribution

We welcome contributions to this repo, including bug reports, feature requests, code improvements, and documentation updates. Please follow the Contribution guideline when contributing: open an issue, fork the repo, write your code, create a pull request, and wait for review and feedback. Thank you for your interest in and support of this repo!

Thanks

Code contributors

Others

  • Thanks to the Virastyar project for providing the Persian word list.
