DadmaTools is a Persian NLP toolkit

DadmaTools: A Python NLP Library for Persian

Named Entity Recognition | Part of Speech Tagging | Dependency Parsing

Constituency Parsing | Chunking

Tokenizer | Lemmatizer

DadmaTools

DadmaTools is a repository of Natural Language Processing resources for the Persian language. The aim is to make Persian NLP easier for industry practitioners to use, and the project is therefore licensed to allow commercial use. The project features code examples showing how to use the models in popular NLP frameworks such as spaCy and Transformers, as well as deep learning frameworks such as PyTorch. DadmaTools also supports common Persian embeddings and Persian datasets. For more details on how to use this tool, read the instructions below.

NLP Models

Natural Language Processing is an active area of research, and it consists of many different tasks. The DadmaTools repository provides an overview of Persian models for some of the most basic NLP tasks (and is continuously evolving).

Here is the list of NLP tasks currently covered in the repository. These tasks are defined as pipelines, so a pipeline list must be created and passed to the model. This lets you load only the tasks you need without loading the others. Each task has an abbreviation, as follows:

Named Entity Recognition: ner
Part of speech tagging: pos
Dependency parsing: dep
Constituency parsing: cons
Chunking: chunk
Lemmatizing: lem
Tokenizing: tok
Normalizing

Note that the normalizer can also be used outside the pipeline, since it has several configuration options (inside the pipeline, the default configuration is available under the name def-norm). If no pipeline is passed to the model, the tokenizer is loaded by default.
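The pipeline string described above is a comma-separated list of the task abbreviations. As a minimal sketch, the helper below builds such a string and rejects names that are not in the list. `build_pipeline_string` and `KNOWN_TASKS` are hypothetical names introduced here for illustration and are not part of DadmaTools; whether `def-norm` can be combined freely with the other names is an assumption based on the note above.

```python
# Task abbreviations as listed above; "def-norm" is the default normalizer
# config name mentioned in the note (assumed to be a valid pipeline entry).
KNOWN_TASKS = ("tok", "lem", "pos", "dep", "cons", "chunk", "ner", "def-norm")


def build_pipeline_string(tasks):
    """Join task abbreviations into a comma-separated pipeline string.

    Hypothetical helper, not part of DadmaTools: it rejects unknown names
    and drops duplicates while keeping the caller's order.
    """
    unknown = [task for task in tasks if task not in KNOWN_TASKS]
    if unknown:
        raise ValueError("unknown pipeline task(s): " + ", ".join(unknown))
    # dict.fromkeys keeps the first occurrence of each name, in order.
    return ",".join(dict.fromkeys(tasks))


pips = build_pipeline_string(["lem", "pos", "lem"])
print(pips)  # lem,pos
```

The resulting string has the same shape as the one passed to the pipeline constructor in the use case further down.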

Use Case

Normalizer

Cleans text and unifies characters.

Note: setting an option to None means no action is taken for it.

from dadmatools.models.normalizer import Normalizer

normalizer = Normalizer(
    full_cleaning=False,
    unify_chars=True,
    refine_punc_spacing=True,
    remove_extra_space=True,
    remove_puncs=False,
    remove_html=False,
    remove_stop_word=False,
    replace_email_with="<EMAIL>",
    replace_number_with=None,
    replace_url_with="",
    replace_mobile_number_with=None,
    replace_emoji_with=None,
    replace_home_number_with=None
)

text = """
<p>
دادماتولز اولین نسخش سال ۱۴۰۰ منتشر شده. 
امیدواریم که این تولز بتونه کار با متن رو براتون شیرین‌تر و راحت‌تر کنه
لطفا با ایمیل dadmatools@dadmatech.ir با ما در ارتباط باشید
آدرس گیت‌هاب هم که خب معرف حضور مبارک هست:
 https://github.com/Dadmatech/DadmaTools
</p>
"""
normalized_text = normalizer.normalize(text)
#<p> دادماتولز اولین نسخش سال 1400 منتشر شده. امیدواریم که این تولز بتونه کار با متن رو براتون شیرین‌تر و راحت‌تر کنه لطفا با ایمیل <EMAIL> با ما در ارتباط باشید آدرس گیت‌هاب هم که خب معرف حضور مبارک هست: </p>

#full cleaning
normalizer = Normalizer(full_cleaning=True)
normalized_text = normalizer.normalize(text)
#دادماتولز نسخش سال منتشر تولز بتونه کار متن براتون شیرین‌تر راحت‌تر کنه ایمیل ارتباط آدرس گیت‌هاب معرف حضور مبارک
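To make the idea of character unification more concrete, here is a small standalone sketch. It is not DadmaTools' implementation: `simple_normalize`, `CHAR_MAP` and `EMAIL_RE` are hypothetical names, and the mapping covers only two Arabic letter variants and the digits, which is an assumption about what "unify" involves based on the sample output above (Persian digits become ASCII digits, the email address becomes a placeholder).

```python
import re

# Arabic letter variants mapped to their Persian forms, and Persian and
# Arabic-Indic digits mapped to ASCII digits.
CHAR_MAP = str.maketrans({
    "\u064a": "\u06cc",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06a9",  # Arabic kaf -> Persian kaf
    **{chr(0x06F0 + i): str(i) for i in range(10)},  # Persian digits
    **{chr(0x0660 + i): str(i) for i in range(10)},  # Arabic-Indic digits
})

# Deliberately simple ASCII-only email pattern, good enough for a sketch.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", re.ASCII)


def simple_normalize(text, replace_email_with=None):
    """Unify characters, optionally replace emails, collapse whitespace."""
    text = text.translate(CHAR_MAP)
    if replace_email_with is not None:  # None means no action
        text = EMAIL_RE.sub(replace_email_with, text)
    return re.sub(r"\s+", " ", text).strip()


sample = "mail  a.b@example.com\n\u06f1\u06f4\u06f0\u06f0"
print(simple_normalize(sample, replace_email_with="<EMAIL>"))
# mail <EMAIL> 1400
```

As in the `Normalizer` options above, passing `None` leaves emails untouched, while an empty string would remove them.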

Pipeline - Tokenizer, Lemmatizer, POS Tagger, Dependency Parser, Constituency Parser

import dadmatools.pipeline.language as language

# here the tokenizer, lemmatizer, POS tagger, dependency parser, chunker and constituency parser will be loaded
# as the tokenizer is the default tool, it is loaded even if it is not listed
pips = 'tok,lem,pos,dep,chunk,cons'
nlp = language.Pipeline(pips)

# you can see the pipeline with this code
print(nlp.analyze_pipes(pretty=True))

# doc is a spaCy Doc object
doc = nlp('از قصهٔ کودکیشان که می‌گفت، گاهی حرص می‌خورد!')

The doc object has several extensions. First, doc contains sentences, which is a list of lists of Token objects. Each Token also has its own extensions. Note that DadmaTools defines its own extensions as well. If the pipeline that fills a given extension is not called, that extension will have no value.

To inspect the results more easily, you can use this code:

dictionary = language.to_json(pips, doc)
print(dictionary)

[[{'id': 1, 'text': 'از', 'lemma': 'از', 'pos': 'ADP', 'rel': 'case', 'root': 2}, {'id': 2, 'text': 'قصهٔ', 'lemma': 'قصه', 'pos': 'NOUN', 'rel': 'obl', 'root': 10}, {'id': 3, 'text': 'کودکی', 'lemma': 'کودکی', 'pos': 'NOUN', 'rel': 'nmod', 'root': 2}, {'id': 4, 'text': 'شان', 'lemma': 'آنها', 'pos': 'PRON', 'rel': 'nmod', 'root': 3}, {'id': 5, 'text': 'که', 'lemma': 'که', 'pos': 'SCONJ', 'rel': 'mark', 'root': 6}, {'id': 6, 'text': 'می\u200cگفت', 'lemma': 'گفت#گو', 'pos': 'VERB', 'rel': 'acl', 'root': 2}, {'id': 7, 'text': '،', 'lemma': '،', 'pos': 'PUNCT', 'rel': 'punct', 'root': 6}, {'id': 8, 'text': 'گاهی', 'lemma': 'گاه', 'pos': 'NOUN', 'rel': 'obl', 'root': 10}, {'id': 9, 'text': 'حرص', 'lemma': 'حرص', 'pos': 'NOUN', 'rel': 'compound:lvc', 'root': 10}, {'id': 10, 'text': 'می\u200cخورد', 'lemma': 'خورد#خور', 'pos': 'VERB', 'rel': 'root', 'root': 0}, {'id': 11, 'text': '!', 'lemma': '!', 'pos': 'PUNCT', 'rel': 'punct', 'root': 10}]]
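The nested list printed above can be processed with plain Python. The sketch below turns each token into a (token, relation, head) triple. It assumes that the `root` key holds the `id` of the token's head and that `0` marks the sentence root, which is how the sample output reads; `dependency_edges` is a hypothetical helper, and `sample` is a shortened copy of the output above.

```python
# A shortened copy of the output above: three tokens plus the root verb.
sample = [[
    {"id": 1, "text": "از", "lemma": "از", "pos": "ADP", "rel": "case", "root": 2},
    {"id": 2, "text": "قصهٔ", "lemma": "قصه", "pos": "NOUN", "rel": "obl", "root": 10},
    {"id": 3, "text": "کودکی", "lemma": "کودکی", "pos": "NOUN", "rel": "nmod", "root": 2},
    {"id": 10, "text": "می\u200cخورد", "lemma": "خورد#خور", "pos": "VERB", "rel": "root", "root": 0},
]]


def dependency_edges(sentence):
    """Return (token text, relation, head text) triples for one sentence."""
    by_id = {token["id"]: token for token in sentence}
    edges = []
    for token in sentence:
        head = by_id.get(token["root"])  # None when the head id is 0 or missing
        head_text = head["text"] if head is not None else "ROOT"
        edges.append((token["text"], token["rel"], head_text))
    return edges


for edge in dependency_edges(sample[0]):
    print(edge)
```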

sentences = doc._.sentences
for sentence in sentences:
    text = sentence.text
    for token in sentence:
        token_text = token.text
        lemma = token.lemma_ ## this has value only if lem is called
        pos_tag = token.pos_ ## this has value only if pos is called
        dep = token.dep_ ## this has value only if dep is called
        dep_arc = token._.dep_arc ## this has value only if dep is called
sent_constituency = doc._.constituency ## this has value only if cons is called
sent_chunks = doc._.chunks ## this has value only if chunk is called
ners = doc._.ners ## this has value only if ner is called

Note that _.constituency and _.chunks are objects of a SuPar class.

Loading Persian NLP Datasets

We provide an easy-to-use way to load some popular Persian NLP datasets.

Here is the list of supported datasets.

Dataset	Task
PersianNER	Named Entity Recognition
ARMAN	Named Entity Recognition
Peyma	Named Entity Recognition
FarsTail	Textual Entailment
FaSpell	Spell Checking
PersianNews	Text Classification
PerUDT	Universal Dependency
PnSummary	Text Summarization
SnappfoodSentiment	Sentiment Classification
TEP	Text Translation (eng-fa)
WikipediaCorpus	Corpus
PersianTweets	Corpus

All datasets are iterators and can be used as shown below:

from dadmatools.datasets import FarsTail
from dadmatools.datasets import SnappfoodSentiment
from dadmatools.datasets import Peyma
from dadmatools.datasets import PerUDT
from dadmatools.datasets import PersianTweets
from dadmatools.datasets import PnSummary


farstail = FarsTail()
#len of dataset
print(len(farstail.train))

#like a generator
print(next(farstail.train))

#dataset details
pn_summary = PnSummary()
print('PnSummary dataset information: ', pn_summary.info)

#loop over dataset
snpfood_sa = SnappfoodSentiment()
for i, item in enumerate(snpfood_sa.test):
    print(item['comment'], item['label'])

#get first tokens' lemma of all dev items
perudt = PerUDT()
for token_list in perudt.dev:
    print(token_list[0]['lemma'])

#get NER tag of first Peyma's data
peyma = Peyma()
print(next(peyma.data)[0]['tag'])

#corpus 
tweets = PersianTweets()
print('tweets count : ', len(tweets.data))
print('sample tweet: ', next(tweets.data))
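Because each split is an iterator of items, ordinary Python tools work on it directly. As a rough sketch, the snippet below counts labels over a split. It uses a small stand-in iterator instead of a downloaded dataset: the `comment` and `label` keys follow the Snappfood loop above, while the sample comments, the label values and the `label_distribution` helper are made up for illustration.

```python
from collections import Counter


def label_distribution(items, label_key="label"):
    """Count how often each label occurs in an iterable of dict items."""
    return Counter(item[label_key] for item in items)


# Stand-in for a split such as snpfood_sa.test; the real items and label
# values may differ.
stand_in_split = iter([
    {"comment": "first comment", "label": "positive"},
    {"comment": "second comment", "label": "negative"},
    {"comment": "third comment", "label": "positive"},
])

distribution = label_distribution(stand_in_split)
print(distribution.most_common())
# [('positive', 2), ('negative', 1)]
```

Note that an iterator is consumed by a full pass, so a second call on the same `stand_in_split` would count nothing.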

Get dataset info:

from dadmatools.datasets import get_all_datasets_info

get_all_datasets_info().keys()
#dict_keys(['Persian-NEWS', 'fa-wiki', 'faspell', 'PnSummary', 'TEP', 'PerUDT', 'FarsTail', 'Peyma', 'snappfoodSentiment', 'Persian-NER', 'Arman', 'PerSent'])

#specify task
get_all_datasets_info(tasks=['NER', 'Sentiment-Analysis'])

The output will be:

{"ARMAN": {"description": "ARMAN dataset holds 7,682 sentences with 250,015 sentences tagged over six different classes.\n\nOrganization\nLocation\nFacility\nEvent\nProduct\nPerson",
  "filenames": ["train_fold1.txt",
   "train_fold2.txt",
   "train_fold3.txt",
   "test_fold1.txt",
   "test_fold2.txt",
   "test_fold3.txt"],
  "name": "ARMAN",
  "size": {"test": 7680, "train": 15361},
  "splits": ["train", "test"],
  "task": "NER",
  "version": "1.0.0"},
 "PersianNer": {"description": "source: https://github.com/Text-Mining/Persian-NER",
  "filenames": ["Persian-NER-part1.txt",
   "Persian-NER-part2.txt",
   "Persian-NER-part3.txt",
   "Persian-NER-part4.txt",
   "Persian-NER-part5.txt"],
  "name": "PersianNer",
  "size": 976599,
  "splits": [],
  "task": "NER",
  "version": "1.0.0"},
 "Peyma": {"description": "source: http://nsurl.org/2019-2/tasks/task-7-named-entity-recognition-ner-for-farsi/",
  "filenames": ["peyma/600K", "peyma/300K"],
  "name": "Peyma",
  "size": 10016,
  "splits": [],
  "task": "NER",
  "version": "1.0.0"},
 "snappfoodSentiment": {"description": "source: https://huggingface.co/HooshvareLab/bert-fa-base-uncased-sentiment-snappfood",
  "filenames": ["snappfood/train.csv",
   "snappfood/test.csv",
   "snappfood/dev.csv"],
  "name": "snappfoodSentiment",
  "size": {"dev": 6274, "test": 6972, "train": 56516},
  "splits": ["train", "test", "dev"],
  "task": "Sentiment-Analysis",
  "version": "1.0.0"}}
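The returned structure is a plain dictionary, so it can be filtered and summarized without any DadmaTools-specific code. The sketch below works on a trimmed copy of the output above; note that `size` is either a single number or a per-split dictionary, so the helper handles both. `total_size` and `names_for_task` are hypothetical helpers, not library functions.

```python
# Trimmed copy of the output above, keeping only the fields used here.
infos = {
    "ARMAN": {"task": "NER", "size": {"test": 7680, "train": 15361}},
    "Peyma": {"task": "NER", "size": 10016},
    "snappfoodSentiment": {
        "task": "Sentiment-Analysis",
        "size": {"dev": 6274, "test": 6972, "train": 56516},
    },
}


def total_size(info):
    """Return the item count, summing over splits when size is a dict."""
    size = info["size"]
    return sum(size.values()) if isinstance(size, dict) else size


def names_for_task(all_infos, task):
    """Return the sorted dataset names whose task matches."""
    return sorted(name for name, info in all_infos.items() if info["task"] == task)


print(names_for_task(infos, "NER"))            # ['ARMAN', 'Peyma']
print(total_size(infos["ARMAN"]))              # 23041
print(total_size(infos["snappfoodSentiment"]))  # 69762
```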

Loading Persian Word Embeddings

Download, load, and use some pre-trained Persian word embeddings.

DadmaTools supports the GloVe, fastText, and word2vec formats.

from dadmatools.embeddings import get_embedding, get_all_embeddings_info, get_embedding_info
from pprint import pprint

pprint(get_all_embeddings_info())

#get embedding information of specific embedding
embedding_info = get_embedding_info('glove-wiki')

#### load embedding ####
word_embedding = get_embedding('glove-wiki')

#get vector of the word
print(word_embedding['سلام'])

#vocab
vocab = word_embedding.get_vocab()

### some useful functions ###
print(word_embedding.top_nearest("زمستان", 10))
print(word_embedding.similarity('کتب', 'کتاب'))
print(word_embedding.embedding_text('امروز هوای خوبی بود'))

The following word embeddings are currently supported:

Name	Embedding Algorithm	Corpus
`glove-wiki`	glove	Wikipedia
`fasttext-commoncrawl-bin`	fasttext	CommonCrawl
`fasttext-commoncrawl-vec`	fasttext	CommonCrawl
`word2vec-conll`	word2vec	Persian CoNLL17 corpus
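Word-similarity and nearest-neighbour queries on embeddings are commonly based on cosine similarity. The sketch below shows that computation on toy vectors. Whether DadmaTools' `similarity` and `top_nearest` use exactly this measure is an assumption, and `cosine_similarity`, `nearest` and the toy vectors are hypothetical names and data, not part of the library.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def nearest(word, vectors, k):
    """Return the k words most similar to `word`, best first."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy two-dimensional vectors; real embeddings have hundreds of dimensions.
toy_vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
print(nearest("a", toy_vectors, 2))
```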

Evaluation

We have compared our POS tagging, dependency parsing, and lemmatization models to stanza and hazm.

PerDT (F1 score)
Toolkit	POS Tagger (UPOS)	Dependency Parser (UAS/LAS)	Lemmatizer
DadmaTools	97.52%	95.36% / 92.54%	99.14%
stanza	97.35%	93.34% / 91.05%	98.97%
hazm	-	-	89.01%

Seraji (F1 score)
Toolkit	POS Tagger (UPOS)	Dependency Parser (UAS/LAS)	Lemmatizer
DadmaTools	97.83%	92.5% / 89.23%	-
stanza	97.43%	87.20% / 83.89%	-
hazm	-	-	86.93%

Tehran university tree bank (F1 score)
Toolkit	Constituency Parser
DadmaTools (without preprocess)	82.88%
Stanford (with some preprocess on POS tags)	80.28%
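For readers unfamiliar with the dependency metrics in the tables, UAS (unlabeled attachment score) is the share of tokens whose predicted head is correct, and LAS (labeled attachment score) additionally requires the relation label to match. The sketch below computes both on a made-up four-token example with identical gold and predicted tokenization. It only illustrates the definitions; how the figures above were actually computed is not stated here, and `attachment_scores` is a hypothetical helper.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for aligned lists of (head id, relation) pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("at least one token is required")
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    labeled_hits = sum(g == p for g, p in zip(gold, predicted))
    return head_hits / len(gold), labeled_hits / len(gold)


# Made-up example: token 2 gets the wrong label, token 3 the wrong head.
gold = [(2, "case"), (10, "obl"), (2, "nmod"), (0, "root")]
predicted = [(2, "case"), (10, "nmod"), (3, "nmod"), (0, "root")]

uas, las = attachment_scores(gold, predicted)
print(f"UAS={uas:.2%} LAS={las:.2%}")  # UAS=75.00% LAS=50.00%
```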

Installation

To get started using DadmaTools in your Python project, install it with pip. Note that the default pip package does not install every NLP library, so that you can limit the dependencies to what you actually use. An installation option is provided if you want to install all the required dependencies.

Install with pip

To get started using DadmaTools, simply install the project with pip:

pip install dadmatools

Note that the default installation of DadmaTools does install other NLP libraries such as spaCy and SuPar.

You can check the requirements.txt file to see which versions the packages have been tested with.

Install from GitHub

Alternatively, you can install the latest version from GitHub using:

pip install git+https://github.com/Dadmatech/dadmatools.git
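After installing, it can be useful to confirm which version ended up in the environment. The sketch below uses only the Python standard library (`importlib.metadata`, available from Python 3.8) and does not import DadmaTools itself; `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("dadmatools")
if version is None:
    print("dadmatools is not installed in this environment")
else:
    print("dadmatools version:", version)
```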

How to use (Colab)

You can see the code and its output here.

Cite

Will be added in the future.
