
Project description

nmatheg

nmatheg (نماذج, "models") is an easy strategy for training Arabic NLP models on Hugging Face datasets. Just specify the dataset name, preprocessing, tokenization, and training procedure in a config file to train an NLP model for that task.

Install

pip install nmatheg

Configuration

Set up a config file for the training strategy.

[dataset]
dataset_name = ajgt_twitter_ar

[preprocessing]
segment = False
remove_special_chars = False
remove_english = False
normalize = False
remove_diacritics = False
excluded_chars = []
remove_tatweel = False
remove_html_elements = False
remove_links = False 
remove_twitter_meta = False
remove_long_words = False
remove_repeated_chars = False

[tokenization]
tokenizer_name = WordTokenizer
vocab_size = 1000
max_tokens = 128

[model]
model_name = rnn

[log]
print_every = 10

[train]
save_dir = .
epochs = 10
batch_size = 256 
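The config file above follows standard INI syntax, so its values can be read with Python's built-in configparser. A minimal sketch (nmatheg's actual loader may differ; the config text here is an abridged copy of the file above):

```python
# Minimal sketch: parsing the training-strategy config with the standard
# library's configparser. This is illustrative, not nmatheg's internal code.
import configparser

CONFIG = """
[dataset]
dataset_name = ajgt_twitter_ar

[tokenization]
tokenizer_name = WordTokenizer
vocab_size = 1000
max_tokens = 128

[train]
save_dir = .
epochs = 10
batch_size = 256
"""

config = configparser.ConfigParser()
config.read_string(CONFIG)

dataset_name = config["dataset"]["dataset_name"]
epochs = config["train"].getint("epochs")        # typed access to integers
batch_size = config["train"].getint("batch_size")
print(dataset_name, epochs, batch_size)  # ajgt_twitter_ar 10 256
```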

Main Sections

  • dataset describes the dataset and the task type. See the Tasks section below for the supported task types.
  • preprocessing a set of cleaning functions, mainly using our library tnkeeh.
  • tokenization describes the tokenizer used for encoding the dataset. It uses our library tkseem.
  • train the training parameters, such as the number of epochs and the batch size.
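The preprocessing flags in the config are boolean cleaning switches. As an illustrative sketch of what two of them do (the function names here are hypothetical; tnkeeh's actual implementation may differ):

```python
# Sketch of two preprocessing flags: remove_tatweel and remove_diacritics.
# Hypothetical helpers for illustration, not tnkeeh's API.
import re

TATWEEL = "\u0640"                            # Arabic elongation character (ـ)
DIACRITICS = re.compile(r"[\u064B-\u0652]")   # fathatan .. sukun

def remove_tatweel(text: str) -> str:
    """Drop the tatweel character used to stretch words, e.g. كتـــاب -> كتاب."""
    return text.replace(TATWEEL, "")

def remove_diacritics(text: str) -> str:
    """Strip short-vowel marks (harakat) from the text."""
    return DIACRITICS.sub("", text)
```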

Usage

Config Files

import nmatheg as nm
strategy = nm.TrainStrategy('config.ini')
strategy.start()

Benchmarking on multiple datasets and models

import nmatheg as nm
strategy = nm.TrainStrategy(
    datasets = 'arsentd_lev,arcd,caner', 
    models   = 'qarib/bert-base-qarib,aubmindlab/bert-base-arabertv01'
)
strategy.start()
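The comma-separated datasets and models arguments expand into a dataset × model grid of runs. A sketch of that expansion (the real TrainStrategy logic may differ):

```python
# Sketch: expanding the comma-separated benchmark arguments into a
# dataset x model grid. Illustrative only, not nmatheg's internal code.
from itertools import product

datasets = "arsentd_lev,arcd,caner".split(",")
models = "qarib/bert-base-qarib,aubmindlab/bert-base-arabertv01".split(",")

# One training run per (dataset, model) pair.
runs = list(product(datasets, models))
print(len(runs))  # 3 datasets x 2 models = 6 runs
```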

Datasets

We support Hugging Face datasets for Arabic. The supported datasets are listed below.

Dataset Description
  • ajgt_twitter_ar The Arabic Jordanian General Tweets (AJGT) Corpus consists of 1,800 tweets annotated as positive or negative, written in Modern Standard Arabic (MSA) or Jordanian dialect.
  • metrec The dataset contains verses and their corresponding meter classes, represented as numbers from 0 to 13. It is useful for research on Arabic poem meter classification. The train split contains 47,124 records and the test split contains 8,316 records.
  • labr This dataset contains over 63,000 book reviews in Arabic, the largest sentiment analysis dataset for Arabic to date. The reviews were harvested from Goodreads during the month of March 2013. Each review comes with the Goodreads review id, the user id, the book id, the rating (1 to 5), and the review text.
  • ar_res_reviews A dataset of 8,364 Arabic restaurant reviews from qaym.com for sentiment analysis.
  • arsentd_lev The Arabic Sentiment Twitter Dataset for Levantine dialect (ArSenTD-LEV) contains 4,000 tweets written in Arabic, retrieved in equal proportions from Jordan, Lebanon, Palestine, and Syria.
  • oclar The OCLAR corpus (Marwan et al., 2019) gathers Arabic customer reviews from the Zomato website across a wide range of domains, including restaurants, hotels, hospitals, and local shops. The corpus contains 3,916 reviews on a 5-star rating scale. The positive class covers ratings from 3 to 5 stars (3,465 reviews), and the negative class covers ratings of 1 and 2 (451 reviews).
  • emotone_ar A dataset of 10,065 Arabic tweets for emotion detection.
  • hard This dataset contains 93,700 hotel reviews in Arabic, collected from the Booking.com website during June and July 2016. The reviews are expressed in Modern Standard Arabic as well as dialectal Arabic.
  • caner The Classical Arabic Named Entity Recognition corpus is a corpus of tagged data useful for recognizing Arabic named entities.
  • arcd The Arabic Reading Comprehension Dataset (ARCD) is composed of 1,395 questions posed by crowdworkers on Wikipedia articles.

Tasks

Currently we support text classification, named entity recognition and question answering.
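Based on the dataset descriptions above, each supported dataset maps to one of these three tasks. A hypothetical mapping for illustration (inferred from the descriptions, not nmatheg's internal registry):

```python
# Hypothetical dataset -> task mapping, inferred from the descriptions above.
DATASET_TASKS = {
    "ajgt_twitter_ar": "text classification",
    "metrec": "text classification",
    "labr": "text classification",
    "ar_res_reviews": "text classification",
    "arsentd_lev": "text classification",
    "oclar": "text classification",
    "emotone_ar": "text classification",
    "hard": "text classification",
    "caner": "named entity recognition",
    "arcd": "question answering",
}

print(DATASET_TASKS["arcd"])  # question answering
```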

Demo

Check this colab notebook for a quick demo.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

nmatheg-0.0.4.tar.gz (17.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

nmatheg-0.0.4-py3-none-any.whl (17.3 kB view details)

Uploaded Python 3

File details

Details for the file nmatheg-0.0.4.tar.gz.

File metadata

  • Download URL: nmatheg-0.0.4.tar.gz
  • Upload date:
  • Size: 17.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/49.6.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.2

File hashes

Hashes for nmatheg-0.0.4.tar.gz
Algorithm Hash digest
SHA256 163db600cdd0531de25c9f34db43df66329d8a66c9bf7ffa9140f2a8e01a9135
MD5 08fb25708dab62bb87c91afa8d87db5e
BLAKE2b-256 51a2651aec9e36445e8944645907ff1c11ffb1f21f2c7c8a8aabea93c5ff4f4f

See more details on using hashes here.

File details

Details for the file nmatheg-0.0.4-py3-none-any.whl.

File metadata

  • Download URL: nmatheg-0.0.4-py3-none-any.whl
  • Upload date:
  • Size: 17.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/49.6.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.2

File hashes

Hashes for nmatheg-0.0.4-py3-none-any.whl
Algorithm Hash digest
SHA256 edab7ca57cd5d8bcde2b2225e9bf69503704e880d15e7bdb89d3e84cb9f7d66d
MD5 d9386df02aae6b4f3686efbbd5f9c573
BLAKE2b-256 b28900a1933dc4cc8dc4216b7247e8791675222e7b283e27dd3cc23ab73ccaf6

See more details on using hashes here.
