
A CRF layer and BI-LSTM+CRF model implemented in Keras


ArabicTagger


ArabicTagger is a Python package with the following components:
1- a CRF layer implemented in Keras
2- a BI-LSTM + CRF model implemented in Keras
3- tools to build and train your own Arabic NER models from pre-existing models, with minimal code and only the tags you need


Installation

# ArabicTagger is still in its beta version   
# it's recommended to install it inside its own environment
pip install ArabicTagger

Test

import tensorflow as tf
from ArabicTagger import Tagger, NER, CRF

# Initialize the tagger and its models
tagger = Tagger()
tagger.intialize_models()

# Tokenized sentences, with one tag per token
inputs = [['السلام', 'عليكم', 'كم', 'سعر', 'الخلاط'],
          ['ما', 'هي', 'مواصفات', 'البوتجاز', 'الي', 'في', 'الصورة']]
# The device name 'الخلاط' is the last token of the first sentence
tags = [['O', 'O', 'O', 'O', 'DEVICE'],
        ['O', 'O', 'O', 'DEVICE', 'O', 'O', 'O']]

# User-defined tags (udt)
user_defined_tags = ['DEVICE']
train1, _ = tagger.get_data(inputs, 7, tags, user_defined_tags)
X1, Y1 = train1

model = NER(20, 13, 7, 300, udt=[13])
model.compile(optimizer=tf.keras.optimizers.Adam(0.05))
# The tags are passed both as part of the model input and as the target
model.fit([X1, Y1], Y1, epochs=4, batch_size=len(X1))
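
Because every token needs exactly one tag, it can help to check the annotations before passing them to get_data. The helper below is a minimal sketch in plain Python; check_annotations is a hypothetical name and is not part of ArabicTagger, and it assumes that 'O' is the only tag allowed besides the user-defined ones.

```python
def check_annotations(inputs, tags, user_defined_tags):
    """Return a list of problems; an empty list means the data looks consistent."""
    problems = []
    # Assumption: 'O' plus the user-defined tags are the only valid labels.
    allowed = set(user_defined_tags) | {"O"}
    if len(inputs) != len(tags):
        problems.append(f"{len(inputs)} sentences but {len(tags)} tag sequences")
    for i, (sentence, sentence_tags) in enumerate(zip(inputs, tags)):
        if len(sentence) != len(sentence_tags):
            problems.append(
                f"sentence {i}: {len(sentence)} tokens but {len(sentence_tags)} tags"
            )
        unknown = sorted(set(sentence_tags) - allowed)
        if unknown:
            problems.append(f"sentence {i}: unknown tags {unknown}")
    return problems


sample_inputs = [['ما', 'هي', 'مواصفات', 'البوتجاز', 'الي', 'في', 'الصورة']]
sample_tags = [['O', 'O', 'O', 'DEVICE', 'O', 'O', 'O']]
print(check_annotations(sample_inputs, sample_tags, ['DEVICE']))
```

An empty list means each sentence has one tag per token and no unexpected labels.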

CRF

CRF is a Keras layer created by subclassing. Its call takes a list or tuple containing the inputs, of shape (n+2, m), and the outputs, of shape (n+2,). An optional parameter, return_loss, is set to True by default. If return_loss is set to False, the loss is added to the final loss at the output layer of the model.
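
For readers unfamiliar with the loss a CRF layer produces, the sketch below computes the negative log-likelihood of a linear-chain CRF with NumPy. It is a generic textbook formulation, not ArabicTagger's implementation: the function names and the transitions matrix are assumptions for illustration, and it ignores the two extra positions implied by the n+2 in the shapes above.

```python
import numpy as np


def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    a_max = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(a_max, axis=axis) + np.log(np.sum(np.exp(a - a_max), axis=axis))


def crf_negative_log_likelihood(emissions, transitions, tags):
    """emissions: (T, K) per-token tag scores.
    transitions: (K, K), transitions[i, j] scores moving from tag i to tag j.
    tags: length-T sequence of integer tag ids."""
    T, K = emissions.shape
    # Score of the observed tag path.
    path_score = emissions[0, tags[0]]
    for t in range(1, T):
        path_score += transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    # Forward algorithm: alpha[j] is the log-sum of scores of all paths ending in tag j.
    alpha = emissions[0].copy()
    for t in range(1, T):
        alpha = logsumexp(alpha[:, None] + transitions, axis=0) + emissions[t]
    log_partition = logsumexp(alpha, axis=0)
    return log_partition - path_score


# With all scores zero, each of the 2**3 paths is equally likely, so the loss is log(8).
loss = crf_negative_log_likelihood(np.zeros((3, 2)), np.zeros((2, 2)), [0, 1, 0])
print(loss)
```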


NER

NER is the BI-LSTM + CRF model. Beyond the model itself, it defines additional metrics such as udt_accuracy, which calculates accuracy based on the user-defined tags. The Tagger section explains these tags in more detail.
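
One plausible reading of udt_accuracy is accuracy measured only at positions whose true tag is user-defined. The NumPy sketch below illustrates that reading; it is an assumption, not the package's actual metric, and the function name and the tag id 13 (mirroring udt = [13] in the test above) are illustrative.

```python
import numpy as np


def udt_accuracy_sketch(y_true, y_pred, udt_ids):
    """Accuracy restricted to positions whose true tag id is in udt_ids."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    mask = np.isin(y_true, udt_ids)
    if not mask.any():
        return 0.0  # no user-defined tags present in y_true
    return float(np.mean(y_true[mask] == y_pred[mask]))


y_true = [[0, 13, 0], [13, 0, 0]]
y_pred = [[0, 13, 13], [0, 0, 0]]
print(udt_accuracy_sketch(y_true, y_pred, [13]))  # 0.5: one of the two true 13s is found
```

Restricting the metric this way keeps the many 'O' positions from inflating the score when the tag of interest is rare.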


Tagger

The Tagger module supports NLP tasks, particularly Named Entity Recognition (NER). It lets users create custom NER models by annotating their own data, that is, by labeling specific words or phrases as entities of interest (e.g., "DEVICE").

The Challenge of Limited Data


Training an effective NER model often requires a substantial amount of annotated data. However, when dealing with limited datasets, especially those with a single dominant tag, achieving high accuracy can be challenging. This is because the model struggles to identify underlying patterns or structures within the data.

Tagger's Solution: Tag Expansion


The Tagger module addresses this limitation by generating additional tags using pre-trained models. This process, known as tag expansion, enriches the training data and helps the model discover hidden patterns.
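
The idea of tag expansion can be sketched in a few lines: keep the user-defined tag wherever the annotator supplied one, and fall back to a pre-trained model's tag everywhere else. This is an illustrative assumption about the merge step, not ArabicTagger's code; expand_tags is a hypothetical helper and the part-of-speech tags in the example are made up for demonstration.

```python
def expand_tags(user_tags, pretrained_tags, user_defined_tags):
    """Merge two tag sequences for one sentence.

    User-defined tags take priority; every other position receives the
    tag predicted by the pre-trained model."""
    if len(user_tags) != len(pretrained_tags):
        raise ValueError("both tag sequences must cover the same tokens")
    return [
        user_tag if user_tag in user_defined_tags else pretrained_tag
        for user_tag, pretrained_tag in zip(user_tags, pretrained_tags)
    ]


user_tags = ['O', 'O', 'O', 'DEVICE', 'O', 'O', 'O']
# Hypothetical output of a part-of-speech model for the same seven tokens.
pos_tags = ['PART', 'PRON', 'NOUN', 'NOUN', 'PRON', 'PREP', 'NOUN']
print(expand_tags(user_tags, pos_tags, ['DEVICE']))
# ['PART', 'PRON', 'NOUN', 'DEVICE', 'PRON', 'PREP', 'NOUN']
```

After expansion the single DEVICE tag sits among several informative tags instead of a run of 'O' labels.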

Introducing Our Pre-trained Models
Suppose you are building an NLP model that extracts device names from Arabic customer reviews. You need annotated data in which each word is tagged as a device or not. A sentence such as ['السلام', 'عليكم', 'كم', 'سعر', 'الخلاط'] would be annotated as ['O', 'O', 'O', 'O', 'DEVICE'].
With so little training data and only one tag, most models reach very low accuracy, because there is little structure for them to learn.
The Tagger module generates more tags around your tag using pre-trained models, so that the model can capture some structure in the data. We currently offer two pre-trained models for tag expansion:
1- CRF_model_1 (part-of-speech model with 12 tags)
The following datasets were combined and then split for training:

- Arabic data in Universal Dependencies

| Tag | Description | Arabic term and gloss |
| --- | --- | --- |
| V | Verb | فعل (fi'l) - to do, to make |
| ADJ | Adjective | صفة (sifa) - describing word |
| PART | Particle | حرف (harf) - small word with grammatical function |
| PRON | Pronoun | ضمير (ḍamīr) - word used instead of a noun |
| NUM | Number | رقم (raqm) - numeral |
| PREP | Preposition | حرف جر (harf jar) - word used before a noun to show relationship |
| PUNC | Punctuation | علامة ترقيم (ʿalāmat tarqīm) - punctuation mark |
| DET | Determiner | أل (al) - definite article |
| O | Outside | Token outside any named entity |
| ADV | Adverb | ظرف (ẓarf) - word that modifies a verb, adjective, or adverb |
| CONJ | Conjunction | حرف عطف (harf ʿaṭf) - word that connects words or sentences |
| NOUN | Noun | اسم (ism) - word that names a person, place, thing, or idea |

2- CRF_model_2 (named-entity model with 7 tags)

| Tag | Description |
| --- | --- |
| I-LOC | Inside a location entity, such as a country, city, or landmark. |
| B-LOC | Beginning of a location entity. |
| I-PER | Inside a person entity, such as a first name or last name. |
| B-PERS | Beginning of a person entity. |
| I-ORG | Inside an organization entity, such as a company or institution. |
| B-ORG | Beginning of an organization entity. |
| O | Outside of any named entity, not part of any location, person, or organization. |
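
The B-/I-/O tags above follow the usual beginning/inside/outside scheme, in which a B- tag opens an entity and the following I- tags of the same type extend it. The sketch below groups such tags back into entity spans; bio_to_spans is a hypothetical helper, not part of ArabicTagger, and the example sentence and its tags are invented for illustration.

```python
def bio_to_spans(tokens, tags):
    """Group B-/I- tagged tokens into (entity_type, tokens) spans."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "B":
            # A B- tag always starts a new entity, closing any open one.
            if current_tokens:
                spans.append((current_type, current_tokens))
            current_tokens, current_type = [token], entity_type
        elif prefix == "I" and current_tokens and entity_type == current_type:
            current_tokens.append(token)
        else:
            # 'O', or an I- tag with no matching open entity, ends the current span.
            if current_tokens:
                spans.append((current_type, current_tokens))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((current_type, current_tokens))
    return spans


tokens = ['زرت', 'جامعة', 'القاهرة', 'في', 'مصر']
tags = ['O', 'B-ORG', 'I-ORG', 'O', 'B-LOC']
print(bio_to_spans(tokens, tags))
```

For this example the result is one ORG span of two tokens followed by one LOC span of a single token.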

The following dataset was sampled and then split for training:


Evaluation Results for Food Extraction

The proposed method was evaluated on a dataset of 600 food extraction examples. Using the first model, we achieved the following results:
Training: Accuracy: 0.9797, udt_accuracy: 0.9589
Testing: Accuracy: 0.9429, udt_accuracy: 0.9291
When using the second model, the results were:
Training: Accuracy: 0.9596, udt_accuracy: 0.9611
Testing: Accuracy: 0.9526, udt_accuracy: 0.9223
Overall, both models demonstrated strong performance in extracting food-related entities from the given dataset.

Another example, provided in a Kaggle notebook, shows how to train a model to tag a stop word.

References


1- https://aclanthology.org/J96-1002.pdf
2- https://cseweb.ucsd.edu/~elkan/250Bfall2007/loglinear.pdf
