A CRF layer and BI-LSTM+CRF model implemented in Keras
ArabicTagger
ArabicTagger is a Python package that has the following components:
1- a CRF layer implemented in Keras
2- a BI-LSTM + CRF model implemented in Keras
3- tools to build and train your own Arabic NER models, using pre-existing models, with minimal lines of code and only the desired tags
Installation
# ArabicTagger is still in its beta version
# it's recommended to install it inside its own environment
pip install ArabicTagger
Test
from ArabicTagger import Tagger, NER, CRF
import tensorflow as tf

tagger = Tagger()
tagger.intialize_models()
inputs = [['السلام', 'عليكم', 'كم', 'سعر', 'الخلاط'],
          ['ما', 'هي', 'مواصفات', 'البوتجاز', 'الي', 'في', 'الصورة']]
tags = [['DEVICE', 'O', 'O', 'O', 'O'],
        ['O', 'O', 'O', 'DEVICE', 'O', 'O', 'O']]
# define the user-defined tags (udt)
user_defined_tags = ['DEVICE']
train1, _ = tagger.get_data(inputs, 7, tags, user_defined_tags)
X1, Y1 = train1
model = NER(20, 13, 7, 300, udt=[13])
model.compile(optimizer=tf.keras.optimizers.Adam(0.05))
model.fit([X1, Y1], Y1, epochs=4, batch_size=len(X1))
CRF
CRF is a Keras layer created by subclassing. The layer's call takes a list or tuple containing the inputs and the output, of shapes (n+2, m) and (n+2,) respectively. Another optional parameter is return_loss, which is set to True by default; if return_loss is set to False, the loss will be added to the final loss at the output layer of the model.
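To make the call signature concrete, here is a minimal sketch; the constructor argument and the tensors' contents are assumptions for illustration, not the package's documented API:

import tensorflow as tf
from ArabicTagger import CRF

n, m = 5, 13                               # n tokens (+2 padding positions), m tags
crf = CRF(m)                               # assumed constructor argument: number of tags

emissions = tf.random.normal((n + 2, m))   # inputs of shape (n+2, m)
tags = tf.zeros((n + 2,), dtype=tf.int32)  # output of shape (n+2,)

loss = crf([emissions, tags], return_loss=True)   # returns the CRF loss by default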
NER
NER is the BI-LSTM + CRF model. It goes beyond being a simple model by defining additional metrics internally, such as udt_accuracy, which computes the total accuracy based on the user-defined tags; this will become clearer in the Tagger section.
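As a rough illustration of what udt_accuracy measures, here is a sketch under the assumption that it restricts accuracy to positions whose true tag is user-defined (the package's internal implementation may differ):

import numpy as np

def udt_accuracy(y_true, y_pred, udt_ids):
    # accuracy over positions whose true tag is one of the user-defined tags
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    mask = np.isin(y_true, udt_ids)
    return float((y_true[mask] == y_pred[mask]).mean())

# e.g. udt_accuracy([13, 0, 0, 13], [13, 0, 1, 12], udt_ids=[13]) -> 0.5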
Tagger
The Tagger module is a valuable asset for NLP tasks, particularly Named Entity Recognition (NER). It empowers users to create custom NER models by simply annotating their data. This annotation process involves labeling specific words or phrases as entities of interest (e.g., "DEVICE").
The Challenge of Limited Data
Training an effective NER model often requires a substantial amount of annotated data. However, when dealing with limited datasets, especially those with a single dominant tag, achieving high accuracy can be challenging. This is because the model struggles to identify underlying patterns or structures within the data.
Tagger's Solution: Tag Expansion
The Tagger module addresses this limitation by generating additional tags using pre-trained models. This process, known as tag expansion, enriches the training data and helps the model discover hidden patterns.
Introducing Our Pre-trained Models
We offer two pre-trained models to facilitate tag expansion. For example, suppose you are building an NLP model that extracts device names from customers' reviews in Arabic. To do this you need annotated data in which each word is labeled as a device or not: the sentence ['السلام', 'عليكم', 'كم', 'سعر', 'الخلاط'] would be annotated as ['DEVICE', 'O', 'O', 'O', 'O']. If you tried different models on such data, you would get very low accuracy, because there is little training data and only one tag, which makes it hard for a model to find any kind of structure. The Tagger module generates additional tags around your tag using the two pre-trained models below, so that the model can capture the structure behind the data:
1- CRF_model_1 (a part-of-speech model with 12 tags)
The following datasets were combined and then split for training:
- Arabic data from Universal Dependencies
| Tag | Description | Example |
|---|---|---|
| V | Verb | فعل (fi'l) - to do, to make |
| ADJ | Adjective | صفة (sifa) - describing word |
| PART | Particle | حرف (harf) - small word with grammatical function |
| PRON | Pronoun | ضمير (ḍamīr) - word used instead of a noun |
| NUM | Number | رقم (raqm) - numeral |
| PREP | Preposition | حرف جر (harf jar) - word used before a noun to show relationship |
| PUNC | Punctuation | علامة ترقيم (ʿalamāt tarqīm) - punctuation mark |
| DET | Determiner | أل (al) - definite article |
| O | Other | token not covered by the other tags |
| ADV | Adverb | ظرف (ẓarf) - word that modifies a verb, adjective, or adverb |
| CONJ | Conjunction | حرف عطف (harf ʿaṭf) - word that connects words or sentences |
| NOUN | Noun | اسم (ism) - word that names a person, place, thing, or idea |
2- CRF_model_2 (a named-entity model with 7 tags)
| Tag | Description |
|---|---|
| I-LOC | Inside a location entity, such as a country, city, or landmark. |
| B-LOC | Beginning of a location entity. |
| I-PER | Inside a person entity, such as a first name or last name. |
| B-PERS | Beginning of a person entity. |
| I-ORG | Inside an organization entity, such as a company or institution. |
| B-ORG | Beginning of an organization entity. |
| O | Outside of any named entity, not part of any location, person, or organization. |
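For illustration, a hypothetical sentence such as ['أحمد', 'يعمل', 'في', 'جامعة', 'القاهرة'] would be tagged ['B-PERS', 'O', 'O', 'B-ORG', 'I-ORG'] under this scheme.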
The following dataset was sampled and then split for training:
Evaluation Results for Food Extraction
The proposed method was evaluated on a dataset of 600 food extraction examples, with the following results:

| Model | Split | Accuracy | udt_accuracy |
|---|---|---|---|
| CRF_model_1 | Training | 0.9797 | 0.9589 |
| CRF_model_1 | Testing | 0.9429 | 0.9291 |
| CRF_model_2 | Training | 0.9596 | 0.9611 |
| CRF_model_2 | Testing | 0.9526 | 0.9223 |
Overall, both models demonstrated strong performance in extracting food-related entities from the given dataset.
Another example, showing how to train a model to tag a stop word, is provided in a Kaggle notebook.
Arabic Text Matcher
TextMatcher is a module designed to address a common problem in Arabic NLP.
Suppose you're building a model to handle customer orders and process transactions automatically. You've implemented a Named Entity Recognition (NER) model to extract device names in Arabic, such as (غسالة, تلفاز, شاشة, مروحة). After identifying these devices, you need to match them with entries in your database to check stock availability.
However, issues arise when there's a spelling variation. For instance, a customer might type "غساله," but in your database, the device is listed as "غسالة." A standard search would fail to match these two, even though they represent the same item.
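In plain Python, the mismatch is easy to reproduce:

'غساله' == 'غسالة'   # False: the final letters differ (ه vs ة), so exact matching fails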
This issue is known as orthographic or spelling variation and is a challenge for information retrieval in Arabic. Many approaches exist to address this, and in the ArabicTagger package, I approached the problem with a learnable weighted similarity. This method reduces the number of parameters and shortens training time while improving matching accuracy across spelling variations.
Let's define a set of tuples, where each tuple holds two words: the first is the normalized word and the second is a variant of it, e.g.

K = {("غسالة", "غساله"), ("مروحة", "المروحه"), …}

and another set of tuples, where the two words in each tuple are dissimilar to each other, e.g.

J = {("غسالة", "مروحة"), ("مروحة", "تلاجة"), …}
Let $S_{k}$ be the similarity between the two elements of pair $k$ in set $K$, and $S_{j}$ the similarity between the two elements of pair $j$ in set $J$, such that:

$S_{k} = \frac{\sum_{i = 1}^{m} W_{i}^{2}\, I_{ik}\, \tilde{I}_{ik}}{\sqrt{\sum_{i = 1}^{m} I_{ik}^{2}}\, \sqrt{\sum_{i = 1}^{m} \tilde{I}_{ik}^{2}}}$

$S_{j} = \frac{\sum_{i = 1}^{m} W_{i}^{2}\, I_{ij}\, \tilde{I}_{ij}}{\sqrt{\sum_{i = 1}^{m} I_{ij}^{2}}\, \sqrt{\sum_{i = 1}^{m} \tilde{I}_{ij}^{2}}}$

where $m$ is the vector length, $W_{i}$ is a learnable parameter for each position in that vector, and $I_{ik}$ and $\tilde{I}_{ik}$ denote the $i$-th components of the vectors of the first and second word in pair $k$ (likewise $I_{ij}$ and $\tilde{I}_{ij}$ for pair $j$).
We want to maximize the average similarity of pairs in set $K$ and simultaneously minimize the average similarity of pairs in set $J$. We also want the weights $W_{i}$ to sum to 1 (this constraint can be relaxed). The problem can be formulated as follows:

$\max_{W_{i}} \; \frac{1}{n_{1}} \sum_{k = 1}^{n_{1}} S_{k} - \frac{1}{n_{2}} \sum_{j = 1}^{n_{2}} S_{j}$

$s.t. \quad \sum_{i = 1}^{m} W_{i} = 1$
This problem can be solved using the Lagrange multipliers method as follows:

$L\left( W_{i}, \lambda \right) = \frac{1}{n_{1}} \sum_{k = 1}^{n_{1}} S_{k} - \frac{1}{n_{2}} \sum_{j = 1}^{n_{2}} S_{j} - \lambda \left( \sum_{i = 1}^{m} W_{i} - 1 \right)$

$L\left( W_{i}, \lambda \right) = \frac{1}{n_{1}} \sum_{k = 1}^{n_{1}} \frac{\sum_{i = 1}^{m} W_{i}^{2}\, I_{ik}\, \tilde{I}_{ik}}{\sqrt{\sum_{t = 1}^{m} I_{tk}^{2}}\, \sqrt{\sum_{t = 1}^{m} \tilde{I}_{tk}^{2}}} - \frac{1}{n_{2}} \sum_{j = 1}^{n_{2}} \frac{\sum_{i = 1}^{m} W_{i}^{2}\, I_{ij}\, \tilde{I}_{ij}}{\sqrt{\sum_{t = 1}^{m} I_{tj}^{2}}\, \sqrt{\sum_{t = 1}^{m} \tilde{I}_{tj}^{2}}} - \lambda \left( \sum_{i = 1}^{m} W_{i} - 1 \right)$

Setting the gradient to zero (the first row holds for each position $i$, the second row is the constraint):

$\nabla L = \begin{bmatrix} \frac{1}{n_{1}} \sum_{k = 1}^{n_{1}} \frac{2 W_{i}\, I_{ik}\, \tilde{I}_{ik}}{\sqrt{\sum_{t = 1}^{m} I_{tk}^{2}}\, \sqrt{\sum_{t = 1}^{m} \tilde{I}_{tk}^{2}}} - \frac{1}{n_{2}} \sum_{j = 1}^{n_{2}} \frac{2 W_{i}\, I_{ij}\, \tilde{I}_{ij}}{\sqrt{\sum_{t = 1}^{m} I_{tj}^{2}}\, \sqrt{\sum_{t = 1}^{m} \tilde{I}_{tj}^{2}}} - \lambda \\ \sum_{i = 1}^{m} W_{i} - 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$

From the first row:

$2 W_{i} \left( \frac{1}{n_{1}} \sum_{k = 1}^{n_{1}} \frac{I_{ik}\, \tilde{I}_{ik}}{\sqrt{\sum_{t = 1}^{m} I_{tk}^{2}}\, \sqrt{\sum_{t = 1}^{m} \tilde{I}_{tk}^{2}}} - \frac{1}{n_{2}} \sum_{j = 1}^{n_{2}} \frac{I_{ij}\, \tilde{I}_{ij}}{\sqrt{\sum_{t = 1}^{m} I_{tj}^{2}}\, \sqrt{\sum_{t = 1}^{m} \tilde{I}_{tj}^{2}}} \right) = \lambda$

$W_{i} = \frac{\lambda}{2} \left( \frac{1}{n_{1}} \sum_{k = 1}^{n_{1}} \frac{I_{ik}\, \tilde{I}_{ik}}{\sqrt{\sum_{t = 1}^{m} I_{tk}^{2}}\, \sqrt{\sum_{t = 1}^{m} \tilde{I}_{tk}^{2}}} - \frac{1}{n_{2}} \sum_{j = 1}^{n_{2}} \frac{I_{ij}\, \tilde{I}_{ij}}{\sqrt{\sum_{t = 1}^{m} I_{tj}^{2}}\, \sqrt{\sum_{t = 1}^{m} \tilde{I}_{tj}^{2}}} \right)^{-1}$ (1)

Summing over $i$:

$\sum_{i = 1}^{m} W_{i} = \frac{\lambda}{2} \sum_{i = 1}^{m} \left( \frac{1}{n_{1}} \sum_{k = 1}^{n_{1}} \frac{I_{ik}\, \tilde{I}_{ik}}{\sqrt{\sum_{t = 1}^{m} I_{tk}^{2}}\, \sqrt{\sum_{t = 1}^{m} \tilde{I}_{tk}^{2}}} - \frac{1}{n_{2}} \sum_{j = 1}^{n_{2}} \frac{I_{ij}\, \tilde{I}_{ij}}{\sqrt{\sum_{t = 1}^{m} I_{tj}^{2}}\, \sqrt{\sum_{t = 1}^{m} \tilde{I}_{tj}^{2}}} \right)^{-1}$
From the constraint $\sum_{i = 1}^{m} W_{i} = 1$,

$1 = \frac{\lambda}{2} \sum_{i = 1}^{m} \left( \frac{1}{n_{1}} \sum_{k = 1}^{n_{1}} \frac{I_{ik}\, \tilde{I}_{ik}}{\sqrt{\sum_{t = 1}^{m} I_{tk}^{2}}\, \sqrt{\sum_{t = 1}^{m} \tilde{I}_{tk}^{2}}} - \frac{1}{n_{2}} \sum_{j = 1}^{n_{2}} \frac{I_{ij}\, \tilde{I}_{ij}}{\sqrt{\sum_{t = 1}^{m} I_{tj}^{2}}\, \sqrt{\sum_{t = 1}^{m} \tilde{I}_{tj}^{2}}} \right)^{-1}$

$\lambda = 2 \left( \sum_{i = 1}^{m} \left( \frac{1}{n_{1}} \sum_{k = 1}^{n_{1}} \frac{I_{ik}\, \tilde{I}_{ik}}{\sqrt{\sum_{t = 1}^{m} I_{tk}^{2}}\, \sqrt{\sum_{t = 1}^{m} \tilde{I}_{tk}^{2}}} - \frac{1}{n_{2}} \sum_{j = 1}^{n_{2}} \frac{I_{ij}\, \tilde{I}_{ij}}{\sqrt{\sum_{t = 1}^{m} I_{tj}^{2}}\, \sqrt{\sum_{t = 1}^{m} \tilde{I}_{tj}^{2}}} \right)^{-1} \right)^{-1} = 2 N_{f}$
where $N_{f}$ is a normalization factor. Substituting $\lambda$ into equation (1) gives $W_{i}$:

$W_{i} = N_{f} \left( \frac{1}{n_{1}} \sum_{k = 1}^{n_{1}} \frac{I_{ik}\, \tilde{I}_{ik}}{\sqrt{\sum_{t = 1}^{m} I_{tk}^{2}}\, \sqrt{\sum_{t = 1}^{m} \tilde{I}_{tk}^{2}}} - \frac{1}{n_{2}} \sum_{j = 1}^{n_{2}} \frac{I_{ij}\, \tilde{I}_{ij}}{\sqrt{\sum_{t = 1}^{m} I_{tj}^{2}}\, \sqrt{\sum_{t = 1}^{m} \tilde{I}_{tj}^{2}}} \right)^{-1}$
This simply means the model estimates each weight $W_{i}$ based on the inverse of the net similarity of the sets $K$ and $J$ at position $i$.
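The closed-form solution can be computed directly. Below is an illustrative NumPy sketch (not the package's internal code; the array names are mine): I_K and I_K_t stack the vectors of the first and second words of each pair in K, one row per pair, and likewise I_J and I_J_t for J.

import numpy as np

def closed_form_weights(I_K, I_K_t, I_J, I_J_t, eps=1e-12):
    # average position-wise contribution over all pairs:
    # (1/n) * sum_k I_ik * Ĩ_ik / (||I_k|| ||Ĩ_k||), one value per position i
    def avg_contrib(A, B):
        norms = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1) + eps
        return ((A * B) / norms[:, None]).mean(axis=0)

    diff = avg_contrib(I_K, I_K_t) - avg_contrib(I_J, I_J_t)
    inv = 1.0 / (diff + eps)     # W_i is proportional to the inverse term in (1)
    return inv / inv.sum()       # dividing by the sum plays the role of N_f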
This approach is similar to the attention mechanism, where the model focuses on positions that lead to high similarity in set K and low similarity in set J. The positions are defined using a specific strategy in TextMatcher as follows:
{
'ا': 0,
'ب': 1,
'ت': 2,
'ة': 3,
'ث': 4,
'ج': 5,
'ح': 6,
...
'اه': 61,
'او': 62,
'اي': 63,
'اء': 64,
'اآ': 65,
'اأ': 66,
'اؤ': 67,
'اإ': 68,
...
}
The dictionary contains 1,260 entries. We initialize an empty vector of size 1,260, then loop through each word's 1-grams and 2-grams. For each match, we update the corresponding vector position using the following formula:
Vector[i] = Vector[i] + 1 + func(i)
Here, func(i) is a position-encoding function that adjusts the value based on the position of the character or 2-gram within the word.
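As a rough sketch of this procedure (the full 1,260-entry index map and the exact func are internal to TextMatcher, so both are assumptions here):

import itertools

chars = ['ا', 'ب', 'ت', 'ة', 'ث', 'ج', 'ح']   # truncated alphabet for illustration
index = {c: i for i, c in enumerate(chars)}   # 1-gram entries
index.update({a + b: len(chars) + i           # 2-gram entries
              for i, (a, b) in enumerate(itertools.product(chars, repeat=2))})

def func(pos):
    # hypothetical position encoding: the value depends on where the
    # character or 2-gram sits inside the word
    return 1.0 / (1.0 + pos)

def vectorize(word):
    vec = [0.0] * len(index)
    grams = [(word[p], p) for p in range(len(word))] + \
            [(word[p:p + 2], p) for p in range(len(word) - 1)]
    for gram, pos in grams:
        i = index.get(gram)
        if i is not None:
            vec[i] += 1 + func(pos)           # Vector[i] = Vector[i] + 1 + func(i)
    return vec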
In future versions, I hope to add other methods to deal with this problem.
For more information about how to use TextMatcher, see the Kaggle notebook linked on GitHub.