Weak Named Entity Recognition (NER) Model

This package combines three systems for labeling named entities in text. The first system uses files containing lists of words and expressions for each NER entity class. The second system uses regex patterns to recognize part-of-speech patterns. The third system applies rules to resolve ambiguity when the first two systems cannot decide on a label.
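
As a rough sketch of how these three stages can combine (an illustration with made-up word lists and patterns, not the package's actual implementation):

import re

# Toy word lists (stage 1) and regex patterns (stage 2); all entries are hypothetical.
GAZETTEER = {'gabriel': 'PERS', 'internet': 'GEN'}
PATTERNS = [
    (re.compile(r'^\d+$'), 'NUMBER'),
    (re.compile(r'^[\w.+-]+@[\w-]+\.[\w.]+$'), 'EMAIL'),
]

def label_token(token):
    label = GAZETTEER.get(token.lower())            # stage 1: list lookup
    if label is None:
        for pattern, pattern_label in PATTERNS:     # stage 2: regex patterns
            if pattern.match(token):
                label = pattern_label
                break
    return label or 'O'                             # stage 3 would resolve remaining ambiguity

print([label_token(t) for t in 'meu contato é research@email.com'.split()])
# ['O', 'O', 'O', 'EMAIL']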

Installation

The Weak NER package can be installed from PyPI:

pip install weak_ner

Usage

NER Classes Accepted

This project uses the following NER labels and their corresponding tags:

  • Financial: FIN
  • Generic: GEN
  • Company: COMP
  • Number: NUMBER
  • Document: DOC
  • Location: LOC
  • Person: PERS
  • Phone: PHONE
  • Address: ADDR
  • Email: EMAIL
  • Date: DATE
  • Week Day: WD
  • Money: MONEY
  • Relatives: REL
  • Vocatives: VOC

A prefix is added to each label to indicate where a recognized entity begins and ends:

  • The letter B indicates the beginning of an entity of class CLASS.
  • The letter I indicates that the token is a continuation of a CLASS entity already started.
  • The letter O indicates that no entity was found for the token.

For example, the sentence "ligar internet a cabo!" would be classified as: O B-GEN I-GEN I-GEN O.

Here B-GEN marks the beginning of the GEN entity (token "internet") and the next two tokens continue that entity (tokens "a" and "cabo"). The entity found in the sentence is therefore "internet a cabo" of the GEN class.
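
If the entities are needed as spans rather than per-token tags, the BIO labels can be decoded with a small helper like the one below (an illustrative function, not part of the package):

def bio_to_entities(tokens, labels):
    # Group BIO labels back into (entity text, class) pairs.
    entities, current, current_class = [], [], None
    for token, label in zip(tokens, labels):
        if label.startswith('B-'):
            if current:
                entities.append((' '.join(current), current_class))
            current, current_class = [token], label[2:]
        elif label.startswith('I-') and current:
            current.append(token)
        else:
            if current:
                entities.append((' '.join(current), current_class))
            current, current_class = [], None
    if current:
        entities.append((' '.join(current), current_class))
    return entities

tokens = 'ligar internet a cabo !'.split()
labels = 'O B-GEN I-GEN I-GEN O'.split()
print(bio_to_entities(tokens, labels))
# [('internet a cabo', 'GEN')]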

Text Pre-Processing

By default, all text is pre-processed using the following operations:

  • Case lowering
  • Adding space around punctuation
  • Removing non-ASCII characters

It can also optionally perform the following tokenizations:

  • E-mails
  • URLs
  • Numbers
  • Codes

To enable the optional pre-processing, pass a list containing EMAIL, URL, NUMBER, and/or CODE when instantiating the class, as shown below:

tokenization_options = ['EMAIL', 'CODE']
weak_ner = WeakNER('directory_path/', tokenization_options)
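
For reference, the default pre-processing described above roughly corresponds to the following transformations (a sketch under the stated assumptions, not the package's own code):

import re

def preprocess(text):
    text = text.lower()                                        # case lowering
    text = re.sub(r'([!?.,;:])', r' \1 ', text)                # add space around punctuation
    text = text.encode('ascii', 'ignore').decode('ascii')      # remove non-ASCII characters
    return re.sub(r'\s+', ' ', text).strip()

print(preprocess('Ligar internet a cabo!'))
# 'ligar internet a cabo !'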

List Based Model

Files

To label a string with the list-based model, the user needs to create a directory containing the following files:

substantivos_meses  
substantivos_nomes  
substantivos_sobrenome
substantivos_empresas
substantivos_empresas_internacionais
substantivos_documentos
substantivos_vocativos
substantivos_paises
substantivos_cidades
substantivos_continentes
substantivos_estados
substantivos_financeiros
substantivos_dias_da_semana
substantivos_animais
substantivos_parentescos
substantivos_carros
pronomes
artigos
preposicoes
interjeicoes 

Each file should contain one word or expression of its class per line. For example, the file artigos.txt would contain the words:

a  
no  
nas
num 
numa
nuns 
numas
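
A minimal way to prepare such a directory from Python (the file names follow the list above, the word lists shown are abbreviated examples, and whether the files carry a .txt extension depends on your setup):

from pathlib import Path

# Abbreviated example word lists; in practice every file listed above must exist
# and contain the full vocabulary for its class, one entry per line.
word_lists = {
    'artigos': ['a', 'no', 'nas', 'num', 'numa', 'nuns', 'numas'],
    'substantivos_nomes': ['gabriel', 'maria'],
    'substantivos_dias_da_semana': ['segunda', 'terca', 'quarta'],
}

directory = Path('directory_path')
directory.mkdir(exist_ok=True)
for filename, words in word_lists.items():
    (directory / filename).write_text('\n'.join(words) + '\n', encoding='utf-8')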

Weak Labeling

The default weak labeling utilizes two steps:

  • The first step is the labeling module WeakNERModel, created from the word-list files passed to the class.
  • The second step is the label correction module WeakNERRules.

To use the default pipeline to label a sentence, first instantiate the class with the path of the directory where the files are stored. The class can then label a sentence by passing the sentence and its POS tags:

weak_ner = WeakNER('directory_path/')
sentence = "meu nome é Gabriel"
postags = 'PRON SUBS VERB SUBS'
labeled_sentence = weak_ner.label_sentence(sentence, postags)

The user should receive the following result:

'O B-GEN O B-PERS'
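
Since the returned string contains one tag per token, it can be paired back with the input sentence, assuming the tags stay aligned with the whitespace-split tokens as in the example above:

tokens = sentence.split()
labels = labeled_sentence.split()
print(list(zip(tokens, labels)))
# [('meu', 'O'), ('nome', 'B-GEN'), ('é', 'O'), ('Gabriel', 'B-PERS')]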

The user can also specify which optional text pre-processing will be applied to the sentence:

tokenization_options = ['EMAIL', 'CODE']
weak_ner = WeakNER('directory_path/', tokenization_options)
 sentence = "meu nome é Gabriel e meu contato é research@email.com"
postags = 'PRON SUBS VERB SUBS PREP PRON SUBS VERB SUBS'
labeled_sentence = weak_ner.label_sentence(sentences, postags)

The user should receive the following result:

'O B-GEN O B-PERS O O O O O B-EMAIL'

Contribute

If this is your first time contributing to this project, create the virtual environment using the following command:

conda env create -f env/environment.yml

Then activate the environment:

conda activate weakner_env

To test your modifications, build the package and install the resulting wheel:

pip install dist\weak_ner-0.0.1-py3-none-any.whl --force-reinstall

Then run the tests:

pytest

