Skip to main content

TEXTA Entity Linker

Project description

Installation

pip install https://pypi.texta.ee/texta-concatenator/texta-concatenator-latest.tar.gz

Description

Texta EntityLinker servers as a method to link together multiple entities in the form of Texta Facts to create a more concrete and united profile of the personal information which exists in the processed text.

This process will ONLY work on documents which have been previously processed by Texta MLP and contain the "BOUNDED" type Facts.

In addition, the EntityLinker needs an abbreviations.json file which contains key-value pairs of institutions shorthand name and full name. This package comes with a base file which it uses by default but it is always possible to change it by giving the class a filepath of the file you wish to use.

This package has no support for applying Texta MLP, you need to install the package by yourself or apply it on already processed documents.

Usage

Create an instance the class:

from texta_entitylinker.entity_linker import EntityLinker

# Using the built-in abbreviations.json file.
c = EntityLinker()

# or inserting the file by path.
c = EntityLinker(abbr_file="/home/texta/abbreviations.json")

Prepare the input you wish to parse:

Letter 1:

Dear all, 

Let`s not forget that I intend to concure the whole of Persian Empire!

Best wishes,
Alexander Great
aleksandersuur356eKr@mail.ee
phone: 76883266

Letter 2:

От: Terry Pratchett < tpratchett@gmail.com >
Кому: Joe Abercrombie < jabercrombie@gmail.com >
Название: Разъяснение

Дорогой Joe,

Как вы? Надеюсь, у тебя все хорошо. Последний месяц я писал свой новый роман, 
который обещал представить в начале лета. Я тоже немного почитал и обожаю твою 
новую книгу!

Я просто хотел уточнить, что Alexander Great жил в Македонии.

Лучший,
Terry

Letter 3:

Dear Terry!

Terry Pratchett already created Discworld. This name is taken. Other than that I found 
the piece fascanating and see great potential in you! I strongly encourage you to take 
action in publishing your works. Btw, if you would like to show your works to Pratchett 
as well, he`s interested. I talked about you to him. His email is tpratchett@gmail.com. 
Feel free to write him!

Joe


From: Terry Berry < bigfan@gmail.com >
To: Joe Abercrombie < jabercrombie@gmail.com >
Title: Question

Hi Joe,

I finally finished my draft and I`m sending it to you. The hardest part 
was creating new places. What do you think of the names of the places I created?

Terry Berry
 

Process input through the Texta MLP package:

from texta_mlp.mlp import MLP

# This folder should contain all the MLP associated models and data.
# If they don't exists, it will download them and store it at paths location,
# creating directories as needed.

# All the inputs must be processed one by one.
m = MLP(resource_dir="/home/texta/mlp_data")
mlp_analysis = m.process(letter_1)

This process does the basic Entity parsing and creates the BOUNDED facts needed for the entity linking process:

[
 {
  'doc_path': 'text.text',
  'fact': 'EMAIL',
  'lemma': None,
  'spans': '[[114, 142]]',
  'str_val': 'aleksandersuur356eKr@mail.ee'
 },

 {
  'doc_path': 'text.text',
  'fact': 'LOC',
  'lemma': None,
  'spans': '[[67, 81]]',
  'str_val': 'Persian Empire'
 },

 {
  'doc_path': 'text.text',
  'fact': 'BOUNDED',
  'lemma': "{'PER': ['Alexander Great'], 'EMAIL': "
           "['aleksandersuur356ekr@mail.ee'], 'PHONE': ['76883266']}",
  'spans': '[[98, 113], [114, 142], [151, 159]]',
  'str_val': "{'PER': ['Alexander Great'], 'EMAIL': "
             "['aleksandersuur356eKr@mail.ee'], 'PHONE': ['76883266']}"
 },

 {
  'doc_path': 'text.text',
  'fact': 'NAMEMAIL',
  'lemma': None,
  'spans': '[[98, 142]]',
  'str_val': 'Alexander Great aleksandersuur356eKr@mail.ee'
 },

 {
  'doc_path': 'text.text',
  'fact': 'PHONE',
  'lemma': None,
  'spans': '[[151, 159]]',
  'str_val': '76883266'
}
]

Load the batch into the EntityLinker:

# Note that the full result of the MLP process is necessary, 
# not only the texta_facts dictionary.
c.from_json([mlp_letter_1, mlp_letter_2, mlp_letter_3])

Trigger the process for entity linking:

# On larger datasets, this might take a long time.
c.link_entities()

Miscellaneous information:

You can check the length of the database lists and the content with functions:

  • cn._just_pers_infos() (type "close_persons", persons appearing close in letter(s)),
  • cn._bounded() (the original unconcatenated bounded),
  • cn._unsure_infos() (type "unsure_whose_entities", enities that have >=2 candidate persons, not sure to whom it belongs),
  • cn._no_personas_infos() (type "no_per_close_entities", entities appearing close in letter(s) without persons nearby),
  • cn._persona_infos() (type "person_info", the real deal, entities with its person).

Output:

After the .link_entities() function has finished its job, you can view the full results of the entity linking with:

c.to_json()

[
    {"type": "person_info", "PER": "Alexander Great", "LOC": ["Македония", "Persian Empire"], "EMAIL": ["aleksandersuur356eKr@mail.ee"], "PHONE": ["76883266"]}
    {"type": "person_info", "PER": "Joe Abercrombie", "EMAIL": ["jabercrombie@gmail.com"]}
    {"type": "person_info", "PER": "Terry Berry", "EMAIL": ["bigfan@gmail.com"]}
    {"type": "person_info", "PER": "Terry Pratchett", "EMAIL": ["tpratchett@gmail.com"]}
]

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

texta-entity-linker-1.0.2.tar.gz (27.0 kB view details)

Uploaded Source

File details

Details for the file texta-entity-linker-1.0.2.tar.gz.

File metadata

  • Download URL: texta-entity-linker-1.0.2.tar.gz
  • Upload date:
  • Size: 27.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.7.3

File hashes

Hashes for texta-entity-linker-1.0.2.tar.gz
Algorithm Hash digest
SHA256 96fde7acbfcd9c545a34f7a6ad6619fa05c9a489ae2d751f74c07194ae200e3b
MD5 81dda507a21e7979aa52c6f11385c852
BLAKE2b-256 68b63013ab3190cef610641b21dd936328c9cceb9973cb020482f0c75bf58dd7

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page