Skip to main content

A spaCy wrapper of Entity-Fishing (component) for named entity disambiguation and linking on Wikidata.

Project description

spaCy fishing

Python Version PyPI version License: MIT Tests Built with spaCy

A spaCy wrapper of Entity-Fishing, a tool for named entity disambiguation and linking on Wikidata.

This extension allows using Entity-Fishing tool as a spaCy pipeline component to disambiguate and link named entities (with custom or pretrained NER spaCy models) to Wikidata knowledge base.

Table of contents

Installation

normal

pip install spacyfishing

development

git clone https://github.com/Lucaterre/spacyfishing.git
virtualenv --python=/usr/bin/python3.8 venv
source venv/bin/activate
pip install -r requirements.txt

Usage

First, install a pre-trained spaCy language model for the NER task:

python -m spacy download en_core_web_sm

Note that it is possible to use custom NER model.

Simple example

import spacy 

text_en = "Victor Hugo and Honoré de Balzac are French writers who lived in Paris."

nlp_model_en = spacy.load("en_core_web_sm")

nlp_model_en.add_pipe("entityfishing")

doc_en = nlp_model_en(text_en)

for ent in doc_en.ents:
        print((ent.text, ent.label_, ent._.kb_qid, ent._.url_wikidata, ent._.nerd_score))
('Victor Hugo', 'PERSON', 'Q535', 'https://www.wikidata.org/wiki/Q535', 0.972)
('Honoré de Balzac', 'PERSON', 'Q9711', 'https://www.wikidata.org/wiki/Q9711', 0.9724)
('French', 'NORP', 'Q121842', 'https://www.wikidata.org/wiki/Q121842', 0.3739)
('Paris', 'GPE', 'Q90', 'https://www.wikidata.org/wiki/Q90', 0.5652)

Get Wikidata description

By default, the Wikidata entity description is not provided, to access it, specify True in the description_required parameter in the component configuration:

import spacy 

text_en = "Victor Hugo and Honoré de Balzac are French writers who lived in Paris."

nlp_model_en = spacy.load("en_core_web_sm")

# specify configuration:
nlp_model_en.add_pipe("entityfishing", config={"description_required": True})

doc_en = nlp_model_en(text_en)

# Access to description with ent._.description:
for ent in doc_en.ents:
        print((ent.text, ent.label_, ent._.kb_qid, ent._.description))
('Victor Hugo', 'PERSON', 'Q535', "'''''' (; 26 February 1802 – 22 May 1885) was a French poet, novelist, and dramatist of the [[Romanticism|Romantic movement]]. Hugo is considered to be one of the greatest and best-known French writers. Outside of France, his most famous works are the novels '''', 1862, and ''[[The Hunchback of Notre-Dame]]'', 1831. In France, Hugo is known primarily for his poetry collections, such as '''' (''The Contemplations'') and '''' (''The Legend of the Ages'').")
('Honoré de Balzac', 'PERSON', 'Q9711', "'''Honoré de Balzac''' (;, born '''Honoré Balzac''', 20 May 1799 \xa0 – 18 August 1850) was a French novelist and playwright. The [[novel sequence]] ''[[La Comédie Humaine]]'', which presents a panorama of [[Napoleonic Era|post-Napoleonic]] French life, is generally viewed as his ''[[Masterpiece|magnum opus]]''.")
('French', 'NORP', 'Q121842', "The '''French''' are an [[ethnic group]] and [[nation]] who are identified with the country of [[France]]. This connection may be legal, historical, or cultural.")
('Paris', 'GPE', 'Q90', "'''Paris''' is the [[Capital city|capital]] and most populous [[city]] of [[France]], with an administrative-limits area of and a 2015 population of 2,229,621. The city is a [[Communes of France|commune]] and [[Departments of France|department]], and the capital-heart of the [[Île-de-France]] ''[[Region in France|region]]'' (colloquially known as the 'Paris Region'), whose 12,142,802 2016 population represents roughly 18 percent of the population of France. By the 17th century, Paris had become one of Europe's major centres of finance, commerce, fashion, science, and the arts, a position that it retains still today. The Paris Region had a [[GDP]] of €649.6 billion (US $763.4 billion) in 2014, accounting for 30.4 percent of the GDP of France. According to official estimates, in 2013-14 the Paris Region had [[List of cities by GDP|the third-highest GDP in the world and the largest regional GDP in the EU]].")

Use other languages

By default, disambiguation model resources are set to English, to use other languages, specify the language code in the language parameter in the component configuration:

import spacy 

text_fr = "La bataille d'El-Alamein en Égypte oppose la 8e armée britannique dirigée par Bernard Montgomery aux divisions d'Erwin Rommel."

nlp_model_fr = spacy.load("fr_core_news_sm")

nlp_model_fr.add_pipe("entityfishing", config={"language": "fr"})

doc_fr = nlp_model_fr(text_fr)

for ent in doc_fr.ents:
        print((ent.text, ent.label_, ent._.kb_qid, ent._.url_wikidata))
("bataille d'El-Alamein", 'MISC', 'Q153376', 'https://www.wikidata.org/wiki/Q153376')
('Égypte', 'LOC', 'Q79', 'https://www.wikidata.org/wiki/Q79')
('Bernard Montgomery', 'PER', 'Q152025', 'https://www.wikidata.org/wiki/Q152025')
('Erwin Rommel', 'PER', 'Q14060', 'https://www.wikidata.org/wiki/Q14060')

To consult the languages available in Entity-Fishing follow this link.

Get information about Entity-fishing API response

The raw response of the Entity-fishing API can be accessed in the doc object:

doc._.annotations
{'runtime': 18, 
 'nbest': False, 
 'text': "La bataille d'El-Alamein en Égypte oppose la 8e armée britannique dirigée par Bernard Montgomery aux divisions d'Erwin Rommel.", 
 'language': {'lang': 'fr', 'conf': 0.0}, 
 'global_categories': [
  {'weight': 0.01960784313725492, 'source': 'wikipedia-fr', 'category': 'Général allemand', 'page_id': 94081}, 
  {'weight': 0.01960784313725492, 'source': 'wikipedia-fr', 'category': "Chevalier grand-croix de l'ordre du Bain", 'page_id': 3399090}, 
  {'weight': 0.01960784313725492, 'source': 'wikipedia-fr', 'category': 'Naissance à Heidenheim an der Brenz', 'page_id': 8396804}
  ], 
  'entities': [
  {'rawName': "bataille d'El-Alamein", 'offsetStart': 1, 'offsetEnd': 6, 'nerd_score': 1, 'nerd_selection_score': 0.9198, 'wikipediaExternalRef': 261761, 'wikidataId': 'Q153376', 'domains': ['Military']},
  {'rawName': 'Égypte', 'offsetStart': 7, 'offsetEnd': 8, 'nerd_score': 1, 'nerd_selection_score': 0.6437, 'wikipediaExternalRef': 4011, 'wikidataId': 'Q79', 'domains': ['Geology']}, 
  {'rawName': 'Bernard Montgomery', 'offsetStart': 15, 'offsetEnd': 17, 'nerd_score': 1, 'nerd_selection_score': 0.9965, 'wikipediaExternalRef': 46225, 'wikidataId': 'Q152025', 'domains': ['Biology', 'Military']}, 
  {'rawName': 'Erwin Rommel', 'offsetStart': 20, 'offsetEnd': 22, 'nerd_score': 1, 'nerd_selection_score': 0.9955, 'wikipediaExternalRef': 46221, 'wikidataId': 'Q14060', 'domains': ['Military']}
  ]
 }

To access to query metadata for the response by Entity-Fishing API :

doc._.metadata
{'status_code': 200, 'reason': 'OK', 'ok': True, 'encoding': 'utf-8'}

Configuration parameters

- api_ef_base          : URL of the entity-fishing API endpoint. Defaults to Huma-Num server.
- language             : Specify language of KB ressources for entity-fishing API. Defaults to "en".
- description_required : Get Wikidata description from service "concept look-up" of entity-fishing API. Defaults to false. 

Attributes


  • Doc extensions:

    doc._.annotations :  Raw response from Entity-Fishing API.
    doc._.metadata    :  Raw information about request and response from Entity-Fishing API.
    
  • Span extensions:

    span._.kb_qid       : Wikidata identifier (QID). 
    span._.description  : Short Wikidata description.
    span._.url_wikidata : URL to Wikidata ressource.
    span._.nerd_score   : Selection confidence score for the disambiguated entity.
    

For details on Entity-Fishing response, follow this link

Recommendations

If Entity-fishing is deployed locally or on specific server, specify the URL of the new Entity-fishing API endpoint in the config:

nlp.add_pipe("entityfishing", config={"api_ef_base": "<api-endpoint>"})

This can be useful, if you work with more recent dumps of Wikidata knowledge base or increase speed in the pipeline.

To install your own instance of Entity-Fishing, follow this link.

Visualise results

To visualize the named entities and their wikidata links, use the manual option of displaCy:

import spacy 

text_fr = "La bataille d'El-Alamein en Égypte oppose la 8e armée britannique dirigée par Bernard Montgomery aux divisions d'Erwin Rommel."
nlp_model_fr = spacy.load("fr_core_news_sm")
nlp_model_fr.add_pipe("entityfishing", config={"language": "fr"})
doc_fr = nlp_model_fr(text_fr)

options = {
    "ents": ["MISC", "LOC", "PER"],
    "colors": {"LOC": "#82e0aa", "PER": "#85c1e9", "MISC": "#f0b27a"}
}

params = {"text": doc_fr.text,
          "ents": [{"start": ent.start_char,
                    "end": ent.end_char,
                    "label": ent.label_,
                    "kb_id": ent._.kb_qid,
                    "kb_url": ent._.url_wikidata}
                     for ent in doc_fr.ents],
          "title": None}

spacy.displacy.serve(params, style="ent", manual=True, options=options)
vizualizer

The visualizer is serving on http://0.0.0.0:5000

External Ressources

Specifications

This component is experimental, it may be used for research, and it may evolve in the future.

The structure component was inspired by another spacy extension called spaCyOpenTapioca.

Entity-Fishing is tool created by Patrice Lopez (SCIENCE-MINER), with contributions of Inria Paris, and distributed under Apache 2.0 license.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

spacyfishing-0.1.0.tar.gz (6.9 kB view details)

Uploaded Source

Built Distribution

spacyfishing-0.1.0-py3-none-any.whl (6.6 kB view details)

Uploaded Python 3

File details

Details for the file spacyfishing-0.1.0.tar.gz.

File metadata

  • Download URL: spacyfishing-0.1.0.tar.gz
  • Upload date:
  • Size: 6.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.10.4

File hashes

Hashes for spacyfishing-0.1.0.tar.gz
Algorithm Hash digest
SHA256 1744a89ed50af7a7a2f06f3b3975e82c4b81eb61cd0ad38e3caa0a5c79a9e186
MD5 a7f53729490c769fc06360138f9e19b2
BLAKE2b-256 393ff862e35a4ddccd2cd0f0ed339b3ba39ccd1b74a9314cce402a84d6302d7e

See more details on using hashes here.

File details

Details for the file spacyfishing-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for spacyfishing-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 9a4c8f4cf8da2d0bd49211d10d489d91222f6688a74cfee078756bb0f87c9883
MD5 c82e04a3cc0e17ff544d7a831c5a85d2
BLAKE2b-256 82434a9b6ce3fe12ff0107df8b6551721c67d9b4e42e9f6c3bf1b688fb1e9594

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page