A spaCy wrapper of Entity-Fishing (component) for named entity disambiguation and linking on Wikidata.
Project description
spaCy fishing
A spaCy wrapper of Entity-Fishing, a tool for named entity disambiguation and linking on Wikidata.
This extension allows using Entity-Fishing tool as a spaCy pipeline component to disambiguate and link named entities (with custom or pretrained NER spaCy models) to Wikidata knowledge base.
Table of contents
- Installation
- Usage (examples)
- Configuration parameters
- Attributes
- Recommendations
- Visualise results
- External Ressources
Installation
normal
pip install spacyfishing
development
git clone https://github.com/Lucaterre/spacyfishing.git
virtualvenv --python=/usr/bin/python3.8 venv
source venv/bin/activate
pip install -r requirements.txt
Usage
First, install a pre-trained spaCy language model for the NER task:
python -m spacy download en_core_web_sm
Note that it is possible to use custom NER model.
Simple example
import spacy
text_en = "Victor Hugo and Honoré de Balzac are French writers who lived in Paris."
nlp_model_en = spacy.load("en_core_web_sm")
nlp_model_en.add_pipe("entityfishing")
doc_en = nlp_model_en(text_en)
for ent in doc_en.ents:
print((ent.text, ent.label_, ent._.kb_qid, ent._.url_wikidata, ent._.nerd_score))
('Victor Hugo', 'PERSON', 'Q535', 'https://www.wikidata.org/wiki/Q535', 0.972)
('Honoré de Balzac', 'PERSON', 'Q9711', 'https://www.wikidata.org/wiki/Q9711', 0.9724)
('French', 'NORP', 'Q121842', 'https://www.wikidata.org/wiki/Q121842', 0.3739)
('Paris', 'GPE', 'Q90', 'https://www.wikidata.org/wiki/Q90', 0.5652)
Get Wikidata description
By default, the Wikidata entity description is not provided, to access it,
specify True
in the description_required
parameter in the component configuration:
import spacy
text_en = "Victor Hugo and Honoré de Balzac are French writers who lived in Paris."
nlp_model_en = spacy.load("en_core_web_sm")
# specify configuration:
nlp_model_en.add_pipe("entityfishing", config={"description_required": True})
doc_en = nlp_model_en(text_en)
# Access to description with ent._.description:
for ent in doc_en.ents:
print((ent.text, ent.label_, ent._.kb_qid, ent._.description))
('Victor Hugo', 'PERSON', 'Q535', "'''''' (; 26 February 1802 – 22 May 1885) was a French poet, novelist, and dramatist of the [[Romanticism|Romantic movement]]. Hugo is considered to be one of the greatest and best-known French writers. Outside of France, his most famous works are the novels '''', 1862, and ''[[The Hunchback of Notre-Dame]]'', 1831. In France, Hugo is known primarily for his poetry collections, such as '''' (''The Contemplations'') and '''' (''The Legend of the Ages'').")
('Honoré de Balzac', 'PERSON', 'Q9711', "'''Honoré de Balzac''' (;, born '''Honoré Balzac''', 20 May 1799 \xa0 – 18 August 1850) was a French novelist and playwright. The [[novel sequence]] ''[[La Comédie Humaine]]'', which presents a panorama of [[Napoleonic Era|post-Napoleonic]] French life, is generally viewed as his ''[[Masterpiece|magnum opus]]''.")
('French', 'NORP', 'Q121842', "The '''French''' are an [[ethnic group]] and [[nation]] who are identified with the country of [[France]]. This connection may be legal, historical, or cultural.")
('Paris', 'GPE', 'Q90', "'''Paris''' is the [[Capital city|capital]] and most populous [[city]] of [[France]], with an administrative-limits area of and a 2015 population of 2,229,621. The city is a [[Communes of France|commune]] and [[Departments of France|department]], and the capital-heart of the [[Île-de-France]] ''[[Region in France|region]]'' (colloquially known as the 'Paris Region'), whose 12,142,802 2016 population represents roughly 18 percent of the population of France. By the 17th century, Paris had become one of Europe's major centres of finance, commerce, fashion, science, and the arts, a position that it retains still today. The Paris Region had a [[GDP]] of €649.6 billion (US $763.4 billion) in 2014, accounting for 30.4 percent of the GDP of France. According to official estimates, in 2013-14 the Paris Region had [[List of cities by GDP|the third-highest GDP in the world and the largest regional GDP in the EU]].")
Use other languages
By default, disambiguation model resources are set to English, to use other languages, specify the language code in the language
parameter in the component configuration:
import spacy
text_fr = "La bataille d'El-Alamein en Égypte oppose la 8e armée britannique dirigée par Bernard Montgomery aux divisions d'Erwin Rommel."
nlp_model_fr = spacy.load("fr_core_news_sm")
nlp_model_fr.add_pipe("entityfishing", config={"language": "fr"})
doc_fr = nlp_model_fr(text_fr)
for ent in doc_fr.ents:
print((ent.text, ent.label_, ent._.kb_qid, ent._.url_wikidata))
("bataille d'El-Alamein", 'MISC', 'Q153376', 'https://www.wikidata.org/wiki/Q153376')
('Égypte', 'LOC', 'Q79', 'https://www.wikidata.org/wiki/Q79')
('Bernard Montgomery', 'PER', 'Q152025', 'https://www.wikidata.org/wiki/Q152025')
('Erwin Rommel', 'PER', 'Q14060', 'https://www.wikidata.org/wiki/Q14060')
To consult the languages available in Entity-Fishing follow this link.
Get information about Entity-fishing API response
The raw response of the Entity-fishing API can be accessed in the doc object:
doc._.annotations
{'runtime': 18,
'nbest': False,
'text': "La bataille d'El-Alamein en Égypte oppose la 8e armée britannique dirigée par Bernard Montgomery aux divisions d'Erwin Rommel.",
'language': {'lang': 'fr', 'conf': 0.0},
'global_categories': [
{'weight': 0.01960784313725492, 'source': 'wikipedia-fr', 'category': 'Général allemand', 'page_id': 94081},
{'weight': 0.01960784313725492, 'source': 'wikipedia-fr', 'category': "Chevalier grand-croix de l'ordre du Bain", 'page_id': 3399090},
{'weight': 0.01960784313725492, 'source': 'wikipedia-fr', 'category': 'Naissance à Heidenheim an der Brenz', 'page_id': 8396804}
],
'entities': [
{'rawName': "bataille d'El-Alamein", 'offsetStart': 1, 'offsetEnd': 6, 'nerd_score': 1, 'nerd_selection_score': 0.9198, 'wikipediaExternalRef': 261761, 'wikidataId': 'Q153376', 'domains': ['Military']},
{'rawName': 'Égypte', 'offsetStart': 7, 'offsetEnd': 8, 'nerd_score': 1, 'nerd_selection_score': 0.6437, 'wikipediaExternalRef': 4011, 'wikidataId': 'Q79', 'domains': ['Geology']},
{'rawName': 'Bernard Montgomery', 'offsetStart': 15, 'offsetEnd': 17, 'nerd_score': 1, 'nerd_selection_score': 0.9965, 'wikipediaExternalRef': 46225, 'wikidataId': 'Q152025', 'domains': ['Biology', 'Military']},
{'rawName': 'Erwin Rommel', 'offsetStart': 20, 'offsetEnd': 22, 'nerd_score': 1, 'nerd_selection_score': 0.9955, 'wikipediaExternalRef': 46221, 'wikidataId': 'Q14060', 'domains': ['Military']}
]
}
To access to query metadata for the response by Entity-Fishing API :
doc._.metadata
{'status_code': 200, 'reason': 'OK', 'ok': True, 'encoding': 'utf-8'}
Configuration parameters
- api_ef_base : URL of the entity-fishing API endpoint. Defaults to Huma-Num server.
- language : Specify language of KB ressources for entity-fishing API. Defaults to "en".
- description_required : Get Wikidata description from service "concept look-up" of entity-fishing API. Defaults to false.
Attributes
-
Doc extensions:
doc._.annotations : Raw response from Entity-Fishing API. doc._.metadata : Raw information about request and response from Entity-Fishing API.
-
Span extensions:
span._.kb_qid : Wikidata identifier (QID). span._.description : Short Wikidata description. span._.url_wikidata : URL to Wikidata ressource. span._.nerd_score : Selection confidence score for the disambiguated entity.
For details on Entity-Fishing response, follow this link
Recommendations
If Entity-fishing is deployed locally or on specific server, specify the URL of the new Entity-fishing API endpoint in the config:
nlp.add_pipe("entityfishing", config={"api_ef_base": "<api-endpoint>"})
This can be useful, if you work with more recent dumps of Wikidata knowledge base or increase speed in the pipeline.
To install your own instance of Entity-Fishing, follow this link.
Visualise results
To visualize the named entities and their wikidata links, use the manual option of displaCy:
import spacy
text_fr = "La bataille d'El-Alamein en Égypte oppose la 8e armée britannique dirigée par Bernard Montgomery aux divisions d'Erwin Rommel."
nlp_model_fr = spacy.load("fr_core_news_sm")
nlp_model_fr.add_pipe("entityfishing", config={"language": "fr"})
doc_fr = nlp_model_fr(text_fr)
options = {
"ents": ["MISC", "LOC", "PER"],
"colors": {"LOC": "#82e0aa", "PER": "#85c1e9", "MISC": "#f0b27a"}
}
params = {"text": doc_fr.text,
"ents": [{"start": ent.start_char,
"end": ent.end_char,
"label": ent.label_,
"kb_id": ent._.kb_qid,
"kb_url": ent._.url_wikidata}
for ent in doc_fr.ents],
"title": None}
spacy.displacy.serve(params, style="ent", manual=True, options=options)
The visualizer is serving on http://0.0.0.0:5000
External Ressources
Specifications
This component is experimental, it may be used for research, and it may evolve in the future.
The structure component was inspired by another spacy extension called spaCyOpenTapioca.
Entity-Fishing is tool created by Patrice Lopez (SCIENCE-MINER), with contributions of Inria Paris, and distributed under Apache 2.0 license.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file spacyfishing-0.1.0b0.tar.gz
.
File metadata
- Download URL: spacyfishing-0.1.0b0.tar.gz
- Upload date:
- Size: 6.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.10.4
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | ae18b848e0144983002ce4c808c334a6e12231479fd38dac5dcbf3a3215fc824 |
|
MD5 | 490c89a9abfe290cf8d4a52776f12d90 |
|
BLAKE2b-256 | 386d476aeb47b06ee0914038d9db0fc691b8fb02f4f4b8f965af37b79e24fb30 |
File details
Details for the file spacyfishing-0.1.0b0-py3-none-any.whl
.
File metadata
- Download URL: spacyfishing-0.1.0b0-py3-none-any.whl
- Upload date:
- Size: 6.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.10.4
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | fe2b1893149a7b8a1258d8082eeb235b1783c5dccabfad8eb80798776a0dbaec |
|
MD5 | da8c709a19a60896acc81b89217b486d |
|
BLAKE2b-256 | aa78a873fcea0a488efd32ba78ae07d0354c5938488b581507e6bdd450c0cdd8 |