spaCyOpenTapioca
A spaCy wrapper of OpenTapioca for named entity linking on Wikidata.
Installation
pip install spacyopentapioca
or install from source:
git clone https://github.com/UB-Mannheim/spacyopentapioca
cd spacyopentapioca/
pip install .
How to use
After installation, the OpenTapioca pipeline component can be used on its own, without any other pipeline components:
import spacy
nlp = spacy.blank("en")
nlp.add_pipe('opentapioca')
doc = nlp("Christian Drosten works in Germany.")
for span in doc.ents:
    print((span.text, span.kb_id_, span.label_, span._.description, span._.score))
('Christian Drosten', 'Q1079331', 'PERSON', 'German virologist and university teacher', 3.6533377082098895)
('Germany', 'Q183', 'LOC', 'sovereign state in Central Europe', 2.1099332471902863)
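The score can be used to keep only the more confident links. The following is a minimal sketch that works on the tuples printed above rather than on a live doc. The helper name filter_by_score and the threshold 3.0 are illustrative assumptions, not part of spaCyOpenTapioca.

```python
# Tuples copied from the output above: (text, QID, label, description, score).
linked = [
    ("Christian Drosten", "Q1079331", "PERSON",
     "German virologist and university teacher", 3.6533377082098895),
    ("Germany", "Q183", "LOC",
     "sovereign state in Central Europe", 2.1099332471902863),
]


def filter_by_score(entities, min_score):
    """Hypothetical helper: keep entities whose score is at least min_score."""
    return [entity for entity in entities if entity[4] >= min_score]


# 3.0 is an arbitrary threshold chosen only for illustration.
confident = filter_by_score(linked, min_score=3.0)
print([entity[0] for entity in confident])
```

With a real doc, the same comparison could be made on span._.score while iterating over doc.ents.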
The types and aliases are also available:
for span in doc.ents:
    print((span._.types, span._.aliases[0:5]))
({'Q43229': False, 'Q618123': False, 'Q5': True, 'P2427': False, 'P1566': False, 'P496': True}, ['كريستيان دروستين', 'Крістіан Дростен', 'Christian Heinrich Maria Drosten', 'کریستین دروستن', '크리스티안 드로스텐'])
({'Q43229': True, 'Q618123': True, 'Q5': False, 'P2427': False, 'P1566': True, 'P496': False}, ['IJalimani', 'R. F. A.', 'Alemania', '도이칠란트', 'Germaniya'])
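Each types value maps Wikidata identifiers to booleans, so the identifiers that apply to an entity can be read off by keeping the keys whose value is True. A small sketch on the dictionaries printed above; the helper name matched_types is an assumption made for this example.

```python
# Dictionaries copied from the output above.
drosten_types = {'Q43229': False, 'Q618123': False, 'Q5': True,
                 'P2427': False, 'P1566': False, 'P496': True}
germany_types = {'Q43229': True, 'Q618123': True, 'Q5': False,
                 'P2427': False, 'P1566': True, 'P496': False}


def matched_types(types):
    """Hypothetical helper: return the identifiers flagged True, in dict order."""
    return [identifier for identifier, flag in types.items() if flag]


print(matched_types(drosten_types))
print(matched_types(germany_types))
```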
The Wikidata QIDs are attached to tokens:
for token in doc:
    print((token.text, token.ent_kb_id_))
('Christian', 'Q1079331')
('Drosten', 'Q1079331')
('works', '')
('in', '')
('Germany', 'Q183')
('.', '')
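Because every token of a mention carries the same QID, consecutive tokens can be grouped back into mentions. The sketch below does this on the (text, QID) pairs printed above; group_mentions is a hypothetical helper. It joins tokens with single spaces, which only approximates the original text, and it would merge two directly adjacent mentions of the same entity into one.

```python
from itertools import groupby

# (token text, QID) pairs copied from the output above.
tokens = [
    ('Christian', 'Q1079331'),
    ('Drosten', 'Q1079331'),
    ('works', ''),
    ('in', ''),
    ('Germany', 'Q183'),
    ('.', ''),
]


def group_mentions(token_pairs):
    """Hypothetical helper: merge runs of tokens that share a non-empty QID."""
    mentions = []
    for qid, run in groupby(token_pairs, key=lambda pair: pair[1]):
        if qid:  # skip runs of tokens without a linked entity
            mentions.append((" ".join(text for text, _ in run), qid))
    return mentions


print(group_mentions(tokens))
```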
The raw response of the OpenTapioca API can be accessed through the doc and span objects:
raw_annotations1 = doc._.annotations
raw_annotations2 = [span._.annotations for span in doc.ents]
Partial metadata for the response returned by the OpenTapioca API is available in:
doc._.metadata
The available span extensions are:
span._.annotations
span._.description
span._.aliases
span._.rank
span._.score
span._.types
span._.label
span._.extra_aliases
span._.nb_sitelinks
span._.nb_statements
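To export results, the extensions listed above can be collected into a plain dictionary per entity. This is a sketch only: span_to_record is a hypothetical helper, and the SimpleNamespace object stands in for a real spaCy span so that the example runs without spaCy or network access.

```python
from types import SimpleNamespace

# Extension names taken from the list above.
EXTENSIONS = [
    "annotations", "description", "aliases", "rank", "score",
    "types", "label", "extra_aliases", "nb_sitelinks", "nb_statements",
]


def span_to_record(span):
    """Hypothetical helper: copy a span's text, QID and extensions into a dict."""
    record = {"text": span.text, "kb_id": span.kb_id_}
    for name in EXTENSIONS:
        # Missing extensions are recorded as None instead of raising.
        record[name] = getattr(span._, name, None)
    return record


# Stand-in for a real span; only two extensions are filled in here.
fake_span = SimpleNamespace(
    text="Germany",
    kb_id_="Q183",
    _=SimpleNamespace(description="sovereign state in Central Europe",
                      score=2.1099332471902863),
)
record = span_to_record(fake_span)
print(record["text"], record["kb_id"], record["score"])
```

With a real doc, the helper would be called as [span_to_record(span) for span in doc.ents].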
Note that spaCyOpenTapioca applies only minimal processing to the entities that appear in doc.ents. All entities returned by OpenTapioca can be found in doc.spans['all_entities_opentapioca'].
Local OpenTapioca
If OpenTapioca is deployed locally, specify the URL of the new OpenTapioca API in the config:
import spacy
nlp = spacy.blank("en")
# OpenTapiocaAPI is a placeholder: set it to the URL of your local OpenTapioca API
nlp.add_pipe('opentapioca', config={"url": OpenTapiocaAPI})
doc = nlp("Christian Drosten works in Germany.")
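One way to avoid hard-coding the endpoint is to read it from the environment and build the config dictionary from that. In this sketch the variable name OPENTAPIOCA_URL, the helper build_opentapioca_config, and both URLs (including the /api/annotate path) are assumptions for illustration; use the address at which your own deployment is reachable.

```python
import os


def build_opentapioca_config(env, default_url="http://localhost:8080/api/annotate"):
    """Hypothetical helper: build the component config from an environment mapping.

    default_url is a placeholder, not a documented default of the package.
    """
    return {"url": env.get("OPENTAPIOCA_URL", default_url)}


# Example with an explicit mapping so the result does not depend on the machine.
config = build_opentapioca_config(
    {"OPENTAPIOCA_URL": "http://127.0.0.1:9000/api/annotate"}
)
print(config)

# In an application one would typically pass os.environ instead:
# nlp.add_pipe('opentapioca', config=build_opentapioca_config(os.environ))
```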
Visualization
NEL visualization was added to spaCy via pull request 9199 for issue 9129. It is supported in spaCy >= 3.1.4.
Use the manual option in displaCy:
import spacy
nlp = spacy.blank("en")
nlp.add_pipe('opentapioca')
doc = nlp("Christian Drosten works\n in Charité, Germany.")
params = {"text": doc.text,
          "ents": [{"start": ent.start_char,
                    "end": ent.end_char,
                    "label": ent.label_,
                    "kb_id": ent.kb_id_,
                    "kb_url": "https://www.wikidata.org/entity/" + ent.kb_id_}
                   for ent in doc.ents],
          "title": None}
spacy.displacy.serve(params, style="ent", manual=True)
The visualizer is then served at http://0.0.0.0:5000.
In a Jupyter Notebook, replace spacy.displacy.serve with spacy.displacy.render.
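The dictionary passed to displaCy in manual mode can also be assembled by a small function, which makes it easier to reuse and to check without starting the server. The sketch below builds it from plain (start, end, label, QID) tuples; build_displacy_params is a hypothetical helper, and the character offsets were worked out by hand for the shorter example sentence used earlier.

```python
WIKIDATA_ENTITY_URL = "https://www.wikidata.org/entity/"


def build_displacy_params(text, entities, title=None):
    """Hypothetical helper: build the input dict for displaCy's manual 'ent' mode.

    entities is an iterable of (start_char, end_char, label, kb_id) tuples.
    """
    return {
        "text": text,
        "ents": [
            {"start": start, "end": end, "label": label,
             "kb_id": kb_id, "kb_url": WIKIDATA_ENTITY_URL + kb_id}
            for start, end, label, kb_id in entities
        ],
        "title": title,
    }


text = "Christian Drosten works in Germany."
entities = [(0, 17, "PERSON", "Q1079331"), (27, 34, "LOC", "Q183")]
params = build_displacy_params(text, entities)
print(params["ents"][1]["kb_url"])
```

The resulting params could then be passed to spacy.displacy.serve or spacy.displacy.render with style="ent" and manual=True, as in the example above.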
Hashes for spacyopentapioca-0.1.3-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 2c6a2e4eb156c575eb32028c284b9d8001e7392c0d1fa895c39e0493bf3b0618
MD5 | 1c996d4760b7de8b60f507c71d28bf13
BLAKE2b-256 | 3ed6aedf789a3f598a0f1ef3be625edd453cc7b103d560b14162412f5fd38a76