A spaCy wrapper of OpenTapioca for named entity linking on Wikidata

spaCyOpenTapioca

Installation

pip install spacyopentapioca

or

git clone https://github.com/UB-Mannheim/spacyopentapioca
cd spacyopentapioca/
pip install .

How to use

After installation, the OpenTapioca pipeline can be used on its own, without any other pipeline components:

import spacy
nlp = spacy.blank("en")
nlp.add_pipe('opentapioca')
doc = nlp("Christian Drosten works in Germany.")
for span in doc.ents:
    print((span.text, span.kb_id_, span.label_, span._.description, span._.score))
('Christian Drosten', 'Q1079331', 'PERSON', 'German virologist and university teacher', 3.6533377082098895)
('Germany', 'Q183', 'LOC', 'sovereign state in Central Europe', 2.1099332471902863)

The types and aliases are also available:

for span in doc.ents:
    print((span._.types, span._.aliases[0:5]))
({'Q43229': False, 'Q618123': False, 'Q5': True, 'P2427': False, 'P1566': False, 'P496': True}, ['كريستيان دروستين', 'Крістіан Дростен', 'Christian Heinrich Maria Drosten', 'کریستین دروستن', '크리스티안 드로스텐'])
({'Q43229': True, 'Q618123': True, 'Q5': False, 'P2427': False, 'P1566': True, 'P496': False}, ['IJalimani', 'R. F. A.', 'Alemania', '도이칠란트', 'Germaniya'])
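
The `types` value shown above is a dictionary mapping Wikidata IDs to booleans. As a minimal sketch, assuming the dictionary always has that shape, a hypothetical helper `matching_types` (not part of spaCyOpenTapioca) picks out the IDs flagged `True`:

```python
def matching_types(types):
    """Return the Wikidata IDs whose flag is True, in dictionary order.

    Hypothetical helper, not part of spaCyOpenTapioca. It assumes `types`
    maps Wikidata IDs to booleans, as in the output shown above.
    """
    return [wikidata_id for wikidata_id, flag in types.items() if flag]


# Value of span._.types for 'Christian Drosten', copied from the output above.
drosten_types = {'Q43229': False, 'Q618123': False, 'Q5': True,
                 'P2427': False, 'P1566': False, 'P496': True}
drosten_matches = matching_types(drosten_types)
print(drosten_matches)
```

With a live pipeline, the same helper could be called as `matching_types(span._.types)` for each span in `doc.ents`.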

The Wikidata QIDs are attached to tokens:

for token in doc:
    print((token.text, token.ent_kb_id_))
('Christian', 'Q1079331')
('Drosten', 'Q1079331')
('works', '')
('in', '')
('Germany', 'Q183')
('.', '')
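
Because each token carries the QID of the entity it belongs to, mentions can be rebuilt from the token stream alone. The following is a rough sketch using a hypothetical helper `merge_linked_tokens` (not part of spaCyOpenTapioca); it works on plain `(text, qid)` tuples such as `[(t.text, t.ent_kb_id_) for t in doc]`:

```python
from itertools import groupby


def merge_linked_tokens(pairs):
    """Merge consecutive tokens sharing a non-empty QID into one mention.

    Hypothetical helper, not part of spaCyOpenTapioca.
    Caveats: two directly adjacent mentions of the same entity are merged
    into one, and tokens are joined with a single space regardless of the
    whitespace in the original text.
    """
    mentions = []
    for qid, group in groupby(pairs, key=lambda pair: pair[1]):
        if qid:  # skip runs of unlinked tokens (empty QID)
            text = " ".join(token_text for token_text, _ in group)
            mentions.append((text, qid))
    return mentions


# Token/QID pairs copied from the output above.
pairs = [('Christian', 'Q1079331'), ('Drosten', 'Q1079331'), ('works', ''),
         ('in', ''), ('Germany', 'Q183'), ('.', '')]
mentions = merge_linked_tokens(pairs)
print(mentions)
```

In practice `doc.ents` already provides these spans; the sketch only illustrates that the token-level QIDs carry the same information.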

The raw response of the OpenTapioca API can be accessed on the doc and span objects:

raw_annotations1 = doc._.annotations
raw_annotations2 = [span._.annotations for span in doc.ents]

Partial metadata for the response returned by the OpenTapioca API is available in:

doc._.metadata

The available span extensions are:

span._.annotations
span._.description
span._.aliases
span._.rank
span._.score
span._.types
span._.label
span._.extra_aliases
span._.nb_sitelinks
span._.nb_statements

Note that spaCyOpenTapioca applies a small amount of processing to the entities that appear in doc.ents. All entities returned by OpenTapioca can be found in doc.spans['all_entities_opentapioca'].
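
To see which spans are present in `doc.spans['all_entities_opentapioca']` but not in `doc.ents`, the two collections can be compared. This is a sketch only: `spans_missing_from_ents` is a hypothetical helper, and the demonstration uses stand-in objects with made-up values so that it runs without network access.

```python
from types import SimpleNamespace


def spans_missing_from_ents(doc, key="all_entities_opentapioca"):
    """Return spans stored under doc.spans[key] that are absent from doc.ents.

    Hypothetical helper, not part of spaCyOpenTapioca. Spans are compared by
    character offsets and QID.
    """
    kept = {(s.start_char, s.end_char, s.kb_id_) for s in doc.ents}
    return [s for s in doc.spans[key]
            if (s.start_char, s.end_char, s.kb_id_) not in kept]


# Stand-in objects so the sketch runs offline. The second candidate and its
# QID 'Q_PLACEHOLDER' are invented for illustration, not real API output.
kept_span = SimpleNamespace(start_char=0, end_char=17, kb_id_="Q1079331")
extra_span = SimpleNamespace(start_char=0, end_char=9, kb_id_="Q_PLACEHOLDER")
fake_doc = SimpleNamespace(
    ents=[kept_span],
    spans={"all_entities_opentapioca": [kept_span, extra_span]},
)
missing = spans_missing_from_ents(fake_doc)
print([(s.start_char, s.end_char, s.kb_id_) for s in missing])
```

With a real pipeline, pass the `doc` returned by `nlp(...)` instead of `fake_doc`.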

Batching

Passing a list of strings to nlp.pipe sends batched asynchronous requests to the OpenTapioca API:

import spacy
nlp = spacy.blank("en")
nlp.add_pipe('opentapioca')
docs = nlp.pipe(
    [
        "Christian Drosten works in Germany.",
        "Momofuku Ando was born in Japan.",
    ]
)
for doc in docs:
    for span in doc.ents:
        print((span.text, span.kb_id_, span.label_, span._.description, span._.score))
('Christian Drosten', 'Q1079331', 'PERSON', 'German virologist and university teacher', 3.6533377082098895)
('Germany', 'Q183', 'LOC', 'sovereign state in Central Europe', 2.1099332471902863)
('Momofuku Ando', 'Q317858', 'PERSON', 'Taiwanese-Japanese businessman', 3.6012208212234302)
('Japan', 'Q17', 'LOC', 'sovereign state in East Asia, situated on an archipelago of five main and over 6,800 smaller islands', 2.349944834167907)

Local OpenTapioca

If OpenTapioca is deployed locally, specify the URL of the new OpenTapioca API in the config:

import spacy
nlp = spacy.blank("en")
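# OpenTapiocaAPI must be defined beforehand as a string holding the URL of your local OpenTapioca API.
nlp.add_pipe('opentapioca', config={"url": OpenTapiocaAPI})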
doc = nlp("Christian Drosten works in Germany.")
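
One way to switch between the default API and a local deployment without editing code is to read the URL from the environment. The sketch below is an assumption-laden illustration: the helper `opentapioca_config` and the variable name `OPENTAPIOCA_URL` are hypothetical, and the URL in the demonstration is a placeholder rather than a documented endpoint.

```python
import os


def opentapioca_config(env=os.environ, var="OPENTAPIOCA_URL"):
    """Build the config dict for nlp.add_pipe('opentapioca', config=...).

    Hypothetical helper. The environment variable name OPENTAPIOCA_URL is an
    assumption made for this sketch; spaCyOpenTapioca does not read it itself.
    Returns {"url": <value>} when the variable is set and non-empty,
    otherwise an empty dict so the component keeps its default URL.
    """
    url = env.get(var, "").strip()
    return {"url": url} if url else {}


# Demonstration with plain dicts instead of the real environment.
# "http://localhost:8000/api" is a placeholder URL.
local_config = opentapioca_config({"OPENTAPIOCA_URL": "http://localhost:8000/api"})
default_config = opentapioca_config({})
print(local_config, default_config)
```

The result can then be passed on as `nlp.add_pipe('opentapioca', config=opentapioca_config())`.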

Visualization

NEL visualization was added to spaCy via pull request 9199 for issue 9129. It is supported by spaCy >= 3.1.4.

Use the manual option in displaCy:

import spacy
nlp = spacy.blank("en")
nlp.add_pipe('opentapioca')
doc = nlp("Christian Drosten works\n in Charité, Germany.")
params = {"text": doc.text,
          "ents": [{"start": ent.start_char,
                    "end": ent.end_char,
                    "label": ent.label_,
                    "kb_id": ent.kb_id_,
                    "kb_url": "https://www.wikidata.org/entity/" + ent.kb_id_}
                   for ent in doc.ents],
          "title": None}
spacy.displacy.serve(params, style="ent", manual=True)

The visualizer is serving on http://0.0.0.0:5000

In a Jupyter notebook, replace spacy.displacy.serve with spacy.displacy.render.
