SpaCy DBpedia Spotlight wrapper
Spacy DBpedia Spotlight
This package acts as an Entity Recogniser and Linker using DBpedia Spotlight, annotating spaCy Spans and adding them to the entity annotations.
It can be added to an existing spaCy Language object, or used to create a new one from an empty pipeline.
The results are stored in doc.ents; when they conflict with existing entities, the overwrite_ents parameter decides whether the existing entities are overwritten or kept.
The spans produced have the following properties:
- span.label_ = 'DBPEDIA_ENT'
- span.kb_id_ containing the URI of the linked entity
- span._.dbpedia_raw_result containing the raw JSON for the entity from DBpedia Spotlight (@URI, @support, @types, @surfaceForm, @offset, @similarityScore, @percentageOfSecondRank)
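As a rough illustration of working with the raw result, the sketch below converts one raw entity dict into typed Python values. The helper name parse_raw_result is hypothetical and not part of this package. The sketch also assumes that Spotlight returns the numeric fields as strings and @types as a comma-separated string; the sample values are made up for the example.

```python
# Hypothetical helper, not part of spacy-dbpedia-spotlight.
def parse_raw_result(raw):
    """Convert a raw Spotlight entity dict into typed Python values."""
    types = raw.get("@types", "")
    return {
        "uri": raw["@URI"],
        "surface_form": raw["@surfaceForm"],
        "offset": int(raw["@offset"]),
        "support": int(raw["@support"]),
        "similarity": float(raw["@similarityScore"]),
        "second_rank_ratio": float(raw["@percentageOfSecondRank"]),
        # Assumption: "@types" is a comma-separated string, possibly empty.
        "types": [t for t in types.split(",") if t],
    }


# Illustrative sample; the values are invented for this sketch.
sample = {
    "@URI": "http://dbpedia.org/resource/Boris_Johnson",
    "@support": "1000",
    "@types": "Schema:Person,DBpedia:Person",
    "@surfaceForm": "Boris Johnson",
    "@offset": "32",
    "@similarityScore": "0.99",
    "@percentageOfSecondRank": "0.0",
}
parsed = parse_raw_result(sample)
print(parsed["uri"], parsed["similarity"], parsed["types"])
```

With a real Doc, the same helper could be applied to ent._.dbpedia_raw_result for each ent in doc.ents, for example to keep only entities above a chosen similarity threshold.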
Usage
Installation
This package works with spaCy v3.
With pip: pip install spacy-dbpedia-spotlight
From GitHub (after clone): pip install .
Instantiating the pipeline component
With a blank new language
import spacy_dbpedia_spotlight
# a new blank model will be created, with the language code provided in the parameter
nlp = spacy_dbpedia_spotlight.create('en')
# in this case, the pipeline will only contain the EntityLinker
print(nlp.pipe_names)
# ['dbpedia_spotlight']
On top of an existing nlp object (added as last pipeline stage by default)
import spacy
# this is any existing model
nlp = spacy.load('en_core_web_lg')
# add the pipeline stage
nlp.add_pipe('dbpedia_spotlight')
# see the pipeline, the added stage is at the end
print(nlp.pipe_names)
# ['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer', 'dbpedia_spotlight']
The pipeline stage can be added at any point of an existing pipeline (using the arguments before, after, first or last)
import spacy
# this is any existing model
nlp = spacy.load('en_core_web_lg')
# add the pipeline stage
nlp.add_pipe('dbpedia_spotlight', first=True)
# see the pipeline, the added stage is at the beginning
print(nlp.pipe_names)
# ['dbpedia_spotlight', 'tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
Configuration parameters
This component can be used with the following parameters:
- language_code: to explicitly use a specific DBpedia language; by default, the value from nlp.meta['lang'] is used
- dbpedia_rest_endpoint: to use something different from http://api.dbpedia-spotlight.org/{LANGUAGE_CODE}, for example when using a local instance of DBpedia Spotlight. Don't set it if the default location is OK
- overwrite_ents: to control how the overwriting of doc.ents is performed, because other components may have already written there (e.g., the en_core_web_lg model has a ner pipeline component which already sets some entities). The component tries to add the new entities from DBpedia, which succeeds if they do not overlap with the existing ones in terms of tokens. The cases are the following:
  - no tokens overlap between the pre-existing doc.ents and the new entities: in this case doc.ents will contain both the previous entities and the new entities
  - some tokens overlap and overwrite_ents=True: the previous value of doc.ents is saved in doc.spans['ents_original'] and only the DBpedia entities will be saved in doc.ents
  - some tokens overlap and overwrite_ents=False: the previous value of doc.ents is left untouched, and the DBpedia entities can be found in doc.spans['dbpedia_ents']
The configuration dict needs to be passed when instantiating the pipeline component
import spacy
nlp = spacy.load('en_core_web_lg')
# instantiate Spanish EntityLinker on the English model
nlp.add_pipe('dbpedia_spotlight', config={'language_code': 'es'})
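The three overwrite_ents cases described above can be sketched in plain Python. This is a simplified model of the documented behaviour, not the package's actual implementation: entities are represented as (start_token, end_token) half-open ranges instead of spaCy Spans, and the function name merge_entities is hypothetical.

```python
# Simplified model of the documented overwrite_ents behaviour.
# Hypothetical helper: entities are (start_token, end_token) half-open ranges.
def merge_entities(existing, new, overwrite_ents):
    """Return (ents, spans) following the three documented cases."""

    def overlaps(a, b):
        # Two half-open ranges share at least one token.
        return a[0] < b[1] and b[0] < a[1]

    conflict = any(overlaps(e, n) for e in existing for n in new)
    spans = {}
    if not conflict:
        # Case 1: no overlap, keep both sets of entities.
        ents = sorted(existing + new)
    elif overwrite_ents:
        # Case 2: overlap and overwrite, back up the old entities.
        spans["ents_original"] = list(existing)
        ents = list(new)
    else:
        # Case 3: overlap and no overwrite, park the new entities.
        spans["dbpedia_ents"] = list(new)
        ents = list(existing)
    return ents, spans


print(merge_entities([(0, 1)], [(2, 3)], overwrite_ents=True))
print(merge_entities([(0, 2)], [(1, 3)], overwrite_ents=True))
print(merge_entities([(0, 2)], [(1, 3)], overwrite_ents=False))
```

Note that a single overlapping pair is enough to switch from the first case to one of the other two, which is why the original entities should be looked up in doc.spans when overwrite_ents=True.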
Using the model
After having instantiated the component, you can use the spaCy API as usual
doc = nlp('The president of USA is calling Boris Johnson to decide what to do about coronavirus')
print("Entities", [(ent.text, ent.label_, ent.kb_id_) for ent in doc.ents])
Output example:
Entities [('USA', 'DBPEDIA_ENT', 'http://dbpedia.org/resource/United_States'), ('Boris Johnson', 'DBPEDIA_ENT', 'http://dbpedia.org/resource/Boris_Johnson'), ('coronavirus', 'DBPEDIA_ENT', 'http://dbpedia.org/resource/Coronavirus')]
Common issues
DBpedia refuses to answer huge quantities of requests
After a few requests to DBpedia Spotlight, the public web service starts replying with HTTP error codes.
The solution is to use a local DBpedia Spotlight instance, which can be deployed with or without Docker as described below.
Deploy with Docker
# pull the official image
docker pull dbpedia/dbpedia-spotlight
# create a volume for persistently saving the language models
docker volume create spotlight-models
# start the container (here assuming we want the en model)
docker run -ti \
--restart unless-stopped \
--name dbpedia-spotlight.en \
--mount source=spotlight-models,target=/opt/spotlight \
-p 2222:80 \
dbpedia/dbpedia-spotlight \
spotlight.sh en
Without Docker
# download main jar
wget https://sourceforge.net/projects/dbpedia-spotlight/files/spotlight/dbpedia-spotlight-1.0.0.jar
# download latest model (assuming en model)
wget -O en.tar.gz http://downloads.dbpedia.org/repo/dbpedia/spotlight/spotlight-model/2020.11.18/spotlight-model_lang%3den.tar.gz
# extract model
tar xzf en.tar.gz
# run server
java -jar dbpedia-spotlight-1.0.0.jar en http://localhost:2222/rest
Use the local server
First of all, make sure that the local server is working.
curl http://localhost:2222/rest/annotate \
--data-urlencode "text=President Obama called Wednesday on Congress to extend a tax break for students included in last year's economic stimulus package, arguing that the policy provides more generous assistance." \
--data "confidence=0.35" \
-H "Accept: text/turtle"
Then in Python you can configure the endpoint in the following way
import spacy
nlp = spacy.load('en_core_web_lg')
# Use your endpoint: don't put any trailing slashes, and don't include the /annotate path
nlp.add_pipe('dbpedia_spotlight', config={'dbpedia_rest_endpoint': 'http://localhost:2222/rest'})
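To make the endpoint rule concrete (no trailing slash, no /annotate path), the following sketch normalizes an endpoint string and builds the same kind of request as the curl command above, using only the standard library. The helper names normalize_endpoint and build_annotate_request are hypothetical, and the JSON Accept header is an assumption about what the server offers; this is not how the package itself issues requests.

```python
from urllib.parse import urlencode
from urllib.request import Request


# Hypothetical helpers, not part of spacy-dbpedia-spotlight.
def normalize_endpoint(endpoint):
    """Strip trailing slashes and a trailing /annotate path, if present."""
    endpoint = endpoint.rstrip("/")
    if endpoint.endswith("/annotate"):
        endpoint = endpoint[: -len("/annotate")]
    return endpoint


def build_annotate_request(endpoint, text, confidence=0.35):
    """Build (but do not send) a POST request to the /annotate path."""
    url = normalize_endpoint(endpoint) + "/annotate"
    body = urlencode({"text": text, "confidence": confidence}).encode("utf-8")
    # Assumption: the server can answer with JSON for this Accept header.
    return Request(url, data=body, headers={"Accept": "application/json"})


req = build_annotate_request(
    "http://localhost:2222/rest/",
    "President Obama called Wednesday on Congress",
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req) with the local server running.
```

The value returned by normalize_endpoint is the form expected by the dbpedia_rest_endpoint parameter.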
Source Distribution
Hashes for spacy_dbpedia_spotlight-0.2.0.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 57ab226b96b0d3d915d01755322717089fffcffdce76bd59bd6cd2758a456588
MD5 | ef0504a2d12eaa339f0168a75d784dba
BLAKE2b-256 | 1e28de97571242bb2cc57627cddd4d95efc8ab0f824e47c9d57c7dee56c95760