spaCy pipeline component for Named Entity Recognition based on dictionaries.

spacy-lookup: Named Entity Recognition based on dictionaries
************************************************************
`spaCy v2.0 <https://spacy.io/usage/v2>`_ extension and pipeline component
for adding Named Entity metadata to ``Doc`` objects. It detects Named Entities
using dictionaries. The extension sets the custom ``Doc``,
``Token`` and ``Span`` attributes ``._.is_entity``, ``._.entity_type``,
``._.has_entities`` and ``._.entities``.

Named Entities are matched using the Python module ``flashtext``, which
looks them up in the data provided by the dictionaries.

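
To illustrate the idea of dictionary-based matching, here is a minimal sketch
in plain Python. It is not how ``flashtext`` is implemented (the sketch uses a
regular expression instead), and the helper name ``find_terms`` is
hypothetical. It returns each matched term together with its character offsets
in the text.

.. code:: python

    import re

    def find_terms(text, terms):
        """Return (term, start, end) for each dictionary term found in text."""
        if not terms:
            return []
        # Try longer terms first so that "java platform" wins over "java".
        ordered = sorted(terms, key=len, reverse=True)
        pattern = re.compile(
            r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
            re.IGNORECASE,
        )
        return [(m.group(0).lower(), m.start(), m.end())
                for m in pattern.finditer(text)]

    text = "I am a product manager for a java and python."
    matches = find_terms(text, ["python", "product manager", "java platform"])
    print(matches)

Note that ``java`` on its own is not reported, because only the full term
``java platform`` is in the dictionary.
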
Installation
===============
``spacy-lookup`` requires ``spacy`` v2.0.16 or higher.

.. code:: bash

    pip install spacy-lookup

Usage
=====
First, you need to download a language model.

.. code:: bash

    python -m spacy download en

Import the component and initialise it with the shared ``nlp`` object (i.e. an
instance of ``Language``), which is used to initialise ``flashtext``
with the shared vocab, and create the match patterns. Then add the component
anywhere in your pipeline.

.. code:: python

    import spacy
    from spacy_lookup import Entity

    nlp = spacy.load('en')
    # Register the dictionary terms to look for
    entity = Entity(keywords_list=['python', 'product manager', 'java platform'])
    nlp.add_pipe(entity, last=True)

    doc = nlp(u"I am a product manager for a java and python.")
    assert doc._.has_entities
    assert not doc[0]._.is_entity
    # Matched spans are merged, so "product manager" is the single token doc[3]
    assert doc[3]._.entity_desc == 'product manager'
    assert doc[3]._.is_entity

    print([(token.text, token._.canonical) for token in doc if token._.is_entity])

``spacy-lookup`` only cares about the token text, so you can use it on a blank
``Language`` instance (it should work for all
`available languages <https://spacy.io/usage/models#languages>`_!), or in
a pipeline with a loaded model. If you're loading a model and your pipeline
includes a tagger, parser and entity recognizer, make sure to add the entity
component with ``last=True``, so the spans are merged at the end of the pipeline.

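
The effect of merging can be pictured with a small sketch in plain Python,
independent of spaCy. It assumes a simple word-and-punctuation tokenizer and
match spans given as character offsets that line up with token boundaries; the
helper names ``tokenize`` and ``merge_spans`` are hypothetical and are not part
of ``spacy-lookup``.

.. code:: python

    import re

    def tokenize(text):
        """Split text into (token, start, end) triples: words and punctuation."""
        return [(m.group(0), m.start(), m.end())
                for m in re.finditer(r"\w+|[^\w\s]", text)]

    def merge_spans(text, tokens, spans):
        """Merge each run of tokens covered by a (start, end) character span."""
        span_ends = {start: end for start, end in spans}
        merged = []
        i = 0
        while i < len(tokens):
            tok, start, _ = tokens[i]
            span_end = span_ends.get(start)
            if span_end is None:
                # No match begins here: keep the token unchanged.
                merged.append(tok)
                i += 1
                continue
            # Advance past every following token that still lies inside the span.
            j = i + 1
            while j < len(tokens) and tokens[j][2] <= span_end:
                j += 1
            merged.append(text[start:tokens[j - 1][2]])
            i = j
        return merged

    text = "I am a product manager for a java and python."
    spans = [(7, 22), (38, 44)]  # character offsets of the matched terms
    merged = merge_spans(text, tokenize(text), spans)
    print(merged[3])

After merging, a multi-word term occupies a single token position, so index
``3`` refers to the whole term ``product manager`` rather than to ``product``.
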
Available attributes
--------------------
The extension sets attributes on the ``Doc``, ``Span`` and ``Token``. You can
change the attribute names on initialisation of the extension. For more details
on custom components and attributes, see the
`processing pipelines documentation <https://spacy.io/usage/processing-pipelines#custom-components>`_.

======================= ======= ===
``Token._.is_entity``   bool    Whether the token is an entity.
``Token._.entity_type`` unicode A human-readable description of the entity.
``Doc._.has_entities``  bool    Whether the document contains entities.
``Doc._.entities``      list    ``(entity, index, description)`` tuples of the document's entities.
``Span._.has_entities`` bool    Whether the span contains entities.
``Span._.entities``     list    ``(entity, index, description)`` tuples of the span's entities.
======================= ======= ===

Settings
--------
On initialisation of ``Entity``, you can define the following settings:

================= ============ ===
``nlp``           ``Language`` The shared ``nlp`` object. Used to initialise the matcher with the shared ``Vocab``, and create ``Doc`` match patterns.
``attrs``         tuple        Attributes to set on the ``._`` property. Defaults to ``('has_entities', 'is_entity', 'entity_type', 'entities')``.
``keywords_list`` list         Optional list of terms to look for.
``keywords_dict`` dict         Optional dictionary of terms to look for.
``keywords_file`` string       Optional name of a file containing the terms to look for.
================= ============ ===


.. code:: python

    entity = Entity(nlp, keywords_list=['python', 'java platform'], label='ACME')
    nlp.add_pipe(entity)

    doc = nlp(u"I am a product manager for a java platform and python.")
    # "java platform" starts at token index 7; "product" (index 3) is not a keyword here
    assert doc[7]._.is_entity

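
The three ``keywords_*`` settings can be thought of as different ways of
filling one lookup table. The sketch below is a plain-Python illustration under
stated assumptions, not the component's actual code: it assumes that the
dictionary maps a canonical name to a list of terms and that the file holds one
term per line, optionally written as ``term=>canonical name`` (the conventions
used by ``flashtext``). The helper ``build_lookup`` and the sample file
contents are hypothetical.

.. code:: python

    import os
    import tempfile

    def build_lookup(keywords_list=None, keywords_dict=None, keywords_file=None):
        """Return a dict mapping each lowercased term to its canonical name."""
        lookup = {}
        # A plain list: every term is its own canonical name.
        for term in keywords_list or []:
            lookup[term.lower()] = term
        # A dict: canonical name -> list of terms (assumed layout).
        for canonical, terms in (keywords_dict or {}).items():
            for term in terms:
                lookup[term.lower()] = canonical
        # A file: one term per line, optionally "term=>canonical name" (assumed layout).
        if keywords_file:
            with open(keywords_file, encoding="utf-8") as handle:
                for line in handle:
                    line = line.strip()
                    if not line:
                        continue
                    term, sep, canonical = line.partition("=>")
                    term = term.strip()
                    lookup[term.lower()] = canonical.strip() if sep else term
        return lookup

    # Write a small, made-up keywords file for the demonstration.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                     encoding="utf-8") as tmp:
        tmp.write("java platform\njvm=>java platform\n")

    lookup = build_lookup(
        keywords_list=["python"],
        keywords_dict={"product management": ["product manager", "PM"]},
        keywords_file=tmp.name,
    )
    os.remove(tmp.name)
    print(lookup)

With a table like this, several surface forms (``product manager``, ``PM``)
can resolve to the same canonical name.
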