spacy-lookup: Named Entity Recognition based on dictionaries
************************************************************
`spaCy v2.0 <https://spacy.io/usage/v2>`_ extension and pipeline component
for adding Named Entity metadata to ``Doc`` objects. It detects Named Entities
using dictionaries and sets the custom ``Doc``, ``Token`` and ``Span``
attributes ``._.is_entity``, ``._.entity_type``, ``._.has_entities`` and
``._.entities``. Named Entities are matched with the Python module
``flashtext`` and looked up in the data provided by the dictionaries.

Installation
============

``spacy-lookup`` requires ``spacy`` v2.0.16 or higher.

.. code:: bash

    pip install spacy-lookup

Usage
=====

First, you need to download a language model.

.. code:: bash

    python -m spacy download en

Import the component and initialise it with the keywords you want to detect
(and, optionally, the shared ``nlp`` object, i.e. an instance of ``Language``).
Then add the component anywhere in your pipeline.

.. code:: python

    import spacy
    from spacy_lookup import Entity

    nlp = spacy.load('en')
    entity = Entity(keywords_list=['python', 'product manager', 'java platform'])
    nlp.add_pipe(entity, last=True)

    doc = nlp(u"I am a product manager for a java and python.")
    assert doc._.has_entities == True
    assert doc[0]._.is_entity == False
    assert doc[3]._.entity_desc == 'product manager'
    assert doc[3]._.is_entity == True

    print([(token.text, token._.canonical) for token in doc if token._.is_entity])

``spacy-lookup`` only cares about the token text, so you can use it on a blank
``Language`` instance (it should work for all
`available languages <https://spacy.io/usage/models#languages>`_!), or in
a pipeline with a loaded model. If you're loading a model and your pipeline
includes a tagger, parser and entity recognizer, make sure to add the entity
component with ``last=True``, so that it runs at the end of the pipeline.

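For example, here is a minimal sketch of running the component on a blank
English pipeline (no model download needed); the keyword list is only
illustrative:

.. code:: python

    import spacy
    from spacy_lookup import Entity

    # A blank pipeline only contains the tokenizer, which is all
    # spacy-lookup needs, since it matches on the raw token text.
    nlp = spacy.blank('en')
    entity = Entity(keywords_list=['java platform', 'python'])
    nlp.add_pipe(entity, last=True)

    doc = nlp(u"We deploy our services on the java platform.")
    assert doc._.has_entities
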
Available attributes
--------------------

The extension sets attributes on the ``Doc``, ``Span`` and ``Token``. You can
change the attribute names on initialisation of the extension. For more details
on custom components and attributes, see the
`processing pipelines documentation <https://spacy.io/usage/processing-pipelines#custom-components>`_.

======================= ======= ====================================================================
``Token._.is_entity``   bool    Whether the token is an entity.
``Token._.entity_type`` unicode A human-readable description of the entity.
``Doc._.has_entities``  bool    Whether the document contains entities.
``Doc._.entities``      list    ``(entity, index, description)`` tuples of the document's entities.
``Span._.has_entities`` bool    Whether the span contains entities.
``Span._.entities``     list    ``(entity, index, description)`` tuples of the span's entities.
======================= ======= ====================================================================

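For example, a short sketch of reading these attributes; the tuple fields
follow the ``(entity, index, description)`` layout described above, and the
keyword list is only illustrative:

.. code:: python

    import spacy
    from spacy_lookup import Entity

    nlp = spacy.blank('en')
    nlp.add_pipe(Entity(keywords_list=['python', 'java platform']), last=True)

    doc = nlp(u"I am a product manager for a java platform and python.")

    # Document-level view: (entity, index, description) tuples.
    for entity, index, description in doc._.entities:
        print(entity, index, description)

    # Token-level view: per-token flag set by the component.
    print([token.text for token in doc if token._.is_entity])
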
Settings
--------

On initialisation of ``Entity``, you can define the following settings:

================= ============ ==================================================================================================================
``nlp``           ``Language`` The shared ``nlp`` object (an instance of ``Language``); optional.
``attrs``         tuple        Attributes to set on the ``._`` property. Defaults to ``('has_entities', 'is_entity', 'entity_type', 'entity')``.
``keywords_list`` list         Optional list of terms to look for.
``keywords_dict`` dict         Optional lookup dictionary with the terms to look for.
``keywords_file`` string       Optional path to a file with the terms to look for.
================= ============ ==================================================================================================================

.. code:: python

    entity = Entity(nlp, keywords_list=['python', 'java platform'], label='ACME')
    nlp.add_pipe(entity)

    doc = nlp(u"I am a product manager for a java platform and python.")
    # 'java' (token 7) is part of the matched 'java platform' keyword.
    assert doc[7]._.is_entity

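``keywords_dict`` and ``keywords_file`` are alternatives to ``keywords_list``.
Their exact format is not spelled out above, so the following is only a sketch
that assumes the ``flashtext`` conventions (a dictionary mapping a canonical
name to its surface forms, and a hypothetical plain-text keyword file):

.. code:: python

    import spacy
    from spacy_lookup import Entity

    nlp = spacy.blank('en')

    # Assumption: keywords_dict follows flashtext's {clean_name: [surface forms]}
    # convention; check the spacy-lookup source for your installed version.
    entity = Entity(keywords_dict={'Java': ['java', 'java platform']})

    # Alternatively, load the terms from a file (hypothetical path):
    # entity = Entity(keywords_file='keywords.txt')

    nlp.add_pipe(entity, last=True)
    doc = nlp(u"She ported the service to the java platform.")
    assert doc._.has_entities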