spaCy pipeline component for adding emoji metadata to Doc, Token and Span objects
spacymoji: emoji for spaCy
spaCy extension and pipeline component for adding emoji metadata to Doc
objects. Detects emoji consisting of one or more Unicode characters, and can
optionally merge multi-character emoji (combined pictures, emoji with skin tone
modifiers) into one token. Human-readable emoji descriptions are added as a
custom attribute, and an optional lookup table can be provided for your own
descriptions. The extension sets the custom Doc, Token and Span attributes
._.is_emoji, ._.emoji_desc, ._.has_emoji and ._.emoji. You can read more about
custom pipeline components and extension attributes in the spaCy documentation.
Emoji are matched using spaCy's PhraseMatcher, and looked up in the data table
provided by the emoji package.
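To see why some emoji need merging, it can help to look at the code points behind them. The snippet below is a small standard-library sketch and not part of spacymoji; the helper name `code_points` is made up for illustration. It shows that an emoji with a skin tone modifier is two code points, and that a combined picture can be several code points glued together by a zero width joiner.

```python
import unicodedata


def code_points(text):
    """Return (hex code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]


# Thumbs up followed by a dark skin tone modifier (renders as one icon).
thumbs_up_dark = "\U0001F44D\U0001F3FF"
# Man + zero width joiner + microphone (renders as one "man singer" icon).
man_singer = "\U0001F468\u200D\U0001F3A4"

for emoji in (thumbs_up_dark, man_singer):
    # len() counts code points, not visible icons.
    print(emoji, len(emoji), code_points(emoji))
```

Each of these strings renders as a single icon but spans more than one character, which is the case the optional merging is meant to handle.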
⏳ Installation
spacymoji requires spacy v3.0.0 or higher. For spaCy v2.x, install
spacymoji==2.0.0.
pip install spacymoji
☝️ Usage
Import the component and add it anywhere in your pipeline using the string name
of the "emoji" component factory:
import spacy
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("emoji", first=True)
doc = nlp("This is a test 😻 👍🏿")
assert doc._.has_emoji is True
assert doc[2:5]._.has_emoji is True
assert doc[0]._.is_emoji is False
assert doc[4]._.is_emoji is True
assert doc[5]._.emoji_desc == "thumbs up dark skin tone"
assert len(doc._.emoji) == 2
assert doc._.emoji[1] == ("👍🏿", 5, "thumbs up dark skin tone")
spacymoji only cares about the token text, so you can use it on a blank
Language instance (it should work for all available languages!), or in a
trained pipeline. If your pipeline includes a tagger, parser and entity
recognizer, make sure to add the emoji component with first=True, so the spans
are merged right after tokenization, and before the document is parsed. If your
text contains a lot of emoji, this might even give you a nice boost in parser
accuracy.
Available attributes
The extension sets attributes on the Doc, Span and Token. You can change the
attribute names (and other parameters of the Emoji component) by passing them
via the config parameter of the nlp.add_pipe(...) method. For more details on
custom components and attributes, see the processing pipelines documentation.
Attribute | Type | Description |
---|---|---|
Token._.is_emoji | bool | Whether the token is an emoji. |
Token._.emoji_desc | str | A human-readable description of the emoji. |
Doc._.has_emoji | bool | Whether the document contains emoji. |
Doc._.emoji | List[Tuple[str, int, str]] | (emoji, index, description) tuples of the document's emoji. |
Span._.has_emoji | bool | Whether the span contains emoji. |
Span._.emoji | List[Tuple[str, int, str]] | (emoji, index, description) tuples of the span's emoji. |
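The (emoji, index, description) tuples are plain Python data, so they are easy to post-process. The following is a minimal sketch of two ways to consume them. The helper names `count_descriptions` and `emoji_by_index` are hypothetical, and the sample list is hand-written to mimic the shape of Doc._.emoji (the description given for 😻 is an assumption, not output taken from the library).

```python
from collections import Counter
from typing import Dict, List, Tuple

# Same shape as the values of Doc._.emoji: (emoji, token index, description).
EmojiEntry = Tuple[str, int, str]


def count_descriptions(entries: List[EmojiEntry]) -> Counter:
    """Count how often each emoji description occurs."""
    return Counter(desc for _, _, desc in entries)


def emoji_by_index(entries: List[EmojiEntry]) -> Dict[int, str]:
    """Map each token index to the emoji found at that position."""
    return {index: emoji for emoji, index, _ in entries}


# Hand-written sample data standing in for doc._.emoji.
sample = [
    ("😻", 4, "smiling cat with heart-eyes"),  # description assumed for illustration
    ("👍🏿", 5, "thumbs up dark skin tone"),
]

print(count_descriptions(sample))
print(emoji_by_index(sample))
```

With a real pipeline, `sample` would be replaced by `doc._.emoji` or `span._.emoji`.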
Settings
You can configure the emoji factory by setting any of the following parameters
in the config dictionary:
Setting | Type | Description |
---|---|---|
attrs | Tuple[str, str, str, str] | Attributes to set on the ._ property. Defaults to ('has_emoji', 'is_emoji', 'emoji_desc', 'emoji'). |
pattern_id | str | ID of match pattern, defaults to 'EMOJI'. Can be changed to avoid ID conflicts. |
merge_spans | bool | Merge spans containing multi-character emoji, defaults to True. Will only merge combined emoji resulting in one icon, not sequences. |
lookup | Dict[str, str] | Optional lookup table that maps emoji strings to custom descriptions, e.g. translations or other annotations. |
import spacy
nlp = spacy.load("en_core_web_sm")  # fresh pipeline: component names must be unique
emoji_config = {"attrs": ("has_e", "is_e", "e_desc", "e"), "lookup": {"👨‍🎤": "David Bowie"}}
nlp.add_pipe("emoji", first=True, config=emoji_config)
doc = nlp("We can be 👨‍🎤 heroes")
assert doc[3]._.is_e
assert doc[3]._.e_desc == "David Bowie"
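One way to picture the lookup setting is as a table that is consulted before the default descriptions. The sketch below illustrates that idea in plain Python under that assumption; it is not spacymoji's actual code, and the function `describe` and both tables are invented for the example.

```python
from typing import Dict, Optional


def describe(
    emoji_text: str,
    default_table: Dict[str, str],
    lookup: Optional[Dict[str, str]] = None,
) -> Optional[str]:
    """Return a custom description if one exists, else the default, else None."""
    if lookup and emoji_text in lookup:
        return lookup[emoji_text]
    return default_table.get(emoji_text)


# Hypothetical default descriptions, standing in for the emoji package's data.
defaults = {"👍": "thumbs up", "😻": "smiling cat with heart-eyes"}
# Custom descriptions, e.g. translations or other annotations.
custom = {"👍": "Daumen hoch"}

print(describe("👍", defaults, custom))  # custom entry is used
print(describe("😻", defaults, custom))  # no custom entry, default is used
print(describe("x", defaults, custom))   # not in either table
```

Under this reading, the custom table only needs entries for the emoji whose descriptions you want to change.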
If you're training a pipeline, you can define the component config in your
config.cfg:
[nlp]
pipeline = ["emoji", "ner"]
# ...
[components.emoji]
factory = "emoji"
merge_spans = false