Accurately remove and replace emojis in text strings
demoji_lambda
Forked from https://github.com/bsolomon1124/demoji
Accurately find or remove emojis from a blob of text.
Basic Usage
demoji requires an initial data download from the Unicode Consortium's emoji code repository. On first use of the package, call download_codes():
>>> import demoji
>>> demoji.download_codes()
Downloading emoji data ...
... OK (Got response in 0.14 seconds)
Writing emoji data to /Users/brad/.demoji/codes.json ...
... OK
This will store the Unicode hex-notated symbols at ~/.demoji/codes.json for future use.
demoji exports several text-related functions for find-and-replace functionality with emojis:
>>> tweet = """\
... #startspreadingthenews yankees win great start by 🎅🏾 going 5strong innings with 5k’s🔥 🐂
... solo homerun 🌋🌋 with 2 solo homeruns and👹 3run homerun… 🤡 🚣🏼 👨🏽⚖️ with rbi’s … 🔥🔥
... 🇲🇽 and 🇳🇮 to close the game🔥🔥!!!….
... WHAT A GAME!!..
... """
>>> demoji.findall(tweet)
{
"🔥": "fire",
"🌋": "volcano",
"👨🏽\u200d⚖️": "man judge: medium skin tone",
"🎅🏾": "Santa Claus: medium-dark skin tone",
"🇲🇽": "flag: Mexico",
"👹": "ogre",
"🤡": "clown face",
"🇳🇮": "flag: Nicaragua",
"🚣🏼": "person rowing boat: medium-light skin tone",
"🐂": "ox",
}
See below for function API.
demoji requires a download rather than coming pre-packaged with Unicode emoji data because the emoji list itself is frequently updated and changed. You can refresh the local cache at any time by calling demoji.download_codes() again.
To check when the codes were last downloaded, use the last_downloaded_timestamp() helper:
>>> demoji.last_downloaded_timestamp()
datetime.datetime(2019, 2, 9, 7, 42, 24, 433776, tzinfo=<demoji.UTC object at 0x101b9ecf8>)
The result will be None if codes have not previously been downloaded.
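One way to decide when a refresh is due is to compare that timestamp against a maximum age. The following is a minimal sketch, not part of demoji: the helper name needs_refresh and the 30-day default are assumptions chosen for illustration, and the timestamp is assumed to be timezone-aware, as in the output shown above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def needs_refresh(last_download: Optional[datetime],
                  max_age: timedelta = timedelta(days=30),
                  now: Optional[datetime] = None) -> bool:
    """Hypothetical helper: True if codes are missing or older than max_age."""
    if last_download is None:
        # None means the codes have never been downloaded.
        return True
    if now is None:
        now = datetime.now(timezone.utc)
    # Both datetimes must be timezone-aware for this subtraction to work.
    return now - last_download > max_age
```

With demoji installed, this could be used as `if needs_refresh(demoji.last_downloaded_timestamp()): demoji.download_codes()`.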
Reference
Note: Text refers to typing.Text, an alias for str in Python 3 or unicode in Python 2.
download_codes() -> None
Download emoji data to ~/.demoji/codes.json. Required at first module usage, and can be used to periodically update data.
findall(string: Text) -> Dict[Text, Text]
Find emojis within string. Return a mapping of {emoji: description}.
findall_list(string: Text, desc: bool = True) -> List[Text]
Find emojis within string. Return a list (with possible duplicates).
If desc is True, the list contains description codes. If desc is False, the list contains emojis.
replace(string: Text, repl: Text = "") -> Text
Replace emojis in string with repl.
replace_with_desc(string: Text, sep: Text = ":") -> Text
Replace emojis in string with their description codes. The codes are surrounded by sep.
last_downloaded_timestamp() -> Optional[datetime.datetime]
Show the timestamp of last download from download_codes().
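To make the return shapes of findall_list(), replace(), and replace_with_desc() concrete, here is a small stand-in that follows the behavior documented above. It is a sketch only and does not call demoji: the two-entry CODES table and the sketch_* function names are hypothetical, and the real package works from the full downloaded data set.

```python
import re
from typing import List

# Hypothetical two-entry table standing in for the downloaded codes.
CODES = {"\U0001f525": "fire", "\U0001f30b": "volcano"}
PATTERN = re.compile("|".join(re.escape(code) for code in CODES))


def sketch_findall_list(string: str, desc: bool = True) -> List[str]:
    """Return matches in order of appearance, duplicates included."""
    found = PATTERN.findall(string)
    return [CODES[emoji] for emoji in found] if desc else found


def sketch_replace(string: str, repl: str = "") -> str:
    """Replace every match with repl (a callable avoids backslash handling)."""
    return PATTERN.sub(lambda match: repl, string)


def sketch_replace_with_desc(string: str, sep: str = ":") -> str:
    """Replace every match with its description, wrapped in sep."""
    return PATTERN.sub(lambda match: sep + CODES[match.group(0)] + sep, string)


text = "great start \U0001f525\U0001f525 \U0001f30b"
print(sketch_findall_list(text))       # ['fire', 'fire', 'volcano']
print(sketch_replace_with_desc(text))  # great start :fire::fire: :volcano:
```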
Footnote: Emoji Sequences
Numerous emojis that look like single Unicode characters are actually multi-character sequences. Examples:
- The keycap 2️⃣ is actually 3 characters, U+0032 (the ASCII digit 2), U+FE0F (variation selector), and U+20E3 (combining enclosing keycap).
- The flag of Scotland has 7 component characters,
b'\\U0001f3f4\\U000e0067\\U000e0062\\U000e0073\\U000e0063\\U000e0074\\U000e007f'
in full escaped notation.
(You can see any of these through s.encode("unicode-escape").)
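The counts above can be checked directly in Python 3. This sketch builds both sequences from explicit escapes so that the individual code points are visible; the helper name code_points is an illustrative assumption and is not part of demoji.

```python
from typing import List

# Written with explicit escapes so each code point is visible in the source.
KEYCAP_TWO = "\u0032\ufe0f\u20e3"
SCOTLAND_FLAG = (
    "\U0001f3f4\U000e0067\U000e0062\U000e0073\U000e0063\U000e0074\U000e007f"
)


def code_points(s: str) -> List[str]:
    """Hypothetical helper: list each character of s in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in s]


print(len(KEYCAP_TWO), code_points(KEYCAP_TWO))  # 3 ['U+0032', 'U+FE0F', 'U+20E3']
print(len(SCOTLAND_FLAG))                        # 7
print(KEYCAP_TWO.encode("unicode-escape"))       # b'2\\ufe0f\\u20e3'
```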
demoji is careful to handle this and should find the full sequences rather than their incomplete subcomponents.
It does this by sorting emoji codes by length, longest first, and then compiling a concatenated regular expression that will greedily search for longer emojis first, falling back to shorter ones if not found. This is not by any means a super-optimized way of searching as it has O(N^2) properties, but the focus is on accuracy and completeness.
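The longest-first idea can be sketched in a few lines. This is an illustration under stated assumptions, not demoji's actual source: the three-entry CODES table and the function names are hypothetical. It relies on the fact that Python's re module tries the alternatives of a `|` pattern from left to right, so placing longer sequences first lets them win over their own prefixes.

```python
import re
from typing import Dict, Iterable, Pattern

# Hypothetical three-entry table; the downloaded data set is far larger.
CODES = {
    "\U0001f64b": "person raising hand",
    "\U0001f64b\u200d\u2642\ufe0f": "man raising hand",
    "\U0001f64b\u200d\u2640\ufe0f": "woman raising hand",
}


def compile_emoji_pattern(codes: Iterable[str]) -> Pattern[str]:
    """Join escaped codes into one alternation, longest sequences first."""
    ordered = sorted(codes, key=len, reverse=True)
    return re.compile("|".join(re.escape(code) for code in ordered))


def find_emojis(text: str, codes: Dict[str, str] = CODES) -> Dict[str, str]:
    """Return {emoji: description} for every sequence found in text."""
    pattern = compile_emoji_pattern(codes)
    return {match: codes[match] for match in pattern.findall(text)}


seq = "\U0001f64b, \U0001f64b\u200d\u2642\ufe0f, and \U0001f64b\u200d\u2640\ufe0f"
print(len(find_emojis(seq)))  # 3
```

If the codes were sorted shortest first instead, the base emoji would match at every position and the two longer sequences would never be reported.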
>>> from pprint import pprint
>>> seq = """\
... I bet you didn't know that 🙋, 🙋♂️, and 🙋♀️ are three different emojis.
... """
>>> pprint(seq.encode('unicode-escape')) # Python 3
(b"I bet you didn't know that \\U0001f64b, \\U0001f64b\\u200d\\u2642\\ufe0f,"
b' and \\U0001f64b\\u200d\\u2640\\ufe0f are three different emojis.\\n')
Changelog
0.4.0
- Update emoji source list to version 13.1. (See 5090eb5.)
- Formally support Python 3.9. (See 6e9c34c.)
- Bugfix: ensure that demoji.last_downloaded_timestamp() returns correct UTC time. (See 6c8ad15.)
0.3.0
- Feature: add findall_list() and replace_with_desc() functions. (See 7cea333.)
- Modernize setup config to use setup.cfg. (See 8f141e7.)
0.2.1
- Tox: formally add Python 3.8 tests.
0.2.0
- Windows: use the colorama package to support printing ANSI escape sequences on Windows; this introduces colorama as a dependency. (See cd343c1.)
- Setup: Fix a bug in setup.py that would require dependencies to be installed prior to installation of demoji in order to find the __version__. (See d5f429c.)
- Python 2 + Windows support: use io.open(..., encoding='utf-8') consistently in setup.py. (See 1efec5d.)
- Distribution: use a universal wheel in PyPI release. (See 8636a32.)
0.1.5
- Performance improvement: use re.escape() rather than failing to compile a small subset of codes.
- Remove an unused constant in __init__.py.