
Accurately remove and replace emojis in text strings.


demoji

Accurately find or remove emojis from a blob of text.



Basic Usage

demoji requires an initial data download from the Unicode Consortium's emoji code repository.

On first use of the package, call download_codes():

>>> import demoji
>>> demoji.download_codes()
Downloading emoji data ...
... OK (Got response in 0.14 seconds)
Writing emoji data to /Users/brad/.demoji/codes.json ...
... OK

This will store the Unicode hex-notated symbols at ~/.demoji/codes.json for future use.

demoji exports two text-related functions, findall() and replace(), which behave somewhat like the re module's findall() and sub(), respectively. However, findall() returns a dictionary mapping each emoji to its full name (description):

>>> tweet = """\
... #startspreadingthenews yankees win great start by 🎅🏾 going 5strong innings with 5k’s🔥 🐂
... solo homerun 🌋🌋 with 2 solo homeruns and👹 3run homerun… 🤡 🚣🏼 👨🏽‍⚖️ with rbi’s … 🔥🔥
... 🇲🇽 and 🇳🇮 to close the game🔥🔥!!!….
... WHAT A GAME!!..
... """
>>> demoji.findall(tweet)
{
    "🔥": "fire",
    "🌋": "volcano",
    "👨🏽\u200d⚖️": "man judge: medium skin tone",
    "🎅🏾": "Santa Claus: medium-dark skin tone",
    "🇲🇽": "flag: Mexico",
    "👹": "ogre",
    "🤡": "clown face",
    "🇳🇮": "flag: Nicaragua",
    "🚣🏼": "person rowing boat: medium-light skin tone",
    "🐂": "ox",
}

demoji requires a download rather than shipping with pre-packaged Unicode emoji data because the emoji list itself is updated frequently. You can refresh the local cache at any time by calling demoji.download_codes() again.

To pull your last-downloaded date, you can use the last_downloaded_timestamp() helper:

>>> demoji.last_downloaded_timestamp()
datetime.datetime(2019, 2, 9, 7, 42, 24, 433776, tzinfo=<demoji.UTC object at 0x101b9ecf8>)

The result will be None if codes have not previously been downloaded.
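
As a rough sketch of how that timestamp could drive a periodic refresh, the helper below decides whether the cache is missing or too old. The name codes_are_stale and the 30-day default are assumptions made for illustration; they are not part of demoji. The helper takes the timestamp as an argument so it can run without the package installed.

```python
from datetime import datetime, timedelta, timezone


def codes_are_stale(last_downloaded, max_age=timedelta(days=30), now=None):
    """Return True if the emoji cache is missing or older than max_age.

    last_downloaded is expected to be the value returned by
    demoji.last_downloaded_timestamp(): a timezone-aware datetime,
    or None if codes were never downloaded.
    """
    if last_downloaded is None:
        return True
    if now is None:
        # Use an aware "now" so it can be subtracted from an aware timestamp.
        now = datetime.now(timezone.utc)
    return now - last_downloaded > max_age


# Intended use (needs demoji and network access, so left commented out):
# import demoji
# if codes_are_stale(demoji.last_downloaded_timestamp()):
#     demoji.download_codes()
```

Passing now explicitly is only a convenience for testing; in normal use it can be left out.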

Footnote: Emoji Sequences

Numerous emojis that look like single Unicode characters are actually multi-character sequences. Examples:

  • The keycap 2️⃣ is actually 3 characters, U+0032 (the ASCII digit 2), U+FE0F (variation selector), and U+20E3 (combining enclosing keycap).
  • The flag of Scotland is 7 component characters, b'\\U0001f3f4\\U000e0067\\U000e0062\\U000e0073\\U000e0063\\U000e0074\\U000e007f' in full escaped notation.

(You can see any of these through s.encode("unicode-escape").)
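
A small sketch of that inspection, using only the standard library. The two strings are built from the code points listed above so the result does not depend on how a terminal renders them; the variable names are illustrative.

```python
# Build both sequences from their code points rather than pasting the glyphs.
keycap_two = "2\ufe0f\u20e3"
scotland_flag = (
    "\U0001f3f4\U000e0067\U000e0062\U000e0073"
    "\U000e0063\U000e0074\U000e007f"
)

for name, s in [("keycap 2", keycap_two), ("flag: Scotland", scotland_flag)]:
    # len() counts code points, not what the eye perceives as one glyph.
    print(name, len(s), s.encode("unicode-escape"))
```

The loop reports lengths of 3 and 7, matching the component counts given in the list.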

demoji is careful to handle this and should find the full sequences rather than their incomplete subcomponents.

It does this by sorting emoji codes by their length and then compiling a concatenated regular expression that greedily searches for longer emojis first, falling back to shorter ones if none is found. This is by no means a highly optimized way of searching, as it has O(N^2) properties, but the focus is on accuracy and completeness.

>>> from pprint import pprint
>>> seq = """\
... I bet you didn't know that 🙋, 🙋‍♂️, and 🙋‍♀️ are three different emojis.
... """
>>> pprint(seq.encode('unicode-escape'))  # Python 3
(b"I bet you didn't know that \\U0001f64b, \\U0001f64b\\u200d\\u2642\\ufe0f,"
 b' and \\U0001f64b\\u200d\\u2640\\ufe0f are three different emojis.\\n')
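
The length-sorted alternation described in this footnote can be sketched with the standard re module. This is a minimal illustration of the idea, not demoji's actual source: the four-entry EMOJI_NAMES table and the function names findall_sketch and replace_sketch are assumptions, and the real table comes from the downloaded Unicode data.

```python
import re

# Hypothetical miniature code table, for illustration only.
EMOJI_NAMES = {
    "\U0001f64b": "person raising hand",
    "\U0001f64b\u200d\u2642\ufe0f": "man raising hand",
    "\U0001f64b\u200d\u2640\ufe0f": "woman raising hand",
    "\U0001f525": "fire",
}

# Python's regex alternation takes the first alternative that matches, so
# longer sequences must come before the shorter ones they start with.
_codes = sorted(EMOJI_NAMES, key=len, reverse=True)
EMOJI_PAT = re.compile("|".join(re.escape(code) for code in _codes))


def findall_sketch(text):
    """Map each distinct emoji found in text to its name."""
    return {match: EMOJI_NAMES[match] for match in EMOJI_PAT.findall(text)}


def replace_sketch(text, repl=""):
    """Substitute every emoji match in text with repl."""
    return EMOJI_PAT.sub(repl, text)


seq = (
    "Hands: \U0001f64b, \U0001f64b\u200d\u2642\ufe0f, "
    "\U0001f64b\u200d\u2640\ufe0f \U0001f525\U0001f525"
)
found = findall_sketch(seq)
stripped = replace_sketch(seq)
```

Without the sort, the bare U+1F64B alternative could match first and leave a stray zero-width joiner and gender sign behind in the text.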

