Extract, detect and count emoji

Emoji Extractor

Extract and count emoji from text efficiently and accurately. Fully supports multi-part emoji (skin tones, zero-width joiners, flags).

Installation

pip install emoji_extractor

Usage examples: see this Jupyter notebook

Quick Start

You can use the top-level convenience functions to extract emoji using the default (latest) Unicode version:

from emoji_extractor import count_emoji, detect_emoji

# Returns a Counter object of emojis and their counts
counts = count_emoji("I love apples 🍎 and bananas 🍌🍌")
print(counts)
# Counter({'🍌': 2, '🍎': 1})

# Check if a string has emoji
has_emoji = detect_emoji("No emoji here") # False

Advanced Usage (Version Selection)

By default, the package uses the latest available Unicode Emoji data. If you need to extract emoji precisely as they were defined in a specific historical Unicode version, instantiate the Extractor class:

from emoji_extractor import Extractor

# Initialise an extractor for a specific version
ext_14 = Extractor(version='14.0')
ext_15 = Extractor(version='15.0')

# 🩷 Pink heart was introduced in 15.0
print(ext_14.detect_emoji("🩷")) # False
print(ext_15.detect_emoji("🩷")) # True

Available versions: 4.0, 5.0, 11.0, 12.0, 12.1, 13.0, 13.1, 14.0, 15.0, 15.1, 16.0, 17.0.

Details & Features

  • Accurate Counting: Uses dynamically generated regular expressions to properly capture multi-codepoint sequences, including ZWJ sequences like '💁🏽‍♂️' and flags.
  • Historical Accuracy: Supports strict adherence to older Unicode specifications, avoiding false positives on newer emoji.
  • Always Up to Date: Automatically checks for new Unicode releases via GitHub Actions and updates itself.
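To illustrate why multi-codepoint handling matters, here is a small standard-library sketch (not part of emoji_extractor) that breaks two emoji into their code points. The helper name `code_points` is hypothetical, and the escaped sequence is assumed to be the fully qualified form of the ZWJ example above.

```python
def code_points(text):
    """Return the code points of `text` in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Person tipping hand + medium skin tone + ZWJ + male sign + variation selector
TIPPING_HAND = "\U0001F481\U0001F3FD\u200D\u2642\uFE0F"
# United Kingdom flag: two regional indicator symbols (G, then B)
UK_FLAG = "\U0001F1EC\U0001F1E7"

# One visible emoji each, but several code points underneath
print(len(TIPPING_HAND), code_points(TIPPING_HAND))
print(len(UK_FLAG), code_points(UK_FLAG))
```

Counting code points individually would report several emoji where a reader sees one, which is the problem the sequence-aware regex is meant to avoid.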

How it works under the hood

The package relies on official Unicode data parsed from emoji-test.txt. Inside the data/ folder for each version, it generates:

  • possible_emoji.json: A set of all characters that could possibly be part of an emoji (used as a fast initial filter before checking the regex).
  • big_regex.txt: A large alternation of exact-match strings joined with | in order of decreasing length. Because the regex engine tries alternatives in order, this guarantees that multi-part emoji are matched before their individual components.
  • tme_regex.txt: Regex definitions for Tone-Modifiable Emoji.

(Note: Prior versions of this package used .pkl files, but it has since migrated to standard formats like JSON/TXT for better security and cross-platform compatibility.)
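The two-stage approach described above can be sketched with the standard library alone. This is a minimal illustration, not the package's actual code: `SAMPLE_EMOJI` is a tiny hand-picked stand-in for the per-version data, and `count_sample_emoji` is a hypothetical name.

```python
import re
from collections import Counter

# Hypothetical stand-in for the data parsed from emoji-test.txt
SAMPLE_EMOJI = [
    "\U0001F44D",              # thumbs up
    "\U0001F44D\U0001F3FD",    # thumbs up, medium skin tone
    "\U0001F468",              # man
    "\U0001F469",              # woman
    "\U0001F467",              # girl
    "\U0001F468\u200D\U0001F469\u200D\U0001F467",  # family (ZWJ sequence)
    "\U0001F34C",              # banana
]

# Stage 1 data: every character that can appear in any emoji
POSSIBLE = set("".join(SAMPLE_EMOJI))

# Stage 2 data: exact strings joined with |, longest first, so a
# sequence is tried before the shorter emoji it contains
PATTERN = re.compile(
    "|".join(re.escape(e) for e in sorted(SAMPLE_EMOJI, key=len, reverse=True))
)

def count_sample_emoji(text):
    """Count emoji from SAMPLE_EMOJI in `text`."""
    # Cheap set check first: skip the regex when no candidate character occurs
    if not POSSIBLE.intersection(text):
        return Counter()
    return Counter(PATTERN.findall(text))
```

With the alternatives sorted longest first, the family sequence is counted once instead of as separate man, woman and girl emoji. Reversing the sort order would split it, since Python's `re` takes the first alternative that matches.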

Some emoji include the variation selector U+FE0F (0xFE0F), but some platforms strip it and still render the emoji form. The regex used here captures both forms: with 0xFE0F after each emoji codepoint, and without it. See Unicode's Full Emoji List and search for '0xFE0F' to see which emoji this potentially affects.
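One way to match both forms is to make each U+FE0F in a sequence optional. The sketch below assumes that approach for illustration; the package's generated regex may be built differently, and `with_optional_vs16` is a hypothetical helper.

```python
import re

VS16 = "\uFE0F"  # variation selector-16, requests emoji presentation

def with_optional_vs16(emoji):
    """Return a regex fragment matching `emoji` with or without U+FE0F."""
    return re.escape(emoji).replace(VS16, VS16 + "?")

# Red heart: U+2764 followed by U+FE0F in its fully qualified form
HEART = re.compile(with_optional_vs16("\u2764\uFE0F"))

def find_hearts(text):
    """Return every heart in `text`, keeping U+FE0F when it is present."""
    return HEART.findall(text)
```

Because `?` is greedy, the selector is included in the match whenever the platform has kept it, so both spellings are captured as whole emoji.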

Other work

If you need to do something more complicated than simply detecting, extracting and counting emoji, you might find this Python package useful.

Anything else

Feel free to email me about any of this stuff.

Download files

Source Distribution

emoji_extractor-17.0.1.tar.gz (203.9 kB view details)

Uploaded Source

Built Distribution

emoji_extractor-17.0.1-py2.py3-none-any.whl (204.7 kB view details)

Uploaded Python 2, Python 3

File details

Details for the file emoji_extractor-17.0.1.tar.gz.

File metadata

  • Download URL: emoji_extractor-17.0.1.tar.gz
  • Size: 203.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for emoji_extractor-17.0.1.tar.gz
Algorithm Hash digest
SHA256 22a16b056648a3d69554ff1f639b0e9760be95b95b772d77e68646236d0e3ed6
MD5 0efd2d990d8bca3b7ebfb38f5a8dd059
BLAKE2b-256 974a0986c59879103b85af82c21ac4f12fe229169671810b3f86e139e7616b93

Provenance

The following attestation bundles were made for emoji_extractor-17.0.1.tar.gz:

Publisher: publish.yml on alexanderrobertson/emoji-extractor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file emoji_extractor-17.0.1-py2.py3-none-any.whl.

File hashes

Hashes for emoji_extractor-17.0.1-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 9e1678a564918ba79f9488bf906823f7570207027d5d30bf2d58257a4475c899
MD5 356d6c9b9614256b856fdeca1bb333d2
BLAKE2b-256 55908f5af311cc12f9d4eb4410b0dcb794ff437c0404eaab96122b58ee0e2bda

Provenance

The following attestation bundles were made for emoji_extractor-17.0.1-py2.py3-none-any.whl:

Publisher: publish.yml on alexanderrobertson/emoji-extractor
