Extract, detect and count emoji

Emoji Extractor

Extract and count emoji from text efficiently and accurately. Fully supports multi-part emoji (skin tones, zero-width joiners, flags).

Installation

pip install emoji_extractor

Usage examples: see this Jupyter notebook

Quick Start

You can use the top-level convenience functions to extract emoji using the default (latest) Unicode version:

from emoji_extractor import count_emoji, detect_emoji

# Returns a Counter object of emojis and their counts
counts = count_emoji("I love apples 🍎 and bananas 🍌🍌")
print(counts)
# Counter({'🍌': 2, '🍎': 1})

# Check if a string has emoji
has_emoji = detect_emoji("No emoji here") # False

Advanced Usage (Version Selection)

By default, the package uses the latest available Unicode Emoji data. If you need to extract emoji precisely as they were defined in a specific historical Unicode version, instantiate the Extractor class:

from emoji_extractor import Extractor

# Initialise an extractor for a specific version
ext_14 = Extractor(version='14.0')
ext_15 = Extractor(version='15.0')

# 🩷 Pink heart was introduced in 15.0
print(ext_14.detect_emoji("🩷")) # False
print(ext_15.detect_emoji("🩷")) # True

Available versions: 4.0, 5.0, 11.0, 12.0, 12.1, 13.0, 13.1, 14.0, 15.0, 15.1, 16.0, 17.0.
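
To illustrate how a version-specific emoji inventory can be derived, here is a minimal sketch that filters lines in the emoji-test.txt format by their "E" version tag. This is not the package's own code: the package ships separate data per version, while `SAMPLE`, `parse_line` and `inventory` are hypothetical names, and the approach assumes the E-version tag found in recent releases of the file.

```python
# A few lines in the emoji-test.txt layout: code points ; status # emoji Eversion name
SAMPLE = """\
# group: Smileys & Emotion
1FAE0                                                  ; fully-qualified     # 🫠 E14.0 melting face
2764 FE0F                                              ; fully-qualified     # ❤️ E0.6 red heart
1FA77                                                  ; fully-qualified     # 🩷 E15.0 pink heart
"""


def parse_line(line):
    """Return (emoji, status, version_tuple), or None for comments and blanks."""
    if not line.strip() or line.startswith("#"):
        return None
    fields, comment = line.split("#", 1)
    codepoints, status = (part.strip() for part in fields.split(";"))
    # Build the emoji from the hex code points rather than trusting the comment.
    emoji = "".join(chr(int(cp, 16)) for cp in codepoints.split())
    version = comment.split()[1]  # e.g. "E15.0"
    return emoji, status, tuple(int(n) for n in version[1:].split("."))


def inventory(text, max_version):
    """Collect every emoji introduced at or before max_version."""
    parsed = filter(None, map(parse_line, text.splitlines()))
    return {emoji for emoji, _status, version in parsed if version <= max_version}


PINK_HEART = "\U0001FA77"
print(PINK_HEART in inventory(SAMPLE, (14, 0)))  # False
print(PINK_HEART in inventory(SAMPLE, (15, 0)))  # True
```

Comparing versions as integer tuples avoids the string-ordering trap where "5.0" would sort after "15.0".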

Details & Features

  • Accurate Counting: Uses dynamically generated regular expressions to properly capture multi-codepoint sequences, including ZWJ sequences like '💁🏽‍♂️' and flags.
  • Historical Accuracy: Supports strict adherence to older Unicode specifications, avoiding false positives on newer emoji.
  • Always Up to Date: Automatically checks for new Unicode releases via GitHub Actions and updates itself.
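
To see why multi-part emoji need sequence-aware matching, it can help to inspect the code points behind a single rendered glyph. The sketch below uses only the standard library, not this package; `describe` is a hypothetical helper, and the sequences are spelled out with escapes so the code points are unambiguous.

```python
import unicodedata

# A ZWJ sequence like the one mentioned above: person tipping hand
# + skin tone modifier + zero-width joiner + male sign + variation selector.
TIPPING_HAND = "\U0001F481\U0001F3FD\u200D\u2642\uFE0F"
# A flag is a pair of regional indicator symbols (here G + B).
FLAG_GB = "\U0001F1EC\U0001F1E7"


def describe(sequence):
    """List each code point in the sequence with its Unicode name."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in sequence
    ]


for label, name in describe(TIPPING_HAND):
    print(label, name)

# Counting code points overcounts: one glyph each, but 5 and 2 code points.
print(len(TIPPING_HAND), len(FLAG_GB))  # 5 2
```

A counter that works per code point would report five separate items for the first glyph, which is the failure mode the sequence-aware regex avoids.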

How it works under the hood

The package relies on official Unicode data parsed from emoji-test.txt. Inside the data/ folder for each version, it generates:

  • possible_emoji.json: A set of all characters that could possibly be part of an emoji (used as a fast initial filter before checking the regex).
  • big_regex.txt: A large alternation of exact-match strings joined with | in order of decreasing length. Because regex alternation tries alternatives from left to right, this ordering ensures multi-part emoji are matched before their individual components.
  • tme_regex.txt: Regex definitions for Tone-Modifiable Emoji.
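
The two ideas described above, a cheap character prefilter followed by a longest-first alternation, can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the package's implementation: the three-item inventory and the names `build_pattern` and `count_sketch` are hypothetical.

```python
import re
from collections import Counter

WOMAN = "\U0001F469"
LAPTOP = "\U0001F4BB"
ZWJ = "\u200D"
TECHNOLOGIST = WOMAN + ZWJ + LAPTOP  # woman technologist, a ZWJ sequence

# Hypothetical inventory; the real data covers every emoji in a Unicode version.
EMOJI = [WOMAN, LAPTOP, TECHNOLOGIST]

# Every character that could be part of an emoji (the fast initial filter).
POSSIBLE_CHARS = set("".join(EMOJI))


def build_pattern(sequences):
    """Join escaped sequences with '|' in order of decreasing length."""
    ordered = sorted(sequences, key=len, reverse=True)
    return re.compile("|".join(re.escape(s) for s in ordered))


PATTERN = build_pattern(EMOJI)


def count_sketch(text):
    """Count emoji, skipping the regex when no candidate character is present."""
    if POSSIBLE_CHARS.isdisjoint(text):
        return Counter()
    return Counter(PATTERN.findall(text))


# Shortest-first ordering splits the ZWJ sequence into its components.
naive = re.compile("|".join(re.escape(s) for s in EMOJI))
print(naive.findall(TECHNOLOGIST) == [WOMAN, LAPTOP])    # True
print(PATTERN.findall(TECHNOLOGIST) == [TECHNOLOGIST])  # True
```

The prefilter pays off on text with no emoji at all, where a set lookup per character is cheaper than running a very large alternation.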

(Note: Prior versions of this package used .pkl files, but it has since migrated to standard formats such as JSON and TXT for better security and cross-platform compatibility.)

Some emoji include the variation selector 0xFE0F (U+FE0F, VARIATION SELECTOR-16), but some platforms strip it and still render the emoji form. The regex used here captures both forms (e.g. 0xFE0F after each emoji codepoint vs no 0xFE0F). See Unicode's Full Emoji List and search for '0xFE0F' to see which emoji this potentially affects.
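
One way to accept both forms is to make each variation selector in a sequence optional when building the pattern. The sketch below is a simplified assumption about how this could be done, not the package's actual regex; `vs16_tolerant` is a hypothetical helper.

```python
import re

VS16 = "\uFE0F"  # VARIATION SELECTOR-16
RED_HEART = "\u2764" + VS16


def vs16_tolerant(sequence):
    """Return a regex source in which every U+FE0F in the sequence is optional."""
    parts = sequence.split(VS16)
    return (VS16 + "?").join(re.escape(part) for part in parts)


pattern = re.compile(vs16_tolerant(RED_HEART))

print(bool(pattern.fullmatch(RED_HEART)))  # True: selector present
print(bool(pattern.fullmatch("\u2764")))   # True: selector stripped
print(bool(pattern.fullmatch("\u2665")))   # False: a different character
```

Note that the two spellings are distinct strings, so a counter built on the raw matches would keep them as separate keys unless they are normalised first.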

Other work

If you want to do stuff more complicated than simply detecting, extracting and counting emoji then you might find this Python package useful.

Anything else

Feel free to email me about any of this stuff.

Download files

Source Distribution

emoji_extractor-17.0.tar.gz (203.8 kB view details)

Uploaded Source

Built Distribution

emoji_extractor-17.0-py2.py3-none-any.whl (204.6 kB view details)

Uploaded Python 2, Python 3

File details

Details for the file emoji_extractor-17.0.tar.gz.

File metadata

  • Download URL: emoji_extractor-17.0.tar.gz
  • Upload date:
  • Size: 203.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for emoji_extractor-17.0.tar.gz
Algorithm Hash digest
SHA256 4d47851132a2cdba5728a0fe2aaba8bb41c6cb438077bb9610889fd036569251
MD5 479bf45bfb2688573c9ef5a133f33049
BLAKE2b-256 1aa1c104d19f656210fa7559e89661138683bd54dc1692fe6c671755fafdde74

Provenance

The following attestation bundles were made for emoji_extractor-17.0.tar.gz:

Publisher: publish.yml on alexanderrobertson/emoji-extractor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file emoji_extractor-17.0-py2.py3-none-any.whl.

File hashes

Hashes for emoji_extractor-17.0-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 b12f24bc8c2a1168f4fb85f39545b753a767f637a6029f71c92d7fde31bce73f
MD5 96d805dbe18a83546f135d96d439c97b
BLAKE2b-256 6017fc8342152d9eaefc60652ed676e795a07c5b1dd3a947f902a3f4687ccbd2

Provenance

The following attestation bundles were made for emoji_extractor-17.0-py2.py3-none-any.whl:

Publisher: publish.yml on alexanderrobertson/emoji-extractor
