Extract, detect and count emoji
Emoji Extractor
Extract, detect, and count emoji from text — fast and accurate. Fully supports multi-codepoint sequences (skin tones, ZWJ sequences, flags).
Uses a trie-based greedy longest-match engine (pure Python, zero dependencies) that is 27× faster than regex for single strings and 115× faster when processing large datasets with automatic multiprocessing.
Installation
pip install emoji_extractor
Quick Start
Use the top-level convenience functions for simple tasks:
from emoji_extractor import count_emoji, detect_emoji
# Check if a string contains emoji
detect_emoji("Hello 👋") # True
detect_emoji("No emoji") # False
# Count emoji in a single string — returns a Counter
counts = count_emoji("I love 🍎 and 🍌🍌")
print(counts)
# Counter({'🍌': 2, '🍎': 1})
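Because the result is a standard collections.Counter, the usual Counter operations apply to it. A small sketch, using a hand-built Counter as a stand-in for real output so that it runs without the package installed:

```python
from collections import Counter

# Stand-in for the result of count_emoji("I love 🍎 and 🍌🍌")
counts = Counter({"🍌": 2, "🍎": 1})

total = sum(counts.values())                 # total number of emoji found
distinct = len(counts)                       # number of different emoji
top_emoji, top_n = counts.most_common(1)[0]  # most frequent emoji and its count

# Counters add together, which is handy for combining per-line results
combined = counts + Counter({"🍎": 4})
```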
Single Strings vs Bulk Processing
The package provides two tiers of counting methods:
count_emoji(string) — Single string
Scans one string and returns a Counter. Fast enough for real-time use (~9µs per line).
from emoji_extractor import Extractor
ext = Extractor()
ext.count_emoji("Great job 🎉🎉🎉")
# Counter({'🎉': 3})
count_all_emoji(iterable) — Bulk processing
Processes a list (or any iterable) of strings. For inputs with 1,000+ lines, work is automatically distributed across multiple CPU cores for significantly faster throughput.
tweets = ["Love this 🍎", "So funny 😂😂", "Hello world", ...]
# Automatically parallelised for large inputs
totals = ext.count_all_emoji(tweets)
print(totals.most_common(5))
# [('😂', 2813), ('❤', 1150), ('😍', 974), ...]
| Method | Input | Parallelised? |
|---|---|---|
| count_emoji(string) | Single string | No (already ~9µs) |
| count_all_emoji(iterable) | List of strings | Yes, for ≥1000 lines |
| count_tme(string) | Single string | No |
| count_all_tme(iterable) | List of strings | Yes, for ≥1000 lines |
| count_tones(string) | Single string | No |
| count_all_tones(iterable) | List of strings | Yes, for ≥1000 lines |
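The general pattern behind this kind of bulk counting can be sketched as: split the input into chunks, count each chunk in a worker process, then merge the partial Counters. The code below is only an illustration of that pattern, not the package's implementation. The names count_line, count_chunk, chunked, count_all and PARALLEL_THRESHOLD are hypothetical, and count_line is a deliberately crude stand-in rather than a real emoji matcher.

```python
from collections import Counter
from multiprocessing import Pool

PARALLEL_THRESHOLD = 1000  # assumed cut-off, mirroring the "≥1000 lines" rule


def count_line(line):
    """Crude stand-in for a per-string counter: counts code points >= U+1F300."""
    return Counter(ch for ch in line if ord(ch) >= 0x1F300)


def count_chunk(lines):
    """Count one chunk of lines and return a single merged Counter."""
    total = Counter()
    for line in lines:
        total.update(count_line(line))
    return total


def chunked(items, size):
    """Split a list into consecutive slices of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_all(lines, n_workers=1):
    """Count across all lines, using worker processes only for large inputs."""
    lines = list(lines)
    if n_workers <= 1 or len(lines) < PARALLEL_THRESHOLD:
        return count_chunk(lines)              # serial path for small inputs
    chunk_size = -(-len(lines) // n_workers)   # ceiling division
    with Pool(n_workers) as pool:
        partials = pool.map(count_chunk, chunked(lines, chunk_size))
    return sum(partials, Counter())            # merge the partial results
```

When calling a function like this with n_workers greater than 1 from a script, the call should sit under an `if __name__ == "__main__":` guard, since platforms that start workers with the spawn method re-import the main module.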
Advanced Usage
Version Selection
By default, the package uses the latest Unicode Emoji data (currently 17.0). To extract emoji as defined in a specific historical version:
from emoji_extractor import Extractor
ext_14 = Extractor(version='14.0')
ext_15 = Extractor(version='15.0')
# 🩷 Pink heart was introduced in 15.0
ext_14.detect_emoji("🩷") # False
ext_15.detect_emoji("🩷") # True
Available versions: 4.0, 5.0, 11.0, 12.0, 12.1, 13.0, 14.0, 15.0, 15.1, 16.0, 17.0.
Tone-Modifiable Emoji
count_tme() counts emoji that support skin tone modifiers, including their unmodified base forms; count_tones() counts the tone modifiers themselves:
ext = Extractor()
ext.count_tme("High five ✋🏽")
# Counter({'✋🏽': 1})
ext.count_tones("Waves 👋🏻👋🏿")
# Counter({'🏻': 1, '🏿': 1})
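For intuition about what the tone counts represent: the five skin tone modifiers are the code points U+1F3FB to U+1F3FF. The sketch below counts them directly in a string. It is a simplified illustration (the helper name count_tone_modifiers is hypothetical), and unlike a real matcher it does not check that each modifier follows a valid base emoji.

```python
from collections import Counter

# The five skin tone modifiers occupy U+1F3FB..U+1F3FF
TONE_MODIFIERS = {chr(cp) for cp in range(0x1F3FB, 0x1F400)}


def count_tone_modifiers(text):
    """Count each skin tone modifier character appearing in text."""
    return Counter(ch for ch in text if ch in TONE_MODIFIERS)


# "Waves 👋🏻👋🏿" written with escapes so the code points are explicit
waves = "Waves \U0001F44B\U0001F3FB\U0001F44B\U0001F3FF"
tones = count_tone_modifiers(waves)
```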
Controlling Parallelism
# Use fewer workers (default: min(cpu_count, 8))
ext = Extractor(n_workers=4)
# Disable multiprocessing entirely
ext = Extractor(n_workers=1)
# Clean up worker processes when done
ext.close()
Details & Features
- Accurate Counting: Uses a greedy longest-match trie to correctly handle multi-codepoint emoji, including ZWJ sequences like 👩‍🦰 and flag sequences like 🇬🇧.
- Fast: 27× faster than regex for single strings. 115× faster with parallelism for bulk data.
- Zero Dependencies: Pure Python — no external packages required.
- Historical Accuracy: Supports strict adherence to older Unicode specifications, avoiding false positives on newer emoji.
- Always Up to Date: Automatically checks for new Unicode releases via GitHub Actions and updates itself.
How It Works Under the Hood
The package relies on official Unicode data parsed from emoji-test.txt. For each supported version, the data/ folder contains:
- emoji_sequences.json: All emoji strings, sorted longest-first. Used to build a nested-dict trie for greedy matching.
- tme_sequences.json: Tone-modifiable emoji sequences.
- possible_emoji.json: A set of all characters that could be part of an emoji (used by detect_emoji() for fast presence checking).
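A character-set check of the kind described for possible_emoji.json can be sketched as a set intersection test. This is an assumption about the general idea rather than the package's code: might_contain_emoji is a hypothetical name, and the set below is a tiny hand-picked stand-in for the real data file.

```python
def might_contain_emoji(text, possible_chars):
    """Cheap prefilter: True if any character could belong to an emoji."""
    return not possible_chars.isdisjoint(text)


# Tiny hypothetical stand-in for the contents of possible_emoji.json
possible = {"\U0001F44B", "\U0001F34E", "\u2764", "\uFE0F"}

has_candidate = might_contain_emoji("Hello \U0001F44B", possible)
no_candidate = might_contain_emoji("No emoji", possible)
```

A check like this is fast because it never walks the trie. If the set contains characters that also occur in ordinary text (digits, for example, appear in keycap sequences), a hit only signals a candidate that still needs a full match.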
The trie scanner walks through text character by character, always matching the longest possible emoji sequence at each position. This naturally handles cases where a shorter emoji is a prefix of a longer one (e.g., 👩 vs 👩🦰).
Note: Some emoji include a variation selector (U+FE0F), but some platforms strip it while still rendering the emoji. The trie captures both forms.
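The greedy longest-match scan described above can be sketched with a nested-dict trie. This is a minimal illustration of the technique, not the package's actual code: build_trie, extract and the END sentinel are hypothetical names, and the four sequences are a hand-picked sample that includes a heart both with and without U+FE0F.

```python
END = ""  # sentinel key marking "a complete sequence ends here"


def build_trie(sequences):
    """Build a nested-dict trie with one level per code point."""
    root = {}
    for seq in sequences:
        node = root
        for ch in seq:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def extract(text, trie):
    """Return emoji found in text, taking the longest match at each position."""
    found = []
    i = 0
    while i < len(text):
        node, j, longest_end = trie, i, None
        # Walk as deep as the text allows, remembering the last complete match
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if END in node:
                longest_end = j
        if longest_end is None:
            i += 1                       # no emoji starts here
        else:
            found.append(text[i:longest_end])
            i = longest_end              # resume after the matched emoji
    return found


WOMAN = "\U0001F469"
WOMAN_RED_HAIR = "\U0001F469\u200D\U0001F9B0"   # woman + ZWJ + red hair
HEART = "\u2764"
HEART_VS = "\u2764\uFE0F"                       # heart + variation selector

trie = build_trie([WOMAN, WOMAN_RED_HAIR, HEART, HEART_VS])
result = extract("a" + WOMAN_RED_HAIR + WOMAN + HEART_VS + HEART, trie)
```

Remembering the last complete match lets the scanner fall back to the shorter emoji when a longer sequence starts but does not finish, for example a woman followed by a stray ZWJ.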
Changelog
17.0.2
- Engine: Regex replaced with pure-Python trie (27× faster single, 115× bulk with multiprocessing)
- Data: big_regex.txt / tme_regex.txt → emoji_sequences.json / tme_sequences.json
- check_first parameter is now a no-op (accepted for compatibility)
- count_all_* methods auto-parallelise for large inputs
- Added n_workers parameter and close() method to Extractor
- Removed Extractor.big_regex and Extractor.tme (raise helpful error if accessed)
Other Work
If you want to do more than detecting, extracting, and counting emoji, this Python package may be useful.
Contact
Feel free to email me about any of this stuff.