Extract/Replace keywords in sentences. Fork with internationalization fixes for CJK and Unicode.
FlashText i18n
A maintained fork of FlashText with internationalization and Unicode fixes.
Why This Fork?
The original FlashText is no longer actively maintained and has several bugs with international text:
- CJK languages: Adjacent keywords not extracted (Chinese, Japanese, Korean)
- Unicode case folding: Wrong span positions for characters like Turkish İ
- Non-ASCII boundaries: Various edge cases with international characters
This fork aims to fix these issues while maintaining full API compatibility.
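To see why case folding can shift span positions, here is a small standard-library sketch. It only illustrates the general problem (lowercasing can change the length of a string) and is not taken from the library's code; the helper name `offsets_before_and_after_lowering` is hypothetical.

```python
def offsets_before_and_after_lowering(text, keyword):
    """Return (offset in the original text, offset in a lowercased copy).

    Hypothetical helper for illustration only; it is not part of FlashText.
    """
    original_offset = text.find(keyword)
    lowered_offset = text.lower().find(keyword.lower())
    return original_offset, lowered_offset


# Turkish capital dotted I lowercases to two code points: 'i' + U+0307.
print(len("İ"), len("İ".lower()))

# German sharp s keeps its length under lower() but not under casefold().
print(len("ß".lower()), len("ß".casefold()))

# Offsets found in the lowercased copy no longer match the original text.
print(offsets_before_and_after_lowering("İstanbul Python", "Python"))
```

Any matcher that searches a lowercased copy and reports those offsets against the original string can be off by one or more positions after such a character.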
Fixed in v3.0.0
CJK Adjacent Keywords
from flashtext import KeywordProcessor
kp = KeywordProcessor()
kp.add_keyword('雅詩蘭黛') # Estée Lauder
kp.add_keyword('小棕瓶') # Advanced Night Repair
text = '推薦雅詩蘭黛小棕瓶超好用'
result = kp.extract_keywords(text)
# Original FlashText: ['雅詩蘭黛'] ❌ Missing '小棕瓶'
# FlashText i18n: ['雅詩蘭黛', '小棕瓶'] ✅ Both extracted!
Installation
pip install flashtext-i18n
Or install from GitHub:
pip install git+https://github.com/termdock/flashtext-i18n.git
Usage
The API is 100% compatible with the original FlashText:
from flashtext import KeywordProcessor
# Create processor
kp = KeywordProcessor()
# Add keywords
kp.add_keyword('Python')
kp.add_keyword('機器學習', 'Machine Learning')
# Extract keywords
text = 'I love Python and 機器學習'
keywords = kp.extract_keywords(text)
# ['Python', 'Machine Learning']
# Extract with span info
keywords_with_span = kp.extract_keywords(text, span_info=True)
# [('Python', 7, 13), ('Machine Learning', 18, 22)]
# Replace keywords
new_text = kp.replace_keywords(text)
# 'I love Python and Machine Learning'
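The spans in the output above index into the original text, while the first element of each tuple is the clean name. The following sketch uses only the values shown in the example (no library call) to make that distinction visible; `matched_text` is a hypothetical helper name.

```python
def matched_text(text, spans):
    """Slice the original text with each (clean_name, start, end) span."""
    return [text[start:end] for _clean_name, start, end in spans]


text = 'I love Python and 機器學習'
# Span tuples copied from the usage example above.
spans = [('Python', 7, 13), ('Machine Learning', 18, 22)]

for (clean_name, start, end), surface in zip(spans, matched_text(text, spans)):
    print(f'{clean_name!r} was matched by {surface!r} at [{start}:{end}]')
```

So `'Machine Learning'` is the name reported for the match, while the slice `text[18:22]` still yields the surface form `'機器學習'`.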
Performance
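FlashText uses a trie-based algorithm inspired by Aho-Corasick. Matching time grows roughly linearly with the length of the input text, O(n), and is largely independent of the number of keywords, which makes it fast for keyword extraction from large texts.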
| Benchmark | FlashText | Regex |
|---|---|---|
| 1000 keywords, 1M chars | ~0.1s | ~10s+ |
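The idea behind the trie-based scan can be sketched in a few lines. This is a toy illustration under stated assumptions, not the library's implementation: it does no word-boundary or case handling, and the names `build_trie`, `extract`, and the `_end_` marker are hypothetical.

```python
def build_trie(keywords):
    """Build a nested-dict trie; the '_end_' key stores the clean name."""
    root = {}
    for word, clean_name in keywords.items():
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["_end_"] = clean_name
    return root


def extract(text, root):
    """Greedy longest-match scan over the text (toy: no boundary checks)."""
    found = []
    i = 0
    while i < len(text):
        node = root
        j = i
        longest = None  # (clean_name, end) of the longest match starting at i
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if "_end_" in node:
                longest = (node["_end_"], j)
        if longest:
            found.append((longest[0], i, longest[1]))
            i = longest[1]  # resume right after the match
        else:
            i += 1
    return found


trie = build_trie({"雅詩蘭黛": "雅詩蘭黛", "小棕瓶": "小棕瓶"})
print(extract("推薦雅詩蘭黛小棕瓶超好用", trie))
```

The work done at each position is bounded by the length of the longest keyword rather than by how many keywords are stored, which is why adding keywords costs little at match time. A regex alternation over the same keywords may have to try many alternatives at each position.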
Roadmap
See Issues for planned fixes:
- Unicode case folding span fix (Turkish İ, German ß)
- Keywords followed by numbers extraction
- Internationalized word boundary detection
- Indian languages (Devanagari) support
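As background for the word-boundary and Devanagari items, the sketch below shows one general reason boundary detection is harder outside ASCII: Devanagari vowel signs are combining marks, which Python does not classify as alphanumeric. Whether this is the exact cause of the issues tracked here is an assumption; `char_report` is a hypothetical helper.

```python
import unicodedata


def char_report(word):
    """Return (code point, Unicode category, str.isalnum()) per character."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch), ch.isalnum())
            for ch in word]


# The word "Hindi" in Devanagari: HA, VOWEL SIGN I, ANUSVARA, DA, VOWEL SIGN II
hindi = "\u0939\u093f\u0902\u0926\u0940"

for row in char_report(hindi):
    print(row)
```

A boundary check that treats only alphanumeric characters as "inside a word" would see a boundary in the middle of this single word.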
Credits
This project is a fork of FlashText created by Vikash Singh.
The original FlashText algorithm is described in the paper: Replace or Retrieve Keywords In Documents at Scale
License
MIT License - see LICENSE file.
The original copyright belongs to Vikash Singh (2017). This fork is maintained by termdock & Huang Chung Yi.