Extract/Replace keywords in sentences. Fork with internationalization fixes for CJK and Unicode.

Maintainers

cyh289

Project description

FlashText i18n

A maintained fork of FlashText with internationalization and Unicode fixes.

Why This Fork?

The original FlashText is no longer actively maintained and has several bugs with international text:

- CJK languages (Chinese, Japanese, Korean): adjacent keywords are not extracted
- Unicode case folding: wrong span positions for characters such as Turkish İ
- Non-ASCII boundaries: various edge cases with international characters

This fork aims to fix these issues while maintaining full API compatibility.
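
To see why case folding can shift span positions, note that lowercasing is not always length-preserving in Unicode. The sketch below uses only the Python standard library (it does not call FlashText), and the sample sentence is made up for illustration:

```python
# Lowercasing Turkish 'İ' (U+0130) yields two code points: 'i' + U+0307.
text = "\u0130stanbul is big"   # 'İstanbul is big'
lowered = text.lower()

# Offsets computed on the lowercased copy no longer line up with the original.
start_in_original = text.index("big")
start_in_lowered = lowered.index("big")

print(len(text), len(lowered))              # 15 16
print(start_in_original, start_in_lowered)  # 12 13
```

Any matcher that searches a lowercased copy and reports those offsets against the original text would be off by one here, which is the kind of span error listed above.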

Fixed in v3.0.0

International Word Boundaries (New in v3.1.0)

The original FlashText only supported ASCII characters (A-Za-z0-9_) as word parts. This caused issues for many languages where characters like é, ß, or ç were treated as delimiters, breaking words apart.

Fixed in v3.1.0: All valid Unicode alphanumeric characters are now treated as part of a word by default.

from flashtext import KeywordProcessor

kp = KeywordProcessor()

# Hindi (Devanagari)
kp.add_keyword('नमस्ते')
kp.extract_keywords('नमस्ते दुनिया')
# ✅ ['नमस्ते'] (Previously failed)

# French/German
kp.add_keyword('café')
kp.extract_keywords('I went to a café.')
# ✅ ['café'] (Previously extracted 'caf')
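
The difference between the two boundary rules can be sketched without the library. The helpers below are hypothetical and are not the fork's code: one treats only `A-Za-z0-9_` as word characters, the other accepts any Unicode alphanumeric character plus combining marks (an assumption added here, because Devanagari vowel signs are not alphanumeric according to `str.isalnum()`):

```python
import string
import unicodedata

# ASCII-only word characters, as described for the original FlashText.
ASCII_WORD_CHARS = set(string.ascii_letters + string.digits + "_")

def is_word_char_ascii(ch: str) -> bool:
    return ch in ASCII_WORD_CHARS

def is_word_char_unicode(ch: str) -> bool:
    # Hypothetical rule: letters, digits, underscore, plus combining marks
    # (Unicode category M*), which scripts such as Devanagari use inside words.
    return ch.isalnum() or ch == "_" or unicodedata.category(ch).startswith("M")

def split_words(text: str, is_word_char) -> list[str]:
    """Split text into maximal runs of word characters."""
    words, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

sample = "a caf\u00e9."  # 'a café.' with a precomposed é
print(split_words(sample, is_word_char_ascii))    # ['a', 'caf']
print(split_words(sample, is_word_char_unicode))  # ['a', 'café']
```

With the ASCII rule, `é` acts as a delimiter and the word is cut to `caf`, which matches the failure described above.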

CJK Adjacent Keywords

from flashtext import KeywordProcessor

kp = KeywordProcessor()
kp.add_keyword('雅詩蘭黛')  # Estée Lauder
kp.add_keyword('小棕瓶')    # Advanced Night Repair

text = '推薦雅詩蘭黛小棕瓶超好用'
result = kp.extract_keywords(text)
# Original FlashText: ['雅詩蘭黛']  ❌ Missing '小棕瓶'
# FlashText i18n:     ['雅詩蘭黛', '小棕瓶']  ✅ Both extracted!
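
Extracting keywords that touch each other can be illustrated with a small trie scan that resumes exactly where the previous match ended. This is a simplified sketch, not the fork's implementation: `build_trie`, `extract_adjacent`, and the `_end_` marker are hypothetical names, and word-boundary checks are deliberately left out.

```python
def build_trie(keywords):
    """Build a nested-dict trie; '_end_' stores the keyword ending at a node."""
    root = {}
    for keyword in keywords:
        node = root
        for ch in keyword:
            node = node.setdefault(ch, {})
        node["_end_"] = keyword
    return root

def extract_adjacent(text, trie):
    """Greedy longest-match scan that continues right after each match."""
    found = []
    i = 0
    while i < len(text):
        node, j, last = trie, i, None
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if "_end_" in node:
                last = (node["_end_"], j)  # remember the longest match so far
        if last:
            found.append(last[0])
            i = last[1]  # resume at the character right after the match
        else:
            i += 1
    return found

trie = build_trie(["雅詩蘭黛", "小棕瓶"])
print(extract_adjacent("推薦雅詩蘭黛小棕瓶超好用", trie))
# ['雅詩蘭黛', '小棕瓶']
```

Because the scan restarts at the first character after a match instead of skipping a delimiter, the second keyword is found even though no space separates it from the first.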

Loading Keywords from File (New in v3.1.0)

You can now load keywords directly from JSON or text files.

```python
# keywords.json
# {
#    "Color": ["red", "blue", "green"],
#    "Vehicle": ["car", "bike"]
# }

kp.add_keyword_from_file('keywords.json')
```
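
If you want to inspect or validate such a file before handing it to the library, the JSON shape shown above can be read with the standard library. This is a sketch under assumptions: `load_keyword_dict` is a hypothetical helper, and the validation rules are a guess at what a well-formed file looks like, not the fork's own checks.

```python
import json
import tempfile
from pathlib import Path

def load_keyword_dict(path: str) -> dict[str, list[str]]:
    """Read a {clean_name: [keyword, ...]} mapping from a JSON file."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    for clean_name, keywords in data.items():
        if not isinstance(keywords, list) or not all(isinstance(k, str) for k in keywords):
            raise ValueError(f"{clean_name!r} must map to a list of strings")
    return data

# Write the sample file from the example, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = str(Path(tmp) / "keywords.json")
    Path(sample_path).write_text(
        json.dumps({"Color": ["red", "blue", "green"], "Vehicle": ["car", "bike"]}),
        encoding="utf-8",
    )
    keyword_dict = load_keyword_dict(sample_path)

print(keyword_dict["Vehicle"])  # ['car', 'bike']
```

The original FlashText accepts a dictionary of this shape through `add_keywords_from_dict`, so the validated mapping can be passed on that way.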

Installation

pip install flashtext-i18n

Or using uv:

uv pip install flashtext-i18n

Or install from GitHub:

pip install git+https://github.com/termdock/flashtext-i18n.git

Usage

The API is 100% compatible with the original FlashText:

from flashtext import KeywordProcessor

# Create processor
kp = KeywordProcessor()

# Add keywords
kp.add_keyword('Python')
kp.add_keyword('機器學習', 'Machine Learning')

# Extract keywords
text = 'I love Python and 機器學習'
keywords = kp.extract_keywords(text)
# ['Python', 'Machine Learning']

# Extract with span info
keywords_with_span = kp.extract_keywords(text, span_info=True)
# [('Python', 7, 13), ('Machine Learning', 18, 22)]

# Replace keywords
new_text = kp.replace_keywords(text)
# 'I love Python and Machine Learning'

# Get replacement details (New in v3.1.0)
new_text, replacements = kp.replace_keywords(text, span_info=True)
# replacements = [
#     {'original': 'Python', 'replacement': 'Python', 'start': 7, 'end': 13},
#     {'original': '機器學習', 'replacement': 'Machine Learning', 'start': 18, 'end': 22}
# ]


# Extract sentences with keywords (New in v3.1.0)
sentences = kp.extract_sentences(text)
# [('I love Python and 機器學習', ['Python', 'Machine Learning'])]

# Get keyword count
print(len(kp))
# 2

# One keyword matching multiple Tags (New in v3.1.0)
kp.add_keyword('Apple', ['Fruit', 'Tech'])
keywords = kp.extract_keywords('I have an Apple')
# ['Fruit', 'Tech']

# Mixed Case Support (Case-Sensitive & Case-Insensitive) (New in v3.1.0)
# Default: case_sensitive=False (Global)
kp = KeywordProcessor()

# Add a case-insensitive keyword (matches 'banana', 'Banana', 'BANANA')
kp.add_keyword('banana')

# Add a case-sensitive keyword (matches 'Apple' ONLY)
kp.add_keyword('Apple', case_sensitive=True)

keywords_found = kp.extract_keywords('I like Apple and Banana.')
# ['Apple', 'banana']

keywords_found = kp.extract_keywords('I like apple and BANANA.')
# ['banana'] (Strict 'Apple' does not match 'apple')

Note: For performance, FlashText merges case-insensitive paths in its internal trie. If a case-insensitive keyword overlaps with a case-sensitive one (e.g. a case-insensitive 'us' and a case-sensitive 'US'), the two share the same path, and the keyword added last determines the replacement value for shared matches.
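
The span tuples in the usage example are character offsets into the original text, with an exclusive end index. A short check in plain Python, using the values copied from that example rather than calling the library:

```python
text = "I love Python and 機器學習"

# Span tuples as shown in the usage example: (clean_name, start, end).
spans = [("Python", 7, 13), ("Machine Learning", 18, 22)]

for clean_name, start, end in spans:
    # end is exclusive, so a plain slice recovers the matched source text.
    print(clean_name, "<-", text[start:end])

matched = [text[start:end] for _, start, end in spans]
print(matched)  # ['Python', '機器學習']
```

Note that the span covers the text as it appears in the input (`機器學習`), while the first tuple element is the clean name (`Machine Learning`).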

Fuzzy Matching (Levenshtein Distance)

FlashText supports fuzzy matching to handle typos in input text. Use max_cost to specify the maximum allowable Levenshtein distance.

kp = KeywordProcessor()
kp.add_keyword('Machine Learning')

# Exact match
kp.extract_keywords('I love Machine Learning')
# ['Machine Learning']

# Fuzzy match (max_cost=2) -> Matches "Mchine Larning" (2 deletions)
kp.extract_keywords('I love Mchine Larning', max_cost=2)
# ['Machine Learning']

# Fuzzy match for CJK (New in v3.1.0)
kp.add_keyword('人工智慧')
# Matches "人工智障" (1 substitution)
kp.extract_keywords('這有人工智障功能', max_cost=1)
# ['人工智慧']
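
As a reference for what the two fuzzy examples count, here is a standard dynamic-programming edit distance in plain Python. It is independent of FlashText's internal fuzzy matcher, and the function name is made up for this sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep when equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("Mchine Larning", "Machine Learning"))  # 2
print(levenshtein("人工智障", "人工智慧"))                  # 1
```

Both distances fall within the `max_cost` values used in the examples, which is why those inputs are expected to match.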

Performance

FlashText uses a trie-based algorithm inspired by Aho-Corasick. Its running time is O(n) in the length of the input text and does not grow with the number of keywords. In v3.1.0, we introduced a Trie-based optimization for mixed-case support, which keeps the runtime overhead of case-insensitive matching small.

Benchmark (1000 keywords, 3.7M chars):

| Method | Time |
| --- | --- |
| FlashText (Case-Sensitive) | 0.27s |
| FlashText (Case-Insensitive) | 0.29s |
| Regex (Compiled) | ~2.5s+ |

(Tested on Apple Silicon)
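
A comparison of this kind can be reproduced roughly with a small timing harness. The sketch below is not the authors' benchmark: the data generator, the sizes (much smaller than in the table), and the helper names are assumptions. It times a compiled regex alternation, and times `KeywordProcessor` only if the package is installed.

```python
import random
import re
import string
import time

def make_keywords(n: int, seed: int = 0) -> list[str]:
    """Generate up to n distinct random 8-letter lowercase keywords."""
    rng = random.Random(seed)
    return sorted({"".join(rng.choices(string.ascii_lowercase, k=8)) for _ in range(n)})

def make_text(keywords: list[str], n_tokens: int, seed: int = 1) -> str:
    """Build a space-separated text mixing keywords with a filler token."""
    rng = random.Random(seed)
    return " ".join(rng.choices(keywords + ["filler"], k=n_tokens))

def timed(fn, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

keywords = make_keywords(200)
text = make_text(keywords, 20_000)

# Baseline: one compiled alternation with word boundaries on both sides.
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b")
regex_hits, regex_seconds = timed(pattern.findall, text)
print(f"regex: {len(regex_hits)} hits in {regex_seconds:.3f}s")

try:
    from flashtext import KeywordProcessor  # provided by flashtext-i18n
except ImportError:
    KeywordProcessor = None  # library not installed; skip that measurement

if KeywordProcessor is not None:
    kp = KeywordProcessor(case_sensitive=True)
    kp.add_keywords_from_list(keywords)
    ft_hits, ft_seconds = timed(kp.extract_keywords, text)
    print(f"flashtext: {len(ft_hits)} hits in {ft_seconds:.3f}s")
```

Timings depend heavily on hardware, keyword count, and text size, so absolute numbers from this harness should not be compared directly with the table.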

Roadmap

See Issues for planned fixes:

Unicode case folding span fix (Turkish İ, German ß) (Fixed in v3.0.0)
Keywords followed by numbers extraction (Fixed in v3.0.0)
Internationalized word boundary detection (Fixed in v3.1.0)
Indian languages (Devanagari) support (Fixed in v3.1.0)
Load keywords from JSON/Text file (Fixed in v3.1.0)

Credits

This project is a fork of FlashText created by Vikash Singh.

The original FlashText algorithm is described in the paper: Replace or Retrieve Keywords In Documents at Scale

License

MIT License - see LICENSE file.

The original copyright belongs to Vikash Singh (2017). This fork is maintained by termdock & Huang Chung Yi.

Release history

- 4.0.0a12 (pre-release): Jan 14, 2026
- 4.0.0a11 (pre-release): Jan 14, 2026
- 4.0.0a10 (pre-release): Jan 14, 2026
- 3.1.1 (this version): Jan 13, 2026
- 3.0.0: Jan 13, 2026

Download files

Source Distribution

flashtext_i18n-3.1.1.tar.gz (29.4 kB), uploaded Jan 13, 2026 (Source)

Built Distribution

flashtext_i18n-3.1.1-py3-none-any.whl (15.3 kB), uploaded Jan 13, 2026 (Python 3)

File details

Details for the file flashtext_i18n-3.1.1.tar.gz.

File metadata

Download URL: flashtext_i18n-3.1.1.tar.gz
Upload date: Jan 13, 2026
Size: 29.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for flashtext_i18n-3.1.1.tar.gz
Algorithm	Hash digest
SHA256	`81c03755f72543e6d4b2102279f97c1cbdea7c57358a9a2607285dbc6aaff4e5`
MD5	`cbd00aebf29bbeccad53242cce636476`
BLAKE2b-256	`6ff2d1a2a0702232dbb41debb95150d3a2a3e39ab62c4b95320b4532387476f8`
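
The published SHA256 digest can be checked locally after downloading the source distribution. A minimal sketch using the standard library; the helper names are hypothetical, and the file is assumed to sit in the current directory under its published name:

```python
import hashlib
import hmac

# SHA256 digest listed above for flashtext_i18n-3.1.1.tar.gz.
EXPECTED_SHA256 = "81c03755f72543e6d4b2102279f97c1cbdea7c57358a9a2607285dbc6aaff4e5"

def sha256_of(path: str, chunk_size: int = 1 << 16) -> str:
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_expected(path: str, expected_hex: str = EXPECTED_SHA256) -> bool:
    """Compare the file's digest with the expected hex string."""
    return hmac.compare_digest(sha256_of(path), expected_hex.lower())
```

For example, `matches_expected("flashtext_i18n-3.1.1.tar.gz")` should return `True` for an intact download and `False` otherwise.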

Provenance

The following attestation bundles were made for flashtext_i18n-3.1.1.tar.gz:

Publisher: publish.yml on termdock/flashtext-i18n

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: flashtext_i18n-3.1.1.tar.gz
- Subject digest: 81c03755f72543e6d4b2102279f97c1cbdea7c57358a9a2607285dbc6aaff4e5
- Sigstore transparency entry: 817062303
- Sigstore integration time: Jan 13, 2026
Source repository:
- Permalink: termdock/flashtext-i18n@935cc01dd6d2a8f19d434e1ed5ea7df1499bc5d0
- Branch / Tag: refs/tags/3.1.1
- Owner: https://github.com/termdock
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@935cc01dd6d2a8f19d434e1ed5ea7df1499bc5d0
- Trigger Event: release

File details

Details for the file flashtext_i18n-3.1.1-py3-none-any.whl.

File metadata

Download URL: flashtext_i18n-3.1.1-py3-none-any.whl
Upload date: Jan 13, 2026
Size: 15.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for flashtext_i18n-3.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1993d2ec10cbfd84597e4d09169b6a6194a606d9bfae0f0840aef3b938539a01`
MD5	`3fc6975198496b083757e67d0fbcc371`
BLAKE2b-256	`249783a3531848c3111b60a932fdae8f9c5f6e3e5303ee0c348073054746e12b`

Provenance

The following attestation bundles were made for flashtext_i18n-3.1.1-py3-none-any.whl:

Publisher: publish.yml on termdock/flashtext-i18n

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: flashtext_i18n-3.1.1-py3-none-any.whl
- Subject digest: 1993d2ec10cbfd84597e4d09169b6a6194a606d9bfae0f0840aef3b938539a01
- Sigstore transparency entry: 817062366
- Sigstore integration time: Jan 13, 2026
Source repository:
- Permalink: termdock/flashtext-i18n@935cc01dd6d2a8f19d434e1ed5ea7df1499bc5d0
- Branch / Tag: refs/tags/3.1.1
- Owner: https://github.com/termdock
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@935cc01dd6d2a8f19d434e1ed5ea7df1499bc5d0
- Trigger Event: release
