Extract/Replace keywords in sentences. Fork with internationalization fixes for CJK and Unicode.

FlashText i18n

The Modern, High-Performance FlashText (Rust Core)

A high-performance keyword extraction and replacement library, powered by Rust.

This is a complete modernization of the original FlashText algorithm. While it started as a fork to fix internationalization (i18n) bugs, it has evolved into a full-featured, high-performance engine rewritten in Rust.

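It offers roughly 3x faster keyword extraction than the v3.0 Python implementation (2.6x to 3.6x in the benchmarks in the Performance section), Unicode-aware word boundary handling, and new features (Fuzzy Matching, Mixed-Case support) while maintaining API compatibility with the original flashtext.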

Why use this instead of the original?

The original flashtext library has been unmaintained for years and suffers from fundamental issues:

  • Incorrect Word Boundaries: Fails on non-ASCII characters (e.g., CJK adjacent words, German ß, French é).
  • Performance Limits: Pure Python implementation hits a bottleneck with large keyword sets.
  • No Fuzzy Matching: Cannot handle typos or minor variations.

FlashText i18n (v4.0) solves all of these:

  1. Rust Core (Blazing Fast): The heavy lifting is done in Rust, offering identical O(N) complexity but with ~4x raw throughput and constant memory scaling.
  2. True Unicode Support: We use Rust's robust unicode-segmentation to correctly identify word boundaries in any language (Chinese, Japanese, Korean, Thai, Hindi, etc.).
  3. Expanded Features:
    • Fuzzy Matching: Levenshtein distance support for extracting slightly misspelled keywords.
    • Mixed Case Mode: Support simultaneous case-sensitive and case-insensitive keywords.
    • Rich Metadata: Extract detailed span (start/end) and replacement information.
  4. Drop-in Replacement: You can switch from flashtext by changing one line of code (or just the install command).
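
To make the O(N) claim in item 1 concrete, here is a minimal pure-Python sketch of trie-based keyword extraction. It is an illustration under stated assumptions, not the library's Rust implementation: the names build_trie and extract are hypothetical, word-boundary checks are left out, and the work per text position is bounded by the longest keyword rather than by the number of keywords.

```python
_END = object()  # sentinel key that marks the end of a stored keyword

def build_trie(keywords):
    """Build a nested-dict trie from {keyword: clean_name}."""
    root = {}
    for keyword, clean_name in keywords.items():
        node = root
        for ch in keyword:
            node = node.setdefault(ch, {})
        node[_END] = clean_name
    return root

def extract(text, root):
    """Scan the text once, taking the longest keyword at each position."""
    found = []
    i = 0
    while i < len(text):
        node, j, best = root, i, None
        # Walk the trie as far as the text allows, remembering the last full keyword.
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if _END in node:
                best = (node[_END], j)
        if best is not None:
            found.append(best[0])
            i = best[1]   # resume right after the match
        else:
            i += 1
    return found

trie = build_trie({"Python": "Python", "機器學習": "Machine Learning"})
print(extract("I love Python and 機器學習", trie))
# ['Python', 'Machine Learning']
```

Because each lookup is a dictionary access per character, adding more keywords grows the trie but does not add passes over the text, which is the property the benchmark section measures.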

Version History

v4.0.0 (The Rust Era) - Alpha

  • Rust Integration: Core logic rewritten in Rust for speed and safety.
  • New Features: Fuzzy matching, JSON file loading, sentence extraction.
  • Universal Wheels: Pre-compiled binaries for Windows, macOS (Intel/Silicon), Linux (gnu/musl) and Aarch64.
  • Compatibility: 100% Drop-in replacement for Python API.

New Features

  • International Word Boundaries: Unicode-aware boundary detection.
  • Load Keywords from File: Support for JSON/Text files.
  • Mixed Case Support: Case-sensitive and Case-insensitive coexistence.
  • Fuzzy Matching: Optional Levenshtein support.
  • New APIs: Extract sentences, replacement metadata.

v3.0.0 (Python Core) - Released

  • Unicode case folding: Correct spans for Turkish İ and German ß
  • Numbers: Keywords followed by numbers are now extracted correctly
  • CJK Support: Adjacent keywords (Chinese/Japanese) now extracted correctly
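
The span problem behind the first item can be reproduced with the standard library alone: lower-casing Turkish İ yields two code points, so offsets found in the lower-cased text no longer line up with the original. The sketch below shows the mismatch and one way to map offsets back; lower_with_offsets and find_span are hypothetical helper names, and this is not necessarily how v3.0 implements the fix.

```python
def lower_with_offsets(text):
    """Lower-case `text` and record, for each output character,
    the index of the input character it came from."""
    lowered, origin = [], []
    for index, ch in enumerate(text):
        for out in ch.lower():   # one input character may yield several
            lowered.append(out)
            origin.append(index)
    origin.append(len(text))     # sentinel so end offsets can be mapped too
    return "".join(lowered), origin

def find_span(text, keyword):
    """Case-insensitive search that returns offsets into the ORIGINAL text."""
    lowered, origin = lower_with_offsets(text)
    needle = keyword.lower()
    pos = lowered.find(needle)
    if pos == -1:
        return None
    return origin[pos], origin[pos + len(needle)]

text = "\u0130stanbul is big"        # starts with Turkish capital dotted I
naive = text.lower().find("big")     # offset in the lower-cased string
start, end = find_span(text, "big")

print(len(text), len(text.lower()))  # 15 16
print(text[naive:naive + 3])         # ig   (shifted by one)
print(text[start:end])               # big
```

This per-character lowering ignores context-sensitive rules such as Greek final sigma, so it is only a demonstration of the offset bookkeeping. German ß raises the same issue under full case folding, since "ß".casefold() is "ss".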

Feature Highlights

International Word Boundaries (v4.0)

The original FlashText only supported ASCII characters (A-Za-z0-9_) as word parts. This caused issues for many languages where characters like é, ß, or ç were treated as delimiters, breaking words apart.

v4.0 Fix: All valid Unicode alphanumeric characters are now treated as part of a word by default.

from flashtext import KeywordProcessor

kp = KeywordProcessor()

# Hindi (Devanagari)
kp.add_keyword('नमस्ते')
kp.extract_keywords('नमस्ते दुनिया')
# ✅ ['नमस्ते'] (Previously failed)

# French/German
kp.add_keyword('café')
kp.extract_keywords('I went to a café.')
# ✅ ['café'] (Previously extracted 'caf')
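
To see why an ASCII-only definition of a word character truncates café to caf, the sketch below compares two boundary predicates in plain Python. It is an illustration only and not the library's implementation, which relies on Rust's unicode-segmentation; the helper names is_word_char_ascii, is_word_char_unicode, and leading_word are hypothetical.

```python
import string

# The word characters the original FlashText recognised: A-Z, a-z, 0-9 and "_".
ASCII_WORD_CHARS = set(string.ascii_letters + string.digits + "_")

def is_word_char_ascii(ch):
    return ch in ASCII_WORD_CHARS

def is_word_char_unicode(ch):
    # str.isalnum() is Unicode-aware: it accepts letters and digits from any script.
    return ch.isalnum() or ch == "_"

def leading_word(text, is_word_char):
    """Return the run of word characters at the start of `text`."""
    end = 0
    while end < len(text) and is_word_char(text[end]):
        end += 1
    return text[:end]

print(leading_word("caf\u00e9.", is_word_char_ascii))    # caf
print(leading_word("caf\u00e9.", is_word_char_unicode))  # café
```

Note that str.isalnum() on its own is not enough for scripts that depend on combining marks (Devanagari vowel signs, for instance, are not classified as letters), which is one reason a full segmentation library is the safer choice.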

CJK Adjacent Keywords (v3.0)

from flashtext import KeywordProcessor

kp = KeywordProcessor()
kp.add_keyword('雅詩蘭黛')  # Estée Lauder
kp.add_keyword('小棕瓶')    # Advanced Night Repair

text = '推薦雅詩蘭黛小棕瓶超好用'
result = kp.extract_keywords(text)
# Original FlashText: ['雅詩蘭黛']  ❌ Missing '小棕瓶'
# FlashText i18n:     ['雅詩蘭黛', '小棕瓶']  ✅ Both extracted!
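
The behaviour above follows from treating CJK ideographs as characters that never need a delimiter between words. The brute-force sketch below shows that rule in isolation. It is an assumption-laden illustration, not the library's code: it checks only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), it scans keyword by keyword instead of using a trie, and the names is_cjk, needs_delimiter, and extract_keywords are hypothetical.

```python
def is_cjk(ch):
    # Assumption: only the basic CJK Unified Ideographs block is checked here.
    return "\u4e00" <= ch <= "\u9fff"

def needs_delimiter(ch):
    """True for characters that must be separated from a neighbouring word."""
    return (ch.isalnum() or ch == "_") and not is_cjk(ch)

def extract_keywords(text, keywords):
    found = []
    ordered = sorted(keywords, key=len, reverse=True)  # prefer the longest match
    i = 0
    while i < len(text):
        for kw in ordered:
            if not text.startswith(kw, i):
                continue
            end = i + len(kw)
            before = text[i - 1] if i > 0 else ""
            after = text[end] if end < len(text) else ""
            # Reject a match only when a word character touches a word character.
            glued_left = before != "" and needs_delimiter(before) and needs_delimiter(kw[0])
            glued_right = after != "" and needs_delimiter(after) and needs_delimiter(kw[-1])
            if not glued_left and not glued_right:
                found.append(kw)
                i = end
                break
        else:
            i += 1  # no keyword starts here
    return found

print(extract_keywords("推薦雅詩蘭黛小棕瓶超好用", ["雅詩蘭黛", "小棕瓶"]))
# ['雅詩蘭黛', '小棕瓶']
print(extract_keywords("pythonic python", ["python"]))
# ['python']
```

The second call shows that Latin-script keywords still require a boundary: the python inside pythonic is skipped, while the standalone word is matched.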

Loading Keywords from File (v4.0)

You can now load keywords directly from JSON or text files.

# keywords.json
# {
#    "Color": ["red", "blue", "green"],
#    "Vehicle": ["car", "bike"]
# }

kp.add_keywords_from_file('keywords.json')
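
The JSON layout shown above maps a name to a list of keywords. The stdlib-only sketch below shows one plausible reading of that layout, in which each key is the clean name returned for any keyword in its list (the convention the original FlashText uses for add_keywords_from_dict). Whether add_keywords_from_file interprets the file exactly this way is an assumption, and load_keyword_map is a hypothetical helper, not part of the library.

```python
import json
import tempfile
from pathlib import Path

def load_keyword_map(path):
    """Read {"CleanName": ["kw1", "kw2"]} and return {keyword: clean_name}."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    mapping = {}
    for clean_name, keywords in data.items():
        for keyword in keywords:
            mapping[keyword] = clean_name
    return mapping

# Write the example file to a temporary directory, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "keywords.json"
    payload = {"Color": ["red", "blue", "green"], "Vehicle": ["car", "bike"]}
    path.write_text(json.dumps(payload), encoding="utf-8")
    keyword_map = load_keyword_map(path)

print(keyword_map["blue"])  # Color
print(keyword_map["bike"])  # Vehicle
```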

Installation

pip install flashtext-i18n

Note: This package provides a drop-in replacement module named flashtext. Please uninstall the original flashtext package first to avoid conflicts.

pip uninstall -y flashtext
pip uninstall -y flashtext-i18n # optional cleanup
pip install -U flashtext-i18n

Or using uv:

uv pip install flashtext-i18n

Or install from GitHub:

pip install git+https://github.com/termdock/flashtext-i18n.git

Usage

The API is 100% compatible with the original FlashText:

from flashtext import KeywordProcessor

# Create processor
kp = KeywordProcessor()

# Add keywords
kp.add_keyword('Python')
kp.add_keyword('機器學習', 'Machine Learning')

# Extract keywords
text = 'I love Python and 機器學習'
keywords = kp.extract_keywords(text)
# ['Python', 'Machine Learning']

# Extract with span info
keywords_with_span = kp.extract_keywords(text, span_info=True)
# [('Python', 7, 13), ('Machine Learning', 18, 22)]

# Replace keywords
new_text = kp.replace_keywords(text)
# 'I love Python and Machine Learning'

# Get replacement details (New in v4.0)
new_text, replacements = kp.replace_keywords(text, span_info=True)
# replacements = [
#     {'original': 'Python', 'replacement': 'Python', 'start': 7, 'end': 13},
#     {'original': '機器學習', 'replacement': 'Machine Learning', 'start': 18, 'end': 22}
# ]


# Extract sentences with keywords (New in v4.0)
sentences = kp.extract_sentences(text)
# [('I love Python and 機器學習', ['Python', 'Machine Learning'])]

# Get keyword count
print(len(kp))
# 2

# One keyword matching multiple Tags (New in v4.0)
kp.add_keyword('Apple', ['Fruit', 'Tech'])
keywords = kp.extract_keywords('I have an Apple')
# ['Fruit', 'Tech']

# Mixed Case Support (Case-Sensitive & Case-Insensitive) (New in v4.0)
# Default: case_sensitive=False (Global)
kp = KeywordProcessor()

# Add a case-insensitive keyword (matches 'banana', 'Banana', 'BANANA')
kp.add_keyword('banana')

# Add a case-sensitive keyword (matches 'Apple' ONLY)
kp.add_keyword('Apple', case_sensitive=True)

keywords_found = kp.extract_keywords('I like Apple and Banana.')
# ['Apple', 'banana']

keywords_found = kp.extract_keywords('I like apple and BANANA.')
# ['banana'] (Strict 'Apple' does not match 'apple')
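
In the example output above, the spans behave as character offsets into the input string, so they can be used directly for slicing. The sketch below uses only the standard library and the replacement records copied from that output; apply_replacements is a hypothetical helper showing how such records could be used to rebuild the replaced text.

```python
text = "I love Python and 機器學習"

# Replacement records copied from the example output above.
replacements = [
    {"original": "Python", "replacement": "Python", "start": 7, "end": 13},
    {"original": "機器學習", "replacement": "Machine Learning", "start": 18, "end": 22},
]

# Plain slicing with start/end recovers each matched substring.
for rep in replacements:
    assert text[rep["start"]:rep["end"]] == rep["original"]

def apply_replacements(text, replacements):
    """Rebuild the replaced string from span records, working right to left
    so that earlier offsets stay valid while later parts change length."""
    out = text
    for rep in sorted(replacements, key=lambda r: r["start"], reverse=True):
        out = out[:rep["start"]] + rep["replacement"] + out[rep["end"]:]
    return out

print(apply_replacements(text, replacements))
# I love Python and Machine Learning
```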

Note: Shared Trie Path Tradeoff. If you add Apple (case-sensitive) and apple (case-insensitive), they share the trie path a-p-p-l-e, and the last definition wins. Recommendation: Add case-sensitive keywords after case-insensitive ones if strict separation is needed.
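
The "last definition wins" behaviour can be pictured with a toy model that keeps one entry per lower-cased keyword. This is only a simplified mental model under that assumption; TinyMixedCaseIndex is a hypothetical class and does not reflect the library's actual trie layout.

```python
class TinyMixedCaseIndex:
    """Toy model: one entry per lower-cased keyword, so the last add wins."""

    def __init__(self):
        # lowered keyword -> (keyword as added, case_sensitive flag)
        self._entries = {}

    def add_keyword(self, keyword, case_sensitive=False):
        self._entries[keyword.lower()] = (keyword, case_sensitive)

    def match(self, token):
        entry = self._entries.get(token.lower())
        if entry is None:
            return None
        keyword, case_sensitive = entry
        if case_sensitive and token != keyword:
            return None  # strict entry: exact casing required
        return keyword

# Case-sensitive definition added last: only the exact casing matches.
strict_last = TinyMixedCaseIndex()
strict_last.add_keyword("apple")
strict_last.add_keyword("Apple", case_sensitive=True)
print(strict_last.match("Apple"))  # Apple
print(strict_last.match("apple"))  # None

# Case-insensitive definition added last: it replaces the strict one.
loose_last = TinyMixedCaseIndex()
loose_last.add_keyword("Apple", case_sensitive=True)
loose_last.add_keyword("apple")
print(loose_last.match("APPLE"))   # apple
```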

Fuzzy Matching (Levenshtein Distance)

FlashText supports fuzzy matching to handle typos.

Warning: Fuzzy matching introduces additional Levenshtein distance calculation overhead, making it significantly slower than exact matching. Use only when necessary.

Use max_cost to specify the maximum allowable Levenshtein distance.
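
For reference, the Levenshtein distance that max_cost bounds can be computed with the classic dynamic-programming recurrence. The function below is a plain-Python sketch for illustration (the name levenshtein is hypothetical, and this is not the library's Rust implementation); it confirms that the misspellings used in this section's examples fall within the stated costs.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))           # distances from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        curr = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,                 # delete ca
                curr[j - 1] + 1,             # insert cb
                prev[j - 1] + cost,          # substitute (or keep) ca
            ))
        prev = curr
    return prev[-1]

print(levenshtein("Machine Learning", "Mchine Larning"))  # 2
print(levenshtein("人工智慧", "人工智障"))                    # 1
```

The distance is counted per character, so a single substituted CJK ideograph costs 1, the same as a single substituted Latin letter.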

```python
kp = KeywordProcessor()
kp.add_keyword('Machine Learning')

# Exact match
kp.extract_keywords('I love Machine Learning')
# ['Machine Learning']

# Fuzzy match (max_cost=2) -> Matches "Mchine Larning" (2 deletions)
kp.extract_keywords('I love Mchine Larning', max_cost=2)
# ['Machine Learning']

# Fuzzy match for CJK (New in v4.0)
kp.add_keyword('人工智慧')
# Matches "人工智障" (1 substitution)
kp.extract_keywords('這有人工智障功能', max_cost=1)
# ['人工智慧']
```

Performance (v4.0 Rust Core)

Comparison of FlashText 4.0 (Rust), FlashText 3.0 (Python), and Regex (compiled).

Benchmark Methodology

  • Corpus: 10,000 lines (Short sentences, simulated natural language).
  • Terms: 1,000 to 100,000 unique keywords.
  • Metric: Median Match Time (Seconds) over 10 iterations (Warmup enabled).
  • Environment: Apple Silicon (M1/M2/M3), Python 3.11.
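
The benchmark script itself is not reproduced here, so the following is only a guess at the shape of such a harness: warm up, time a fixed number of iterations, and report the median. It uses the standard library only, with a compiled regex alternation standing in as the thing being measured; median_time, build_regex, and the tiny corpus are all hypothetical.

```python
import re
import statistics
import time

def median_time(fn, iterations=10, warmup=1):
    """Median wall-clock time of `fn()` in seconds, after warmup runs."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def build_regex(keywords):
    """Compile one alternation of escaped keywords, longest first."""
    ordered = sorted(keywords, key=len, reverse=True)
    alternation = "|".join(re.escape(k) for k in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b")

corpus = ["I love python and rust"] * 1000   # stand-in for the 10,000-line corpus
pattern = build_regex(["python", "rust", "go"])

def run_regex():
    return [pattern.findall(line) for line in corpus]

elapsed = median_time(run_regex)
print(f"median match time: {elapsed:.4f}s")
```

Using the median rather than the mean keeps a single slow iteration (for example, one hit by a garbage-collection pause) from skewing the result.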

Results: Keyword Extraction Time (Lower is Better)

| Keywords | Rust (v4.0) | Python (v3.0) | Regex | Speedup (vs Python) | Speedup (vs Regex) |
|---|---|---|---|---|---|
| 1,000 | 0.012s | 0.043s | 0.92s | 3.6x | 76x |
| 5,000 | 0.013s | 0.042s | 4.80s | 3.2x | 369x |
| 20,000 | 0.018s | 0.046s | 19.16s | 2.6x | 1064x |
| 100,000 | 0.021s | 0.056s | N/A | 2.7x | N/A |

Note: On this corpus, Rust match time grows only from 0.012s to 0.021s as the keyword count scales from 1k to 100k. Regex performance degrades sharply as the number of alternations grows, making it unsuitable for large keyword sets. Rust reduces per-character overhead and memory allocations, resulting in a consistent 2.6x to 3.6x speedup over the Python implementation.
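
The speedup columns are simply the baseline time divided by the Rust time. The short check below recomputes them from the rounded timings in the results table; the regex ratios are truncated to whole numbers because that reproduces the published figures (an observation about the table, not a documented rule).

```python
# (rust_s, python_s, regex_s) per keyword count, copied from the results table.
TIMINGS = {
    1_000: (0.012, 0.043, 0.92),
    5_000: (0.013, 0.042, 4.80),
    20_000: (0.018, 0.046, 19.16),
    100_000: (0.021, 0.056, None),   # regex was not measured at this size
}

def speedup(baseline_s, rust_s):
    """How many times faster the Rust core is than the baseline."""
    return baseline_s / rust_s

for count, (rust_s, python_s, regex_s) in TIMINGS.items():
    vs_python = f"{speedup(python_s, rust_s):.1f}x"
    vs_regex = "N/A" if regex_s is None else f"{int(speedup(regex_s, rust_s))}x"
    print(f"{count:>7,} keywords: {vs_python} vs Python, {vs_regex} vs Regex")
```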

Figure 1: Match time, Rust vs Regex (Rust is up to ~1000x faster).

Figure 2: Match time, Rust vs Python (Rust is ~3x faster and scales better).

Build Time (Index Construction)

| Keywords | Rust (v4.0) | Python (v3.0) |
|---|---|---|
| 100,000 | 0.08s | 0.17s |

Rust constructs the keyword trie index 2x faster than Python. (Build time measured on the same machine, release build, 10 iterations)

Roadmap

See Issues for planned fixes:

  • Unicode case folding span fix (Turkish İ, German ß) (Fixed in v3.0.0)
  • Keywords followed by numbers extraction (Fixed in v3.0.0)
  • Internationalized word boundary detection (Fixed in v4.0)
  • Indian languages (Devanagari) support (Fixed in v4.0)
  • Load keywords from JSON/Text file (Fixed in v4.0)

Credits

This project is a fork of FlashText created by Vikash Singh.

The original FlashText algorithm is described in the paper: Replace or Retrieve Keywords In Documents at Scale

License

MIT License - see LICENSE file.

The original copyright belongs to Vikash Singh (2017). This fork is maintained by termdock & Huang Chung Yi.
