Extract/Replace keywords in sentences. Fork with internationalization fixes for CJK and Unicode.

FlashText i18n

The Modern, High-Performance FlashText (Rust Core)

A high-performance keyword extraction and replacement library, powered by Rust.

This is a complete modernization of the original FlashText algorithm. While it started as a fork to fix internationalization (i18n) bugs, it has evolved into a full-featured, high-performance engine rewritten in Rust.

It is roughly 3x faster than the pure-Python v3.0 core (2.6x to 3.6x in the benchmarks below), handles Unicode word boundaries correctly, and adds new features (Fuzzy Matching, Mixed-Case support) while maintaining API compatibility.

Why use this instead of the original?

The original flashtext library has been unmaintained for years and suffers from fundamental issues:

  • Incorrect Word Boundaries: Fails on non-ASCII characters (e.g., CJK adjacent words, German ß, French é).
  • Performance Limits: Pure Python implementation hits a bottleneck with large keyword sets.
  • No Fuzzy Matching: Cannot handle typos or minor variations.

FlashText i18n (v4.0) solves all of these:

  1. Rust Core (Blazing Fast): The heavy lifting is done in Rust, offering identical O(N) complexity but with ~4x raw throughput and constant memory scaling.
  2. True Unicode Support: We use Rust's robust unicode-segmentation to correctly identify word boundaries in any language (Chinese, Japanese, Korean, Thai, Hindi, etc.).
  3. Expanded Features:
    • Fuzzy Matching: Levenshtein distance support for extracting slightly misspelled keywords.
    • Mixed Case Mode: Support simultaneous case-sensitive and case-insensitive keywords.
    • Rich Metadata: Extract detailed span (start/end) and replacement information.
  4. Drop-in Replacement: You can switch from flashtext by changing one line of code (or just the install command).

Version History

v4.0.0 (The Rust Era) - Alpha

  • Rust Integration: Core logic rewritten in Rust for speed and safety.
  • New Features: Fuzzy matching, JSON file loading, sentence extraction.
  • Universal Wheels: Pre-compiled binaries for Windows, macOS (Intel/Silicon), Linux (gnu/musl) and Aarch64.
  • Compatibility: 100% Drop-in replacement for Python API.

New Features

  • International Word Boundaries: Unicode-aware boundary detection.
  • Load Keywords from File: Support for JSON/Text files.
  • Mixed Case Support: Case-sensitive and Case-insensitive coexistence.
  • Fuzzy Matching: Optional Levenshtein support.
  • New APIs: Extract sentences, replacement metadata.

v3.0.0 (Python Core) - Released

  • Unicode case folding: Correct spans for Turkish İ and German ß
  • Numbers: Keywords followed by numbers are now extracted correctly
  • CJK Support: Adjacent keywords (Chinese/Japanese) now extracted correctly
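Why case folding can shift spans: lowercasing may change the length of a string (in Python, 'İ'.lower() is two code points), so an offset found in the lowered text does not always index the original text. The sketch below illustrates one way to map offsets back. It is not the library's actual implementation, and fold_with_offsets and find_span are hypothetical helper names.

```python
def fold_with_offsets(text):
    """Lowercase text and record, for each output char, its source index."""
    folded_chars, origin = [], []
    for index, ch in enumerate(text):
        for folded in ch.lower():  # one input char may yield several chars
            folded_chars.append(folded)
            origin.append(index)
    return "".join(folded_chars), origin


def find_span(text, keyword):
    """Case-insensitive search that returns a span valid in the ORIGINAL text."""
    folded_text, origin = fold_with_offsets(text)
    folded_keyword, _ = fold_with_offsets(keyword)
    if not folded_keyword:
        return None
    pos = folded_text.find(folded_keyword)
    if pos == -1:
        return None
    start = origin[pos]
    end = origin[pos + len(folded_keyword) - 1] + 1
    return start, end


text = "İstanbul is big"
naive_start = text.lower().find("big")  # offset in the lowered string
span = find_span(text, "big")           # offset in the original string
```

Here naive_start is one position too far to the right because 'İ' expands when lowercased, while span indexes the original text correctly.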

Feature Highlights

International Word Boundaries (v4.0)

The original FlashText only supported ASCII characters (A-Za-z0-9_) as word parts. This caused issues for many languages where characters like é, ß, or ç were treated as delimiters, breaking words apart.

v4.0 Fix: All valid Unicode alphanumeric characters are now treated as part of a word by default.

from flashtext import KeywordProcessor

kp = KeywordProcessor()

# Hindi (Devanagari)
kp.add_keyword('नमस्ते')
kp.extract_keywords('नमस्ते दुनिया')
# ✅ ['नमस्ते'] (Previously failed)

# French/German
kp.add_keyword('café')
kp.extract_keywords('I went to a café.')
# ✅ ['café'] (Previously extracted 'caf')
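To see why an ASCII-only word-character set breaks these words apart, the following sketch compares two splitters written with the standard library only. It is an illustration under stated assumptions: the library itself relies on Rust's unicode-segmentation, and the Unicode-category rule used here (letters, numbers, combining marks, underscore) is a simplified stand-in. All function names are hypothetical.

```python
import unicodedata

ASCII_WORD_CHARS = set(
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_"
)


def is_word_char_ascii(ch):
    """Word-character rule of the original FlashText: A-Za-z0-9_ only."""
    return ch in ASCII_WORD_CHARS


def is_word_char_unicode(ch):
    """Simplified Unicode rule: letters (L*), numbers (N*), marks (M*), or '_'."""
    return ch == "_" or unicodedata.category(ch)[0] in "LNM"


def split_words(text, is_word_char):
    """Split text into maximal runs of word characters."""
    words, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


sentence = "I went to a caf\u00e9."  # 'café' with a precomposed é
ascii_words = split_words(sentence, is_word_char_ascii)
unicode_words = split_words(sentence, is_word_char_unicode)
hindi_words = split_words("नमस्ते दुनिया", is_word_char_unicode)
```

Combining marks are included in the Unicode rule because Devanagari words contain vowel signs and viramas that are not classified as letters.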

CJK Adjacent Keywords (v3.0)

from flashtext import KeywordProcessor

kp = KeywordProcessor()
kp.add_keyword('雅詩蘭黛')  # Estée Lauder
kp.add_keyword('小棕瓶')    # Advanced Night Repair

text = '推薦雅詩蘭黛小棕瓶超好用'
result = kp.extract_keywords(text)
# Original FlashText: ['雅詩蘭黛']  ❌ Missing '小棕瓶'
# FlashText i18n:     ['雅詩蘭黛', '小棕瓶']  ✅ Both extracted!
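The idea behind adjacent-keyword extraction can be sketched with a small trie scanner that does not require a delimiter between matches. This is a toy model for illustration, not the library's code; build_trie, extract_adjacent, and the "_end_" marker are hypothetical names.

```python
END = "_end_"  # multi-character key, so it cannot collide with a single char


def build_trie(keywords):
    """Build a nested-dict trie; terminal nodes store the full keyword."""
    root = {}
    for keyword in keywords:
        node = root
        for ch in keyword:
            node = node.setdefault(ch, {})
        node[END] = keyword
    return root


def extract_adjacent(text, trie):
    """Scan left to right, taking the longest keyword at each position."""
    found, i = [], 0
    while i < len(text):
        node, j, last = trie, i, None
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if END in node:
                last = (node[END], j)  # remember the longest match so far
        if last:
            found.append(last[0])
            i = last[1]  # resume right after the match, no delimiter needed
        else:
            i += 1
    return found


trie = build_trie(["雅詩蘭黛", "小棕瓶"])
result = extract_adjacent("推薦雅詩蘭黛小棕瓶超好用", trie)
```

Because scanning resumes immediately after a match, the second keyword is found even though no space separates it from the first.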

Loading Keywords from File (v4.0)

You can now load keywords directly from JSON or text files.

# keywords.json
# {
#    "Color": ["red", "blue", "green"],
#    "Vehicle": ["car", "bike"]
# }

kp.add_keywords_from_file('keywords.json')
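The JSON layout above reads naturally as a mapping from a clean name to the surface forms that should resolve to it. The sketch below flattens such a file into a keyword-to-clean-name lookup using only the standard library. That interpretation of the format is an assumption based on the example, and load_keyword_map is a hypothetical helper, not part of the library.

```python
import json
import os
import tempfile


def load_keyword_map(path):
    """Read {clean_name: [surface forms]} and return {surface form: clean_name}."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    mapping = {}
    for clean_name, surface_forms in data.items():
        for form in surface_forms:
            mapping[form] = clean_name
    return mapping


# Write the example file to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "keywords.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(
            {"Color": ["red", "blue", "green"], "Vehicle": ["car", "bike"]},
            handle,
        )
    keyword_map = load_keyword_map(path)
```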

Installation

pip install flashtext-i18n

Note: This package provides a drop-in replacement module named flashtext. Please uninstall the original flashtext package first to avoid conflicts.

pip uninstall -y flashtext
pip uninstall -y flashtext-i18n # optional cleanup
pip install -U flashtext-i18n

Or using uv:

uv pip install flashtext-i18n

Or install from GitHub:

pip install git+https://github.com/termdock/flashtext-i18n.git
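To check for the conflict described in the note above, one option is to ask the standard library which of the two distributions are installed. This is a small diagnostic sketch; installed_version and conflicting_install are hypothetical helper names.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def conflicting_install(original="flashtext", fork="flashtext-i18n"):
    """True when both distributions are installed side by side."""
    return (
        installed_version(original) is not None
        and installed_version(fork) is not None
    )


status = {name: installed_version(name) for name in ("flashtext", "flashtext-i18n")}
```

If conflicting_install() returns True, both packages provide a module named flashtext, and uninstalling the original first is the safer path.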

Usage

The API is 100% compatible with the original FlashText:

from flashtext import KeywordProcessor

# Create processor
kp = KeywordProcessor()

# Add keywords
kp.add_keyword('Python')
kp.add_keyword('機器學習', 'Machine Learning')

# Extract keywords
text = 'I love Python and 機器學習'
keywords = kp.extract_keywords(text)
# ['Python', 'Machine Learning']

# Extract with span info
keywords_with_span = kp.extract_keywords(text, span_info=True)
# [('Python', 7, 13), ('Machine Learning', 18, 22)]

# Replace keywords
new_text = kp.replace_keywords(text)
# 'I love Python and Machine Learning'

# Get replacement details (New in v4.0)
new_text, replacements = kp.replace_keywords(text, span_info=True)
# replacements = [
#     {'original': 'Python', 'replacement': 'Python', 'start': 7, 'end': 13},
#     {'original': '機器學習', 'replacement': 'Machine Learning', 'start': 18, 'end': 22}
# ]


# Extract sentences with keywords (New in v4.0)
sentences = kp.extract_sentences(text)
# [('I love Python and 機器學習', ['Python', 'Machine Learning'])]

# Get keyword count
print(len(kp))
# 2

# One keyword matching multiple Tags (New in v4.0)
kp.add_keyword('Apple', ['Fruit', 'Tech'])
keywords = kp.extract_keywords('I have an Apple')
# ['Fruit', 'Tech']

# Mixed Case Support (Case-Sensitive & Case-Insensitive) (New in v4.0)
# Default: case_sensitive=False (Global)
kp = KeywordProcessor()

# Add a case-insensitive keyword (matches 'banana', 'Banana', 'BANANA')
kp.add_keyword('banana')

# Add a case-sensitive keyword (matches 'Apple' ONLY)
kp.add_keyword('Apple', case_sensitive=True)

keywords_found = kp.extract_keywords('I like Apple and Banana.')
# ['Apple', 'banana']

keywords_found = kp.extract_keywords('I like apple and BANANA.')
# ['banana'] (Strict 'Apple' does not match 'apple')

Note: Shared Trie Path Tradeoff. If you add Apple (case-sensitive) and apple (case-insensitive), they share the trie path a-p-p-l-e, and the last definition wins. Recommendation: add case-sensitive keywords after case-insensitive ones if strict separation is needed.
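The "last definition wins" behaviour in the note can be modeled with a toy lookup table keyed by the case-folded keyword. This is only an illustration of the tradeoff under that assumption; the real trie layout is not documented here, and MixedCaseTable is a hypothetical class.

```python
class MixedCaseTable:
    """Toy model: one slot per case-folded keyword, so later adds overwrite."""

    def __init__(self):
        self._entries = {}  # folded keyword -> (original spelling, case_sensitive)

    def add_keyword(self, keyword, case_sensitive=False):
        self._entries[keyword.casefold()] = (keyword, case_sensitive)

    def match(self, token):
        entry = self._entries.get(token.casefold())
        if entry is None:
            return None
        keyword, case_sensitive = entry
        if case_sensitive and token != keyword:
            return None  # strict entries require an exact spelling
        return keyword


# Recommended order: insensitive first, then the strict definition.
strict_last = MixedCaseTable()
strict_last.add_keyword("apple")
strict_last.add_keyword("Apple", case_sensitive=True)

# Reverse order: the insensitive definition overwrites the strict one.
loose_last = MixedCaseTable()
loose_last.add_keyword("Apple", case_sensitive=True)
loose_last.add_keyword("apple")
```

With strict_last, only the exact spelling 'Apple' matches; with loose_last, any casing matches because the later case-insensitive entry replaced the strict one.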

Fuzzy Matching (Levenshtein Distance)

FlashText i18n supports fuzzy matching to handle typos.

Warning: Fuzzy matching adds Levenshtein distance calculations, making it significantly slower than exact matching. Use it only when necessary.

Use max_cost to specify the maximum allowable Levenshtein distance.
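For reference, Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below is the textbook dynamic-programming version, shown only to make the max_cost threshold concrete; it is not the library's Rust implementation, and within_cost is a hypothetical helper.

```python
def levenshtein(a, b):
    """Edit distance between a and b using two rolling rows."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def within_cost(keyword, candidate, max_cost):
    """True when candidate is at most max_cost edits away from keyword."""
    return levenshtein(keyword, candidate) <= max_cost


english_distance = levenshtein("Machine Learning", "Mchine Larning")
cjk_distance = levenshtein("人工智慧", "人工智障")
```

The distance works on code points, so a single substituted CJK character costs 1, the same as a single substituted Latin letter.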

kp = KeywordProcessor()
kp.add_keyword('Machine Learning')

# Exact match
kp.extract_keywords('I love Machine Learning')
# ['Machine Learning']

# Fuzzy match (max_cost=2) -> Matches "Mchine Larning" (2 deletions)
kp.extract_keywords('I love Mchine Larning', max_cost=2)
# ['Machine Learning']

# Fuzzy match for CJK (New in v4.0)
kp.add_keyword('人工智慧')
# Matches "人工智障" (1 substitution)
kp.extract_keywords('這有人工智障功能', max_cost=1)
# ['人工智慧']

Performance (v4.0 Rust Core)

Comparison of FlashText 4.0 (Rust), FlashText 3.0 (Python), and Regex (compiled).

Benchmark Methodology

  • Corpus: 10,000 lines (Short sentences, simulated natural language).
  • Terms: 1,000 to 100,000 unique keywords.
  • Metric: Median Match Time (Seconds) over 10 iterations (Warmup enabled).
  • Environment: Apple Silicon (M1/M2/M3), Python 3.11.
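The metric described above (median over 10 iterations after a warmup) can be reproduced with a small timing harness. This is a generic sketch rather than the project's benchmark script; median_runtime and its parameters are hypothetical.

```python
import statistics
import time


def median_runtime(func, iterations=10, warmup=1):
    """Run func warmup times untimed, then return the median of timed runs."""
    for _ in range(warmup):
        func()
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Example workload; replace with a call such as kp.extract_keywords(corpus_line).
calls = []
elapsed = median_runtime(lambda: calls.append(sum(range(1000))))
```

The median is less sensitive than the mean to a single slow run caused by unrelated system activity.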

Results: Keyword Extraction Time (Lower is Better)

| Keywords | Rust (v4.0) | Python (v3.0) | Regex | Speedup (vs Python) | Speedup (vs Regex) |
|---|---|---|---|---|---|
| 1,000 | 0.012s | 0.043s | 0.92s | 3.6x | 76x |
| 5,000 | 0.013s | 0.042s | 4.80s | 3.2x | 369x |
| 20,000 | 0.018s | 0.046s | 19.16s | 2.6x | 1064x |
| 100,000 | 0.021s | 0.056s | N/A | 2.7x | N/A |

Note: Rust match latency remains nearly constant as keyword count scales from 1k to 100k (on this corpus). Regex performance degrades sharply as the number of alternations grows, making it unsuitable for large keyword sets. Rust reduces per-character overhead and memory allocations, resulting in a consistent 2.6x to 3.6x speedup over the Python implementation.
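For context on the Regex column, a common way to match many keywords with the standard re module is a single compiled alternation. The source does not say how its regex baseline was built, so the sketch below is an assumption about what such a baseline typically looks like; compile_alternation is a hypothetical helper.

```python
import re


def compile_alternation(keywords):
    """Compile one pattern that matches any keyword as a whole word."""
    # Longer keywords first, so the alternation prefers the longest option.
    ordered = sorted(keywords, key=len, reverse=True)
    body = "|".join(re.escape(keyword) for keyword in ordered)
    return re.compile(r"\b(?:" + body + r")\b")


pattern = compile_alternation(["Python", "Java", "JavaScript"])
matches = pattern.findall("I love Python, Java and JavaScript")
```

Each keyword adds another branch to the alternation, which is why match time grows with the keyword count, unlike a trie lookup.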

Figure 1: Match time compared with Regex (Rust is about 1000x faster).

Figure 2: Match time, Rust vs Python (Rust is ~3x faster and scales better).

Build Time (Index Construction)

| Keywords | Rust (v4.0) | Python (v3.0) |
|---|---|---|
| 100,000 | 0.08s | 0.17s |

Rust constructs the keyword trie index 2x faster than Python. (Build time measured on the same machine, release build, 10 iterations)

Roadmap

See Issues for planned fixes:

  • Unicode case folding span fix (Turkish İ, German ß) (Fixed in v3.0.0)
  • Keywords followed by numbers extraction (Fixed in v3.0.0)
  • Internationalized word boundary detection (Fixed in v4.0)
  • Indian languages (Devanagari) support (Fixed in v4.0)
  • Load keywords from JSON/Text file (Fixed in v4.0)

Credits

This project is a fork of FlashText created by Vikash Singh.

The original FlashText algorithm is described in the paper: Replace or Retrieve Keywords In Documents at Scale

License

MIT License - see LICENSE file.

The original copyright belongs to Vikash Singh (2017). This fork is maintained by termdock & Huang Chung Yi.
