
Extract/Replace keywords in sentences. Fork with internationalization fixes for CJK and Unicode.


FlashText i18n

The Modern, High-Performance FlashText (Rust Core)


A high-performance keyword extraction and replacement library, powered by Rust.

This is a complete modernization of the original FlashText algorithm. While it started as a fork to fix internationalization (i18n) bugs, it has evolved into a full-featured, high-performance engine rewritten in Rust.

In the benchmarks below it runs roughly 2.6x to 3.6x faster than the v3.0 Python core. It also adds Unicode-aware word boundaries and new features (fuzzy matching, mixed-case support) while maintaining API compatibility.


Why use this instead of the original?

The original flashtext library has been unmaintained for years and suffers from fundamental issues:

  • Incorrect Word Boundaries: Fails on non-ASCII characters (e.g., CJK adjacent words, German ß, French é).
  • Performance Limits: Pure Python implementation hits a bottleneck with large keyword sets.
  • No Fuzzy Matching: Cannot handle typos or minor variations.

FlashText i18n (v4.0) solves all of these:

  1. Rust Core (Blazing Fast): The heavy lifting is done in Rust, offering the same O(N) complexity (N = length of the input text) but with up to ~3.6x raw throughput in the benchmarks below and constant memory scaling.
  2. True Unicode Support: We use Rust's robust unicode-segmentation to correctly identify word boundaries in any language (Chinese, Japanese, Korean, Thai, Hindi, etc.).
  3. Expanded Features:
    • Fuzzy Matching: Levenshtein distance support for extracting slightly misspelled keywords.
    • Mixed Case Mode: Support simultaneous case-sensitive and case-insensitive keywords.
    • Rich Metadata: Extract detailed span (start/end) and replacement information.
  4. Drop-in Replacement: You can switch from flashtext by changing one line of code (or just the install command).
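To make the trie idea behind the list above concrete, here is a minimal pure-Python sketch of longest-match keyword extraction. It is not the library's Rust implementation: `build_trie`, `extract`, and the `END` sentinel are hypothetical names, and the sketch deliberately ignores word boundaries and case handling. Because each lookup follows a single path through the trie, the cost of matching depends on the text and on keyword length, not on how many keywords are stored.

```python
END = object()  # sentinel key marking "a keyword ends here" (hypothetical name)


def build_trie(keywords):
    """Build a nested-dict trie from a {keyword: clean_name} mapping."""
    root = {}
    for keyword, clean_name in keywords.items():
        node = root
        for ch in keyword:
            node = node.setdefault(ch, {})
        node[END] = clean_name
    return root


def extract(trie, text):
    """Scan left to right and return the clean names of the longest matches.

    Simplification: no word-boundary or case handling is attempted.
    """
    found = []
    i, n = 0, len(text)
    while i < n:
        node, j, best = trie, i, None
        # Walk the trie as far as the text allows, remembering the last full keyword.
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            if END in node:
                best = (node[END], j)
        if best is None:
            i += 1          # no keyword starts here
        else:
            found.append(best[0])
            i = best[1]     # resume after the matched keyword
    return found


trie = build_trie({"New York": "NYC", "New": "new", "Python": "Python"})
result = extract(trie, "I moved to New York for Python")
print(result)
```

The longest candidate wins, so "New York" is reported rather than its prefix "New".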

Version History

v4.0.0 (The Rust Era) - Alpha

  • Rust Integration: Core logic rewritten in Rust for speed and safety.
  • New Features: Fuzzy matching, JSON file loading, sentence extraction.
  • Universal Wheels: Pre-compiled binaries for Windows, macOS (Intel/Apple Silicon), and Linux (glibc/musl), including aarch64 builds.
  • Compatibility: 100% Drop-in replacement for Python API.

New Features

  • International Word Boundaries: Unicode-aware boundary detection.
  • Load Keywords from File: Support for JSON/Text files.
  • Mixed Case Support: Case-sensitive and Case-insensitive coexistence.
  • Fuzzy Matching: Optional Levenshtein support.
  • New APIs: Extract sentences, replacement metadata.

v3.0.0 (Python Core) - Released

  • Unicode case folding: Correct spans for Turkish İ and German ß
  • Numbers: Keywords followed by numbers are now extracted correctly
  • CJK Support: Adjacent keywords (Chinese/Japanese) now extracted correctly
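The case-folding item above is easier to see with a small example. Lowercasing Turkish İ (U+0130) yields two code points, so offsets found in a lowercased copy of the text no longer line up with the original. The sketch below is only an illustration of that pitfall and of one way to map offsets back; `lower_with_origins` and `find_span` are hypothetical helpers, not part of the library.

```python
def lower_with_origins(text):
    """Lowercase text and record, for each output character, the index of the
    input character it came from."""
    pieces, origins = [], []
    for index, ch in enumerate(text):
        lowered = ch.lower()          # may be longer than one character
        pieces.append(lowered)
        origins.extend([index] * len(lowered))
    return "".join(pieces), origins


def find_span(text, keyword):
    """Case-insensitive search that reports (start, end) in the ORIGINAL text."""
    lowered, origins = lower_with_origins(text)
    needle, _ = lower_with_origins(keyword)
    if not needle:
        return None
    pos = lowered.find(needle)
    if pos == -1:
        return None
    start = origins[pos]
    end = origins[pos + len(needle) - 1] + 1
    return start, end


text = "\u0130stanbul trip"                 # "İstanbul trip"
naive_start = text.lower().find("trip")     # offset in the lowercased copy
span = find_span(text, "trip")              # offset mapped back to the original
print(naive_start, span, text[span[0]:span[1]])
```

The naive offset is shifted by one because the lowercased string is one character longer than the original.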

Feature Highlights

International Word Boundaries (v4.0)

The original FlashText only supported ASCII characters (A-Za-z0-9_) as word parts. This caused issues for many languages where characters like é, ß, or ç were treated as delimiters, breaking words apart.

v4.0 Fix: All valid Unicode alphanumeric characters are now treated as part of a word by default.

# Hindi (Devanagari)
kp.add_keyword('नमस्ते')
kp.extract_keywords('नमस्ते दुनिया') 
# ✅ ['नमस्ते'] (Previously failed)

# French/German
kp.add_keyword('café')
kp.extract_keywords('I went to a café.') 
# ✅ ['café'] (Previously extracted 'caf')
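The difference between the two boundary rules can be sketched in a few lines of plain Python. This is an illustration only: the library relies on Rust's unicode-segmentation, whereas the predicates below (`ascii_is_word_char`, `unicode_is_word_char`) are hypothetical stand-ins built from `str.isalnum()` and Unicode general categories.

```python
import string
import unicodedata

ASCII_WORD_CHARS = set(string.ascii_letters + string.digits + "_")


def ascii_is_word_char(ch):
    """ASCII-only rule: anything outside A-Za-z0-9_ ends the word."""
    return ch in ASCII_WORD_CHARS


def unicode_is_word_char(ch):
    """Unicode-aware rule: letters, digits, underscore, and combining marks."""
    return ch.isalnum() or ch == "_" or unicodedata.category(ch).startswith("M")


def split_words(text, is_word_char):
    """Group consecutive word characters into words."""
    words, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


sentence = "I went to a caf\u00e9."       # "I went to a café."
ascii_words = split_words(sentence, ascii_is_word_char)
unicode_words = split_words(sentence, unicode_is_word_char)
hindi_words = split_words("नमस्ते दुनिया", unicode_is_word_char)
print(ascii_words)
print(unicode_words)
print(hindi_words)
```

Combining marks are included because Devanagari vowel signs and the virama are not alphanumeric on their own; without them the Hindi words would be split apart.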

CJK Adjacent Keywords (v3.0)

from flashtext import KeywordProcessor

kp = KeywordProcessor()
kp.add_keyword('雅詩蘭黛')  # Estée Lauder
kp.add_keyword('小棕瓶')    # Advanced Night Repair

text = '推薦雅詩蘭黛小棕瓶超好用'
result = kp.extract_keywords(text)
# Original FlashText: ['雅詩蘭黛']  ❌ Missing '小棕瓶'
# FlashText i18n:     ['雅詩蘭黛', '小棕瓶']  ✅ Both extracted!

Loading Keywords from File (v4.0)

You can now load keywords directly from JSON or text files.

# keywords.json
# {
#    "Color": ["red", "blue", "green"],
#    "Vehicle": ["car", "bike"]
# }

kp.add_keywords_from_file('keywords.json')
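To make the file format concrete, the sketch below reads a JSON file of the shape shown above using only the standard library. It assumes each key is the clean name returned on a match and each list holds the keywords that map to it, which mirrors the dictionary convention of the original FlashText; `load_keyword_map` is a hypothetical helper, not the library's loader.

```python
import json
import os
import tempfile


def load_keyword_map(path):
    """Read {"CleanName": ["kw1", "kw2"]} and return {keyword: clean_name}."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    mapping = {}
    for clean_name, keywords in data.items():
        for keyword in keywords:
            mapping[keyword] = clean_name
    return mapping


# Write a throwaway keywords.json so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "keywords.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"Color": ["red", "blue", "green"], "Vehicle": ["car", "bike"]}, fh)
    keyword_map = load_keyword_map(path)

print(keyword_map)
```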

Installation

pip install flashtext-i18n

Note: This package provides a drop-in replacement module named flashtext. Please uninstall the original flashtext package first to avoid conflicts.

pip uninstall -y flashtext
pip uninstall -y flashtext-i18n # optional cleanup
pip install -U flashtext-i18n

Or using uv:

uv pip install flashtext-i18n

Or install from GitHub:

pip install git+https://github.com/termdock/flashtext-i18n.git
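Because both distributions install a module named flashtext, it can be useful to check which one is present in the current environment. The following is a small sketch using the standard-library `importlib.metadata`; the helper names and the returned status strings are illustrative choices, not part of the package.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def flashtext_install_status():
    """Summarize which flashtext distribution(s) are installed."""
    original = installed_version("flashtext")
    fork = installed_version("flashtext-i18n")
    if original and fork:
        return "conflict"      # both provide the same module name
    if fork:
        return "fork"
    if original:
        return "original"
    return "none"


status = flashtext_install_status()
print(status)
```

A result of "conflict" suggests running the uninstall commands above before reinstalling.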

Usage

The API is 100% compatible with the original FlashText:

from flashtext import KeywordProcessor

# Create processor
kp = KeywordProcessor()

# Add keywords
kp.add_keyword('Python')
kp.add_keyword('機器學習', 'Machine Learning')

# Extract keywords
text = 'I love Python and 機器學習'
keywords = kp.extract_keywords(text)
# ['Python', 'Machine Learning']

# Extract with span info
keywords_with_span = kp.extract_keywords(text, span_info=True)
# [('Python', 7, 13), ('Machine Learning', 18, 22)]

# Replace keywords
new_text = kp.replace_keywords(text)
# 'I love Python and Machine Learning'

# Get replacement details (New in v4.0)
new_text, replacements = kp.replace_keywords(text, span_info=True)
# replacements = [
#     {'original': 'Python', 'replacement': 'Python', 'start': 7, 'end': 13},
#     {'original': '機器學習', 'replacement': 'Machine Learning', 'start': 18, 'end': 22}
# ]


# Extract sentences with keywords (New in v4.0)
sentences = kp.extract_sentences(text)
# [('I love Python and 機器學習', ['Python', 'Machine Learning'])]

# Get keyword count
print(len(kp))
# 2

# One keyword matching multiple Tags (New in v4.0)
kp.add_keyword('Apple', ['Fruit', 'Tech'])
keywords = kp.extract_keywords('I have an Apple')
# ['Fruit', 'Tech']

# Mixed Case Support (Case-Sensitive & Case-Insensitive) (New in v4.0)
# Default: case_sensitive=False (Global)
kp = KeywordProcessor()

# Add a case-insensitive keyword (matches 'banana', 'Banana', 'BANANA')
kp.add_keyword('banana')

# Add a case-sensitive keyword (matches 'Apple' ONLY)
kp.add_keyword('Apple', case_sensitive=True)

keywords_found = kp.extract_keywords('I like Apple and Banana.')
# ['Apple', 'banana']

keywords_found = kp.extract_keywords('I like apple and BANANA.')
# ['banana'] (Strict 'Apple' does not match 'apple')

Note: Shared Trie Path Tradeoff. If you add Apple (case-sensitive) and apple (case-insensitive), they share the path a-p-p-l-e, and the last definition wins. Recommendation: add case-sensitive keywords after case-insensitive ones if strict separation is needed.
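The "last definition wins" behaviour can be modelled with a toy dictionary keyed by the lowercased keyword. This is only a simplified model of the tradeoff described in the note, under the assumption that both definitions occupy one shared slot; `ToyMixedCase` is a hypothetical class and says nothing about how the Rust trie is actually laid out.

```python
class ToyMixedCase:
    """Toy model: one slot per lowercased keyword, so later definitions overwrite."""

    def __init__(self):
        self._entries = {}  # lowercased keyword -> (keyword as added, strict flag)

    def add_keyword(self, keyword, case_sensitive=False):
        self._entries[keyword.lower()] = (keyword, case_sensitive)

    def match(self, token):
        entry = self._entries.get(token.lower())
        if entry is None:
            return None
        keyword, strict = entry
        if strict and token != keyword:
            return None      # strict entries require an exact-case match
        return keyword


# Case-sensitive definition added last: only the exact spelling matches.
strict_last = ToyMixedCase()
strict_last.add_keyword("apple")
strict_last.add_keyword("Apple", case_sensitive=True)

# Case-insensitive definition added last: any casing matches.
loose_last = ToyMixedCase()
loose_last.add_keyword("Apple", case_sensitive=True)
loose_last.add_keyword("apple")

print(strict_last.match("Apple"), strict_last.match("apple"))
print(loose_last.match("Apple"), loose_last.match("APPLE"))
```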

Fuzzy Matching (Levenshtein Distance)

FlashText supports fuzzy matching to handle typos.

Warning: Fuzzy matching introduces additional Levenshtein distance calculation overhead, making it significantly slower than exact matching. Use only when necessary.

Use max_cost to specify the maximum allowable Levenshtein distance.
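For reference, Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The function below is a textbook dynamic-programming sketch for choosing a sensible max_cost; it is not the library's internal implementation, and `levenshtein` is a name introduced here for illustration.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b (two-row DP)."""
    if len(a) < len(b):
        a, b = b, a                      # keep the inner row as short as possible
    previous = list(range(len(b) + 1))   # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if ca == cb else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


english = levenshtein("Machine Learning", "Mchine Larning")  # two deletions
cjk = levenshtein("人工智慧", "人工智障")                      # one substitution
print(english, cjk)
```

Distances are counted per character, so a single substituted CJK character costs 1, the same as a single substituted Latin letter.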

kp = KeywordProcessor()
kp.add_keyword('Machine Learning')

# Exact match
kp.extract_keywords('I love Machine Learning')
# ['Machine Learning']

# Fuzzy match (max_cost=2) -> Matches "Mchine Larning" (2 deletions)
kp.extract_keywords('I love Mchine Larning', max_cost=2)
# ['Machine Learning']

# Fuzzy match for CJK (New in v4.0)
kp.add_keyword('人工智慧')
# Matches "人工智障" (1 substitution)
kp.extract_keywords('這有人工智障功能', max_cost=1)
# ['人工智慧']

Performance (v4.0 Rust Core)

Comparison of FlashText 4.0 (Rust), FlashText 3.0 (Python), and Regex (compiled).

Benchmark Methodology

  • Corpus: 10,000 lines (Short sentences, simulated natural language).
  • Terms: 1,000 to 100,000 unique keywords.
  • Metric: Median Match Time (Seconds) over 10 iterations (Warmup enabled).
  • Environment: Apple Silicon (M1/M2/M3), Python 3.11.
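The methodology above (warmup, then the median over 10 iterations) can be reproduced with a small timing harness. This is a generic sketch, not the project's benchmark script; `median_time` and the toy workload are assumptions made for illustration.

```python
import statistics
import time


def median_time(fn, *, iterations=10, warmup=1):
    """Run fn() `warmup` times untimed, then return the median of `iterations` timed runs."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


calls = []


def workload():
    """Stand-in for a keyword-extraction pass over the corpus."""
    calls.append(1)
    return sum(range(10_000))


elapsed = median_time(workload)
print(f"median: {elapsed:.6f}s over {len(calls) - 1} timed runs")
```

Using the median rather than the mean keeps a single slow run (for example, one interrupted by the OS scheduler) from skewing the result.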

Results: Keyword Extraction Time (Lower is Better)

| Keywords | Rust (v4.0) | Python (v3.0) | Regex | Speedup (vs Python) | Speedup (vs Regex) |
|---|---|---|---|---|---|
| 1,000 | 0.012s | 0.043s | 0.92s | 3.6x | 76x |
| 5,000 | 0.013s | 0.042s | 4.80s | 3.2x | 369x |
| 20,000 | 0.018s | 0.046s | 19.16s | 2.6x | 1064x |
| 100,000 | 0.021s | 0.056s | N/A | 2.7x | N/A |

Note: Rust match latency remains nearly constant as keyword count scales from 1k to 100k (on this corpus). Regex performance degrades sharply as the number of alternations grows, making it unsuitable for large keyword sets. Rust reduces per-character overhead and memory allocations, resulting in a consistent 2.6x to 3.6x speedup over the Python implementation.
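For context on why the regex baseline degrades, a common way to search for many keywords with `re` is one large alternation, which the engine may have to try branch by branch at each position. The sketch below shows that construction; whether the project's benchmark built its pattern exactly this way is an assumption, and `build_keyword_regex` is a hypothetical helper.

```python
import re


def build_keyword_regex(keywords):
    """Compile a single alternation pattern matching any keyword as a whole word."""
    # Longest first, so a longer keyword is preferred over its own prefix.
    ordered = sorted(keywords, key=len, reverse=True)
    alternation = "|".join(re.escape(keyword) for keyword in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b")


pattern = build_keyword_regex(["car", "bike", "red", "red car"])
matches = pattern.findall("a red car and a bike, not a cart")
print(matches)
```

With thousands of alternatives the pattern itself becomes very large, which is consistent with the sharp slowdown reported in the table.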

Figure 1: Match time, Rust vs. Regex (Rust is about 1000x faster at 20,000 keywords).

Figure 2: Match time, Rust vs. Python (Rust is ~3x faster and scales better).

Build Time (Index Construction)

| Keywords | Rust (v4.0) | Python (v3.0) |
|---|---|---|
| 100,000 | 0.08s | 0.17s |

Rust constructs the keyword trie index 2x faster than Python. (Build time measured on the same machine, release build, 10 iterations)

Roadmap

See Issues for planned fixes:

  • Unicode case folding span fix (Turkish İ, German ß) (Fixed in v3.0.0)
  • Keywords followed by numbers extraction (Fixed in v3.0.0)
  • Internationalized word boundary detection (Fixed in v4.0)
  • Indian languages (Devanagari) support (Fixed in v4.0)
  • Load keywords from JSON/Text file (Fixed in v4.0)

Credits

This project is a fork of FlashText created by Vikash Singh.

The original FlashText algorithm is described in the paper: Replace or Retrieve Keywords In Documents at Scale

License

MIT License - see LICENSE file.

The original copyright belongs to Vikash Singh (2017). This fork is maintained by termdock & Huang Chung Yi.
