Extract/Replace keywords in sentences. Fork with internationalization fixes for CJK and Unicode.

FlashText i18n

A maintained fork of FlashText with internationalization and Unicode fixes.

Why This Fork?

The original FlashText is no longer actively maintained and has several bugs with international text:

  • CJK languages (Chinese, Japanese, Korean): adjacent keywords are not all extracted
  • Unicode case folding: wrong span positions for characters such as Turkish İ
  • Non-ASCII boundaries: various edge cases around international characters

This fork aims to fix these issues while maintaining full API compatibility.
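
To see why case folding can shift span positions, the standard-library sketch below lowercases a string containing Turkish İ and then reuses an offset from the lowered copy on the original. It does not use FlashText, and the sample sentence is made up for illustration. It only shows the general pitfall of computing offsets on a case-converted string.

```python
# Illustration only: plain Python, no FlashText involved.
text = "İstanbul is large"
lowered = text.lower()

# Lowercasing U+0130 (İ) yields 'i' followed by U+0307 (combining dot above),
# so the lowered string is one code point longer than the original.
print(len(text), len(lowered))  # 17 18

# An offset found in the lowered string no longer lines up with the original.
start = lowered.find("large")
end = start + len("large")
print(repr(text[start:end]))  # 'arge'
```

Any matcher that searches a lowercased copy but reports offsets against the original text has to account for this length change.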

Fixed in v3.0.0

CJK Adjacent Keywords

from flashtext import KeywordProcessor

kp = KeywordProcessor()
kp.add_keyword('雅詩蘭黛')  # Estée Lauder
kp.add_keyword('小棕瓶')    # Advanced Night Repair

text = '推薦雅詩蘭黛小棕瓶超好用'
result = kp.extract_keywords(text)
# Original FlashText: ['雅詩蘭黛']  ❌ Missing '小棕瓶'
# FlashText i18n:     ['雅詩蘭黛', '小棕瓶']  ✅ Both extracted!

Installation

pip install flashtext-i18n

Or install from GitHub:

pip install git+https://github.com/termdock/flashtext-i18n.git

Usage

The API is fully compatible with the original FlashText, and the import name is still flashtext:

from flashtext import KeywordProcessor

# Create processor
kp = KeywordProcessor()

# Add keywords
kp.add_keyword('Python')
kp.add_keyword('機器學習', 'Machine Learning')

# Extract keywords
text = 'I love Python and 機器學習'
keywords = kp.extract_keywords(text)
# ['Python', 'Machine Learning']

# Extract with span info
keywords_with_span = kp.extract_keywords(text, span_info=True)
# [('Python', 7, 13), ('Machine Learning', 18, 22)]

# Replace keywords
new_text = kp.replace_keywords(text)
# 'I love Python and Machine Learning'
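
Going by the example output above, the start and end values returned with span_info=True are offsets into the original text, so slicing the text with them recovers the matched surface form. The sketch below hard-codes the spans from that example instead of calling the library, so it runs without FlashText installed.

```python
# Spans copied from the span_info example above (not produced by this snippet).
text = 'I love Python and 機器學習'
keywords_with_span = [('Python', 7, 13), ('Machine Learning', 18, 22)]

# Slice the original text with each (start, end) pair.
surface_forms = [text[start:end] for _, start, end in keywords_with_span]
print(surface_forms)  # ['Python', '機器學習']
```

This is handy when a keyword maps to a different clean name, as 機器學習 does here, and the original wording is still needed.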

Performance

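FlashText uses a trie-based, single-pass matching algorithm inspired by Aho-Corasick. Matching time grows linearly with the length of the text, O(n), and is largely independent of the number of keywords, which makes it fast for keyword extraction from large texts.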
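
The single-pass idea can be sketched with a character trie. This is a simplified illustration, not the library's implementation: the names build_trie, extract_longest and _END are hypothetical, and the sketch ignores word boundaries and case-insensitive matching.

```python
_END = "_end_"  # hypothetical sentinel key; cannot collide with single-character keys


def build_trie(keywords):
    """Build a nested-dict trie with one level per character."""
    trie = {}
    for keyword in keywords:
        node = trie
        for ch in keyword:
            node = node.setdefault(ch, {})
        node[_END] = keyword
    return trie


def extract_longest(trie, text):
    """Scan left to right, emitting the longest keyword starting at each position."""
    found = []
    i = 0
    while i < len(text):
        node, j = trie, i
        match, match_end = None, i
        # Walk the trie as far as the text allows, remembering the last full keyword.
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if _END in node:
                match, match_end = node[_END], j
        if match is not None:
            found.append((match, i, match_end))
            i = match_end  # resume right after the match
        else:
            i += 1
    return found


trie = build_trie(['雅詩蘭黛', '小棕瓶'])
print(extract_longest(trie, '推薦雅詩蘭黛小棕瓶超好用'))
# [('雅詩蘭黛', 2, 6), ('小棕瓶', 6, 9)]
```

The cost of each step depends on the characters walked in the trie, not on how many keywords were added. In the worst case this sketch re-reads up to one keyword length of text per position, so it is only an approximation of a strict single pass.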

Benchmark                  FlashText   Regex
1000 keywords, 1M chars    ~0.1s       ~10s+

Roadmap

See Issues for planned fixes:

  • Span fix for Unicode case folding (Turkish İ, German ß)
  • Extraction of keywords that are followed by numbers
  • Internationalized word boundary detection
  • Support for Indian languages (Devanagari)

Credits

This project is a fork of FlashText created by Vikash Singh.

The original FlashText algorithm is described in the paper "Replace or Retrieve Keywords In Documents at Scale".

License

MIT License - see LICENSE file.

The original copyright belongs to Vikash Singh (2017). This fork is maintained by termdock & Huang Chung Yi.
