
A package for extracting keywords from large text very quickly (much faster than regex and the original flashtext package)


pip install flashtext2

flashtext2

flashtext2 is an optimized version of the flashtext library for fast keyword extraction and replacement. It is orders of magnitude faster than regular expressions.

Key Enhancements in flashtext2

  • Rewritten for Better Performance: Completely rewritten in Rust, making it approximately 3-10x faster than the original version.
  • Unicode Standard Annex #29: Instead of relying on a fixed regex pattern such as [A-Za-z0-9_]+, as flashtext does, flashtext2 uses Unicode Standard Annex #29 to split strings into tokens. This makes tokenization work for all languages, not just those written in the Latin script.
  • Unicode Case Folding: Instead of converting strings to lowercase for case-insensitive matches, it uses Unicode case folding, ensuring accurate normalization of characters according to the Unicode standard.
  • Fully Type-Hinted API: The entire API is type-hinted, which makes code clearer and improves the development experience.
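
To illustrate the tokenization point above, here is a minimal sketch using only Python's standard `re` module. It applies the ASCII-only pattern quoted in the list to a few strings; the helper name `ascii_tokens` is hypothetical and is not part of flashtext or flashtext2. It shows only the limitation of the pattern, not how flashtext2 segments text internally.

```python
import re
from typing import List

# The ASCII-only token pattern quoted in the list above.
ASCII_WORD = re.compile(r"[A-Za-z0-9_]+")


def ascii_tokens(text: str) -> List[str]:
    """Split text into tokens using the ASCII-only word pattern."""
    return ASCII_WORD.findall(text)


print(ascii_tokens("machine learning"))  # ['machine', 'learning']
print(ascii_tokens("Maße"))              # ['Ma', 'e']  (ß is not in the pattern)
print(ascii_tokens("διάολο"))            # []           (no ASCII letters at all)
```

With such a pattern, a keyword like "Maße" can never be matched as a single token, which is the motivation for a Unicode-aware word segmentation.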

Usage


Keyword Extraction

from flashtext2 import KeywordProcessor

kp = KeywordProcessor(case_sensitive=False)

kp.add_keyword('Python')
kp.add_keyword('flashtext')
kp.add_keyword('program')

text = "I love programming in Python and using the flashtext library."

keywords_found = kp.extract_keywords(text)
print(keywords_found)
# Output: ['Python', 'flashtext']

keywords_found = kp.extract_keywords_with_span(text)
print(keywords_found)
# Output: [('Python', 22, 28), ('flashtext', 43, 52)]
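
The spans in the output above read as half-open `(start, end)` offsets into the text. The sketch below uses the documented output as plain data (it does not call flashtext2) to slice the text and mark each match. The helper `highlight` is hypothetical, and the sketch assumes the offsets are character indices, which holds for this ASCII example.

```python
from typing import List, Tuple

text = "I love programming in Python and using the flashtext library."
# Spans copied from the documented output above: (keyword, start, end).
spans: List[Tuple[str, int, int]] = [("Python", 22, 28), ("flashtext", 43, 52)]


def highlight(text: str, spans: List[Tuple[str, int, int]]) -> str:
    """Wrap every matched span in square brackets."""
    parts = []
    pos = 0
    for _, start, end in sorted(spans, key=lambda s: s[1]):
        parts.append(text[pos:start])              # text before the match
        parts.append("[" + text[start:end] + "]")  # the match itself
        pos = end
    parts.append(text[pos:])                       # trailing text
    return "".join(parts)


print([text[start:end] for _, start, end in spans])
# ['Python', 'flashtext']
print(highlight(text, spans))
# I love programming in [Python] and using the [flashtext] library.
```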

Keyword Replacement

from flashtext2 import KeywordProcessor

kp = KeywordProcessor(case_sensitive=False)

kp.add_keyword('Java', 'Python')
kp.add_keyword('regex', 'flashtext')

text = "I love programming in Java and using the regex library."
new_text = kp.replace_keywords(text)

print(new_text)
# Output: "I love programming in Python and using the flashtext library."

Case Sensitivity

from flashtext2 import KeywordProcessor

text = 'abc aBc ABC'

kp = KeywordProcessor(case_sensitive=True)
kp.add_keyword('aBc')

print(kp.extract_keywords(text))
# Output: ['aBc']

kp = KeywordProcessor(case_sensitive=False)
kp.add_keyword('aBc')

print(kp.extract_keywords(text))
# Output: ['aBc', 'aBc', 'aBc']

Other Examples

Overlapping keywords (returns the longest sequence)

from flashtext2 import KeywordProcessor

kp = KeywordProcessor(case_sensitive=True)
kp.add_keyword('machine')
kp.add_keyword('machine learning')

text = "machine learning is a subset of artificial intelligence"
print(kp.extract_keywords(text))
# Output: ['machine learning']
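
The longest-match behaviour above can be sketched with a small token trie. This is an illustrative pure-Python version, not flashtext2's actual Rust implementation: the names `build_trie` and `extract_longest` are hypothetical, and tokens are split on whitespace rather than with Unicode Standard Annex #29.

```python
from typing import Any, Dict, List

_END = object()  # sentinel key: a keyword ends at this trie node


def build_trie(keywords: List[str]) -> Dict[Any, Any]:
    """Build a trie keyed by tokens; each terminal node stores its keyword."""
    root: Dict[Any, Any] = {}
    for keyword in keywords:
        node = root
        for token in keyword.split():
            node = node.setdefault(token, {})
        node[_END] = keyword
    return root


def extract_longest(text: str, trie: Dict[Any, Any]) -> List[str]:
    """Scan left to right, keeping the longest keyword found at each position."""
    tokens = text.split()
    found: List[str] = []
    i = 0
    while i < len(tokens):
        node, best, best_end, j = trie, None, i, i
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if _END in node:  # remember the longest complete keyword so far
                best, best_end = node[_END], j
        if best is None:
            i += 1            # no keyword starts here
        else:
            found.append(best)
            i = best_end      # resume after the matched keyword
    return found


trie = build_trie(["machine", "machine learning"])
print(extract_longest("machine learning is a subset of artificial intelligence", trie))
# ['machine learning']
print(extract_longest("machine shop", trie))
# ['machine']
```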

Case folding

from flashtext2 import KeywordProcessor

kp = KeywordProcessor(case_sensitive=False)
kp.add_keywords_from_iter(["flour", "Maße", "ᾲ στο διάολο"])

text = "flour, MASSE, ὰι στο διάολο"
print(kp.extract_keywords(text))
# Output: ['flour', 'Maße', 'ᾲ στο διάολο']
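
To see why case folding differs from lowercasing, the sketch below compares the two using Python's built-in `str.lower()` and `str.casefold()`. It only illustrates the general idea with the standard library; whether flashtext2's folding agrees with `str.casefold()` in every case is an assumption, and the helper names are hypothetical.

```python
def equal_lowercased(a: str, b: str) -> bool:
    """Compare two strings after simple lowercasing."""
    return a.lower() == b.lower()


def equal_casefolded(a: str, b: str) -> bool:
    """Compare two strings after Unicode case folding."""
    return a.casefold() == b.casefold()


print("Maße".lower(), "MASSE".lower())    # maße masse
print(equal_lowercased("Maße", "MASSE"))  # False
print(equal_casefolded("Maße", "MASSE"))  # True  (ß folds to "ss")
```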

Performance


Extracting keywords is usually 2.5-3x faster than with the original flashtext, and replacing them is about 10x faster.
There is still room to optimize the code and improve performance.
You can find the benchmarks here.


In the benchmarks, words average 6 characters and each sentence contains 10k words, so each input is about 60k characters long.
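
A benchmark input of that shape could be generated roughly as follows. This is a sketch under stated assumptions, not the project's benchmark code: it uses fixed-length 6-letter random words, and the names `make_text` and `time_call` are hypothetical. Note that the 60k figure here counts letters only; with separating spaces the string is slightly longer.

```python
import random
import string
import time
from typing import Callable, List, Tuple


def make_text(n_words: int = 10_000, word_len: int = 6, seed: int = 0) -> Tuple[List[str], str]:
    """Return random lowercase words and the space-joined text built from them."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    words = [
        "".join(rng.choices(string.ascii_lowercase, k=word_len))
        for _ in range(n_words)
    ]
    return words, " ".join(words)


def time_call(fn: Callable[[str], object], text: str, repeats: int = 5) -> float:
    """Return the best wall-clock time in seconds over several runs of fn(text)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(text)
        best = min(best, time.perf_counter() - start)
    return best


words, text = make_text()
print(len(words), sum(len(w) for w in words), len(text))
# 10000 60000 69999
```

With flashtext2 installed, something like `time_call(kp.extract_keywords, text)` could then be compared against the same call on another implementation.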

TODO

  • Add multiple ways of normalizing strings: simple case folding, full case folding, and locale-aware folding
  • Remove all clones in the source code

Credit to Vikash Singh, the author of the original flashtext package.
