A package for extracting keywords from large texts very quickly (much faster than regex and the original flashtext package).

pip install flashtext2

flashtext2

flashtext2 is an optimized version of the flashtext library for fast keyword extraction and replacement. It is orders of magnitude faster than regular expressions.

Key Enhancements in flashtext2

  • Rewritten for Better Performance: Completely rewritten in Rust, making it approximately 3-10x faster than the original version.
  • Unicode Standard Annex #29: Instead of relying on an arbitrary regex pattern such as flashtext's [A-Za-z0-9_]+, flashtext2 uses Unicode Standard Annex #29 to split strings into tokens. This ensures compatibility with all languages, not just those written in Latin script.
  • Unicode Case Folding: Instead of converting strings to lowercase for case-insensitive matches, it uses Unicode case folding, ensuring accurate normalization of characters according to the Unicode standard.
  • Fully Type-Hinted API: The entire API is fully type-hinted, providing better code clarity and improved development experience.
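
The tokenization and case-folding points above can be illustrated with Python's standard library alone. This is a minimal sketch that does not call flashtext2; `re` and `str.casefold()` are used here only as stand-ins to show why an ASCII-only pattern and plain lowercasing fall short.

```python
import re

# An ASCII-only pattern like the one quoted above splits non-ASCII words apart
# and drops Greek text entirely.
ascii_tokens = re.findall(r"[A-Za-z0-9_]+", "Maße διάολο")
print(ascii_tokens)  # ['Ma', 'e']

# Lowercasing does not equate 'ß' with 'SS', but Unicode case folding does.
lower_equal = "Maße".lower() == "MASSE".lower()
fold_equal = "Maße".casefold() == "MASSE".casefold()
print(lower_equal)  # False
print(fold_equal)   # True
```

How flashtext2 performs these steps internally is not shown here; the sketch only demonstrates the underlying Unicode behavior.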

Usage

Keyword Extraction

from flashtext2 import KeywordProcessor

kp = KeywordProcessor(case_sensitive=False)

kp.add_keyword('Python')
kp.add_keyword('flashtext')
kp.add_keyword('program')

text = "I love programming in Python and using the flashtext library."

keywords_found = kp.extract_keywords(text)
print(keywords_found)
# Output: ['Python', 'flashtext']

keywords_found = kp.extract_keywords_with_span(text)
print(keywords_found)
# Output: [('Python', 22, 28), ('flashtext', 43, 52)]
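
As a small follow-up, the spans in the output above can be used to slice or annotate the original text. The sketch below hardcodes the spans from that output rather than calling flashtext2, and it assumes the offsets are half-open character indices into the Python string, which holds for this ASCII example.

```python
text = "I love programming in Python and using the flashtext library."

# Spans copied from the example output above: (keyword, start, end).
spans = [("Python", 22, 28), ("flashtext", 43, 52)]

# Assuming half-open offsets, text[start:end] is the matched substring.
matched = [text[start:end] for _, start, end in spans]
print(matched)  # ['Python', 'flashtext']

# Wrap each match in brackets, working right to left so earlier offsets stay valid.
highlighted = text
for _, start, end in reversed(spans):
    highlighted = highlighted[:start] + "[" + highlighted[start:end] + "]" + highlighted[end:]
print(highlighted)
# I love programming in [Python] and using the [flashtext] library.
```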

Keyword Replacement

from flashtext2 import KeywordProcessor

kp = KeywordProcessor(case_sensitive=False)

kp.add_keyword('Java', 'Python')
kp.add_keyword('regex', 'flashtext')

text = "I love programming in Java and using the regex library."
new_text = kp.replace_keywords(text)

print(new_text)
# Output: "I love programming in Python and using the flashtext library."
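
For comparison, here is roughly what the same replacement looks like with regular expressions, the baseline the package is measured against. This is an illustrative sketch using only the standard `re` module; the `replacements` dictionary and `replace_with_regex` helper are hypothetical names, not part of flashtext2.

```python
import re

# Hypothetical mapping from lowercase keyword to its replacement.
replacements = {"java": "Python", "regex": "flashtext"}

# One alternation of all keywords, anchored on word boundaries, case-insensitive.
pattern = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in replacements) + r")\b",
    re.IGNORECASE,
)


def replace_with_regex(text: str) -> str:
    """Replace every keyword occurrence using a single compiled regex."""
    return pattern.sub(lambda match: replacements[match.group(0).lower()], text)


new_text = replace_with_regex("I love programming in Java and using the regex library.")
print(new_text)
# I love programming in Python and using the flashtext library.
```

The alternation grows with the number of keywords, which is one reason a regex baseline tends to slow down on large keyword sets.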

Case Sensitivity

from flashtext2 import KeywordProcessor

text = 'abc aBc ABC'

kp = KeywordProcessor(case_sensitive=True)
kp.add_keyword('aBc')

print(kp.extract_keywords(text))
# Output: ['aBc']

kp = KeywordProcessor(case_sensitive=False)
kp.add_keyword('aBc')

print(kp.extract_keywords(text))
# Output: ['aBc', 'aBc', 'aBc']

Other Examples

Overlapping keywords (returns the longest sequence)

from flashtext2 import KeywordProcessor

kp = KeywordProcessor(case_sensitive=True)
kp.add_keyword('machine')
kp.add_keyword('machine learning')

text = "machine learning is a subset of artificial intelligence"
print(kp.extract_keywords(text))
# Output: ['machine learning']
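
The longest-match behavior can be sketched in pure Python with a trie keyed by tokens. This is not flashtext2's actual implementation (which is written in Rust); `build_trie` and `extract_longest` are hypothetical helpers, and whitespace splitting stands in for real Unicode tokenization.

```python
_END = object()  # sentinel key marking the end of a complete keyword


def build_trie(keywords):
    """Build a nested-dict trie keyed by whitespace-separated tokens."""
    trie = {}
    for keyword in keywords:
        node = trie
        for token in keyword.split():
            node = node.setdefault(token, {})
        node[_END] = keyword
    return trie


def extract_longest(text, trie):
    """Return keywords found in text, preferring the longest match at each position."""
    tokens = text.split()
    found = []
    i = 0
    while i < len(tokens):
        node = trie
        best = None   # longest keyword seen so far that starts at token i
        best_end = i
        j = i
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if _END in node:
                best, best_end = node[_END], j
        if best is not None:
            found.append(best)
            i = best_end  # continue after the matched tokens
        else:
            i += 1
    return found


trie = build_trie(["machine", "machine learning"])
result = extract_longest("machine learning is a subset of artificial intelligence", trie)
print(result)  # ['machine learning']
```

When only the shorter keyword is present, as in "a machine works", the same function falls back to `['machine']`.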

Case folding

from flashtext2 import KeywordProcessor

kp = KeywordProcessor(case_sensitive=False)
kp.add_keywords_from_iter(["flour", "Maße", "ᾲ στο διάολο"])

text = "flour, MASSE, ὰι στο διάολο"
print(kp.extract_keywords(text))
# Output: ['flour', 'Maße', 'ᾲ στο διάολο']

Performance

Extracting keywords is usually 2.5-3x faster than with the original flashtext, and replacing them is about 10x faster.
There is still room to optimize the code and improve performance.
You can find the benchmarks here.

In the benchmarks, words average 6 characters and each sentence has 10k words, so each input is about 60k characters long.
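
The input size above can be reproduced with a short script. This is only a sketch of how such an input might be generated; it is an assumption, not the project's benchmark code, and it uses a fixed word length of 6 rather than an average.

```python
import random
import string

random.seed(0)  # fixed seed so the generated input is reproducible

WORD_LENGTH = 6      # word length taken from the description above
WORD_COUNT = 10_000  # words per sentence taken from the description above

words = [
    "".join(random.choices(string.ascii_lowercase, k=WORD_LENGTH))
    for _ in range(WORD_COUNT)
]
sentence = " ".join(words)

letters = sum(len(word) for word in words)
print(letters)        # 60000
print(len(sentence))  # 69999 (letters plus 9,999 separating spaces)
```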

TODO

  • Add multiple ways of normalizing strings: simple case folding, full case folding, and locale-aware folding
  • Remove all clones in the source code

Credit to Vikash Singh, the author of the original flashtext package.

Download files

Download the file for your platform.

Source Distribution

  • flashtext2-1.1.0.tar.gz (109.5 kB)

Built Distributions

  • flashtext2-1.1.0-cp38-abi3-win_amd64.whl (163.3 kB): CPython 3.8+, Windows x86-64
  • flashtext2-1.1.0-cp38-abi3-win32.whl (153.4 kB): CPython 3.8+, Windows x86
  • flashtext2-1.1.0-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (296.3 kB): CPython 3.8+, manylinux (glibc 2.17+), x86-64
  • flashtext2-1.1.0-cp38-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl (342.2 kB): CPython 3.8+, manylinux (glibc 2.17+), s390x
  • flashtext2-1.1.0-cp38-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl (336.7 kB): CPython 3.8+, manylinux (glibc 2.17+), ppc64le
  • flashtext2-1.1.0-cp38-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl (303.5 kB): CPython 3.8+, manylinux (glibc 2.17+), ARMv7l
  • flashtext2-1.1.0-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (303.7 kB): CPython 3.8+, manylinux (glibc 2.17+), ARM64
  • flashtext2-1.1.0-cp38-abi3-manylinux_2_5_i686.manylinux1_i686.whl (307.3 kB): CPython 3.8+, manylinux (glibc 2.5+), i686
  • flashtext2-1.1.0-cp38-abi3-macosx_11_0_arm64.whl (256.3 kB): CPython 3.8+, macOS 11.0+, ARM64
  • flashtext2-1.1.0-cp38-abi3-macosx_10_12_x86_64.whl (264.5 kB): CPython 3.8+, macOS 10.12+, x86-64
