A package for extracting keywords from large texts very quickly (much faster than regex and the original flashtext package).
pip install flashtext2
flashtext2
flashtext2 is an optimized version of the flashtext library for fast keyword extraction and replacement.
It is orders of magnitude faster than regular expressions.
Key Enhancements in flashtext2
- Rewritten for Better Performance: Completely rewritten in Rust, making it approximately 3-10x faster than the original version.
- Unicode Standard Annex #29: Instead of relying on an arbitrary regex pattern such as [A-Za-z0-9_]+, as flashtext does, flashtext2 uses Unicode Standard Annex #29 to split strings into tokens. This ensures compatibility with all languages, not just Latin-based ones.
- Unicode Case Folding: Instead of converting strings to lowercase for case-insensitive matches, it uses Unicode case folding, ensuring accurate normalization of characters according to the Unicode standard.
- Fully Type-Hinted API: The entire API is fully type-hinted, providing better code clarity and improved development experience.
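To make the tokenization point concrete, here is a minimal sketch using only Python's standard `re` module. It compares the ASCII-only pattern quoted above with Python's Unicode-aware `\w+`. Note that `\w+` is only a rough stand-in chosen for illustration: it is not the UAX #29 word-boundary algorithm and is not how flashtext2 tokenizes internally. The sample text is made up for this example.

```python
import re

# Sample text mixing ASCII, German, and Greek words (made up for illustration).
text = "Python, Maße, στο"

# ASCII-only pattern quoted above for the original flashtext.
ascii_tokens = re.findall(r"[A-Za-z0-9_]+", text)

# Unicode-aware \w+ : a rough stand-in, NOT the UAX #29 algorithm.
unicode_tokens = re.findall(r"\w+", text)

print(ascii_tokens)    # 'Maße' is split at 'ß' and the Greek word is lost
print(unicode_tokens)  # every word survives as a single token
```

The ASCII-only pattern breaks "Maße" into two fragments and drops the Greek word entirely, which is the kind of failure a Unicode-aware tokenizer avoids.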
Usage
Keyword Extraction
from flashtext2 import KeywordProcessor
kp = KeywordProcessor(case_sensitive=False)
kp.add_keyword('Python')
kp.add_keyword('flashtext')
kp.add_keyword('program')
text = "I love programming in Python and using the flashtext library."
keywords_found = kp.extract_keywords(text)
print(keywords_found)
# Output: ['Python', 'flashtext']
keywords_found = kp.extract_keywords_with_span(text)
print(keywords_found)
# Output: [('Python', 22, 28), ('flashtext', 43, 52)]
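The spans shown above look like end-exclusive offsets into the input string, so they can be used directly for slicing. The sketch below does not call flashtext2; it simply copies the output listed above and checks it against the text. Because this text is ASCII-only, it cannot tell whether the offsets are character or byte positions, so treat that as an open assumption for non-ASCII input.

```python
text = "I love programming in Python and using the flashtext library."

# Spans copied from the example output above (not recomputed here).
spans = [("Python", 22, 28), ("flashtext", 43, 52)]

# Slice the original text with each (start, end) pair.
matched = [text[start:end] for _, start, end in spans]

for (keyword, start, end), fragment in zip(spans, matched):
    print(f"{keyword!r} at [{start}:{end}] -> {fragment!r}")
```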
Keyword Replacement
from flashtext2 import KeywordProcessor
kp = KeywordProcessor(case_sensitive=False)
kp.add_keyword('Java', 'Python')
kp.add_keyword('regex', 'flashtext')
text = "I love programming in Java and using the regex library."
new_text = kp.replace_keywords(text)
print(new_text)
# Output: "I love programming in Python and using the flashtext library."
Case Sensitivity
from flashtext2 import KeywordProcessor
text = 'abc aBc ABC'
kp = KeywordProcessor(case_sensitive=True)
kp.add_keyword('aBc')
print(kp.extract_keywords(text))
# Output: ['aBc']
kp = KeywordProcessor(case_sensitive=False)
kp.add_keyword('aBc')
print(kp.extract_keywords(text))
# Output: ['aBc', 'aBc', 'aBc']
Other Examples
Overlapping keywords (returns the longest sequence)
from flashtext2 import KeywordProcessor
kp = KeywordProcessor(case_sensitive=True)
kp.add_keyword('machine')
kp.add_keyword('machine learning')
text = "machine learning is a subset of artificial intelligence"
print(kp.extract_keywords(text))
# Output: ['machine learning']
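The longest-match behaviour can be illustrated with a small pure-Python sketch. This is a toy version written for this explanation, not flashtext2's actual implementation: it splits on whitespace and tries the longest candidate window first. The function name `longest_match_extract` is hypothetical.

```python
def longest_match_extract(keywords, text):
    """Toy longest-match keyword extraction over whitespace tokens."""
    # Store each keyword as a tuple of tokens so windows can be compared directly.
    keyword_tokens = {tuple(k.split()) for k in keywords}
    max_len = max((len(k) for k in keyword_tokens), default=0)
    tokens = text.split()

    found, i = [], 0
    while i < len(tokens):
        # Try the longest possible window first, then shrink it.
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + size])
            if candidate in keyword_tokens:
                found.append(" ".join(candidate))
                i += size  # skip past the matched tokens
                break
        else:
            i += 1  # no keyword starts at this token
    return found


result = longest_match_extract(
    ["machine", "machine learning"],
    "machine learning is a subset of artificial intelligence",
)
print(result)
```

When the longer keyword is not present, the shorter one still matches: with the same keywords, the text "machine vision" yields only "machine".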
Case folding
from flashtext2 import KeywordProcessor
kp = KeywordProcessor(case_sensitive=False)
kp.add_keywords_from_iter(["flour", "Maße", "ᾲ στο διάολο"])
text = "flour, MASSE, ὰι στο διάολο"
print(kp.extract_keywords(text))
# Output: ['flour', 'Maße', 'ᾲ στο διάολο']
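Why case folding rather than lowercasing matters can be seen with Python's built-in `str.casefold`, which applies Unicode full case folding. This sketch uses the standard library only and is not flashtext2's Rust implementation; it just shows the general difference for the "Maße" / "MASSE" pair from the example above.

```python
keyword = "Maße"
text_word = "MASSE"

# Lowercasing keeps 'ß', so the two strings do not compare equal.
lower_equal = keyword.lower() == text_word.lower()

# Case folding maps 'ß' to 'ss', so both become 'masse'.
fold_equal = keyword.casefold() == text_word.casefold()

print(keyword.lower(), text_word.lower(), lower_equal)
print(keyword.casefold(), text_word.casefold(), fold_equal)
```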
Performance
Extracting keywords is usually 2.5-3x faster than with the original flashtext, and replacing them is about 10x faster.
There is still room to optimize the code and improve performance.
You can find the benchmarks here.
In the benchmarks, words average 6 characters and each sentence contains 10k words, so each input is about 60k characters long.
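As a rough illustration of the input size described above, the following sketch builds a synthetic text of 10k words. It is an assumption-laden stand-in, not the project's benchmark code: it uses a fixed length of 6 letters per word instead of an average of 6, and random lowercase ASCII letters. The helper `random_word` is hypothetical.

```python
import random
import string

random.seed(0)  # fixed seed so the generated text is reproducible


def random_word(length=6):
    """Return a random lowercase ASCII word of the given length."""
    return "".join(random.choices(string.ascii_lowercase, k=length))


words = [random_word() for _ in range(10_000)]
text = " ".join(words)

letters = sum(len(w) for w in words)  # characters belonging to words
total = len(text)                     # words plus the separating spaces

print(letters)  # 10_000 words * 6 letters
print(total)    # letters plus 9_999 single-space separators
```

The 60k figure counts word characters only; with single-space separators the full string is closer to 70k characters.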
TODO
- Add multiple ways of normalizing strings: simple case folding, full case folding, and locale-aware folding
- Remove all clones in the source code
Credit to Vikash Singh, the author of the original flashtext package.