A package for extracting keywords from large text very quickly (much faster than regex and the original flashtext package).
flashtext 2.0
FlashText rewritten from scratch.
You can get the package directly from PyPI
pip install flashtext2
Current state of the project:
I'm currently working on implementing the core in Rust, which will make the benchmarks 3-10x faster (have a look at the pyo3 branch).
flashtext is great, but wouldn't it be nice if the code were much simpler? Instead of this:
def extract_keywords(self, sentence, span_info=False):
    keywords_extracted = []
    if not sentence:
        # if sentence is empty or none just return empty list
        return keywords_extracted
    if not self.case_sensitive:
        sentence = sentence.lower()
    current_dict = self.keyword_trie_dict
    sequence_start_pos = 0
    sequence_end_pos = 0
    reset_current_dict = False
    idx = 0
    sentence_len = len(sentence)
    while idx < sentence_len:
        char = sentence[idx]
        # when we reach a character that might denote word end
        if char not in self.non_word_boundaries:
            # if end is present in current_dict
            if self._keyword in current_dict or char in current_dict:
                # update longest sequence found
                sequence_found = None
                longest_sequence_found = None
                is_longer_seq_found = False
                if self._keyword in current_dict:
                    sequence_found = current_dict[self._keyword]
                    longest_sequence_found = current_dict[self._keyword]
                    sequence_end_pos = idx
                # and many more lines ... (89 lines in total)
We would have this:
def _extract_keywords_iter(self, sentence: str) -> Iterator[str]:
    if not self._case_sensitive:
        sentence = sentence.lower()
    words: list[str] = list(itertools.chain(self.split_sentence(sentence), ('',)))
    n_words = len(words)
    keyword = self.keyword
    trie = self.trie_dict
    node = trie
    last_kw_found: str | None = None
    n_words_covered = 0
    idx = 0
    while idx < n_words:
        word = words[idx]
        n_words_covered += 1
        node = node.get(word)
        if node:
            kw = node.get(keyword)
            if kw:
                last_kw_found = kw
        else:
            if last_kw_found is not None:
                yield last_kw_found
                last_kw_found = None
                idx -= 1
            else:
                idx -= n_words_covered - 1
            node = trie
            n_words_covered = 0
        idx += 1
Much more readable, right? Not only is this more readable and concise, it is also more performant (and consistent); more on that later.
Besides rewriting all the functions with simpler, shorter, and more intuitive code, I have fully typed all the methods and functions.
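To play with the idea outside the class, here is a minimal standalone sketch of longest-match extraction over a word-level trie. It is not the package's code: the regex tokenizer and the names `build_trie` and `extract` are assumptions made for illustration, the loop is written as a nested scan rather than the backtracking index used above, and the sentinel string is assumed never to occur as a token inside a keyword.

```python
import re
from typing import Dict, Iterator

KEYWORD = "__keyword__"  # sentinel key marking the end of a keyword
# Assumed tokenizer: runs of word characters, or single non-word characters.
TOKEN_RE = re.compile(r"\w+|\W")


def build_trie(mapping: Dict[str, str]) -> dict:
    """Store each keyword token by token in a nested dict."""
    trie: dict = {}
    for keyword, clean_name in mapping.items():
        node = trie
        for token in TOKEN_RE.findall(keyword.lower()):
            node = node.setdefault(token, {})
        node[KEYWORD] = clean_name
    return trie


def extract(trie: dict, sentence: str) -> Iterator[str]:
    """Yield the clean name of the longest keyword found at each position."""
    tokens = TOKEN_RE.findall(sentence.lower())
    start = 0
    while start < len(tokens):
        node = trie
        match, match_end = None, start
        for end in range(start, len(tokens)):
            node = node.get(tokens[end])
            # Stop on a dead end (or on the sentinel key, whose value is not a dict)
            if not isinstance(node, dict):
                break
            if KEYWORD in node:
                # Remember the longest complete keyword seen so far
                match, match_end = node[KEYWORD], end
        if match is not None:
            yield match
            start = match_end + 1  # continue right after the match
        else:
            start += 1


trie = build_trie({
    "new york": "NYC",
    "new york city": "New York City",
    "py": "Python",
})
found = list(extract(trie, "I moved from New York City to New York, writing Py."))
print(found)
```

Because the scan keeps walking after the first complete keyword, "New York City" wins over its prefix "New York" where both could match.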
Performance
Simplicity is great, but how is the performance?
I created some benchmarks, which you can find here, and it turns out that for both extracting and replacing keywords it is faster than the original package:
[Figure: benchmark results for extracting keywords]
[Figure: benchmark results for replacing keywords]
Quick Start
Import and initialize the class:
>>> from flashtext2 import KeywordProcessor
>>> kp = KeywordProcessor()
Add a bunch of words:
>>> kp.add_keywords_from_dict({'py': 'Python', 'go': 'Golang', 'hello': 'hey'})
The dictionary keys are the words that we want to search for in the string, and the values are their corresponding 'clean words'.
Check how many words we added:
>>> len(kp)
3
We can see how the key/values are stored in the trie dict:
>>> kp.trie_dict
{'py': {'__keyword__': 'Python'},
'go': {'__keyword__': 'Golang'},
'hello': {'__keyword__': 'hey'}}
One major change in FlashText 2.0 is that keywords are split into groups of word and non-word characters instead of into single characters.
For example, if you were to add the keyword/sentence "I love .NET", it would be stored like this:
>>> kp2 = KeywordProcessor()
>>> kp2.add_keyword("I love .NET") # not actually :)
>>> kp2.trie_dict
{'i': {' ': {'love': {' ': {'': {'.': {'net': {'__keyword__': 'I love .NET'}}}}}}}}
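As a rough illustration of this storage scheme, the sketch below splits a keyword into word and non-word tokens and inserts them into a nested dict. The regex and the helper names `tokenize` and `add_keyword` are assumptions for this example, not the package's own implementation; in particular, this tokenizer does not produce the empty-string key visible in the output above.

```python
import re
from typing import List, Optional

KEYWORD = "__keyword__"  # sentinel key, as seen in the trie_dict output above

# Assumed splitting rule: a run of word characters, or one non-word character.
TOKEN_RE = re.compile(r"\w+|\W")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word / non-word tokens."""
    return TOKEN_RE.findall(text.lower())


def add_keyword(trie: dict, keyword: str, clean_name: Optional[str] = None) -> None:
    """Insert one keyword into the trie, one token per nesting level."""
    node = trie
    for token in tokenize(keyword):
        node = node.setdefault(token, {})
    # The clean name defaults to the keyword itself
    node[KEYWORD] = clean_name if clean_name is not None else keyword


trie: dict = {}
add_keyword(trie, "py", "Python")
add_keyword(trie, "I love .NET")
print(tokenize("I love .NET"))
print(trie)
```

Storing whole tokens instead of characters keeps the trie shallow: a lookup takes one dict access per word rather than one per character.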
Extracting Keywords
>>> from flashtext2 import KeywordProcessor
>>> kp = KeywordProcessor()
>>> kp.add_keywords_from_dict({'py': 'Python', 'go': 'Golang', 'hello': 'Hey'})
>>> my_str = 'Hello, I love learning Py, aka: Python, and I plan to learn about Go as well.'
>>> kp.extract_keywords(my_str)
['Hey', 'Python', 'Golang']
Replace Keywords
>>> kp.replace_keywords(my_str)
'Hey, I love learning Python, aka: Python, and I plan to learn about Golang as well.'
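Conceptually, replacement can be thought of as walking the tokens of the sentence and emitting either the clean word (on a match) or the original token (otherwise). The following is only a sketch under that assumption: the tokenizer regex and the names `build_trie` and `replace` are hypothetical and are not taken from the package.

```python
import re
from typing import Dict

KEYWORD = "__keyword__"
TOKEN_RE = re.compile(r"\w+|\W")  # assumed tokenizer, not the package's own


def build_trie(mapping: Dict[str, str]) -> dict:
    trie: dict = {}
    for keyword, clean_name in mapping.items():
        node = trie
        for token in TOKEN_RE.findall(keyword.lower()):
            node = node.setdefault(token, {})
        node[KEYWORD] = clean_name
    return trie


def replace(trie: dict, sentence: str) -> str:
    # Keep the original casing in the tokens; lowercase only for lookups
    tokens = TOKEN_RE.findall(sentence)
    out = []
    start = 0
    while start < len(tokens):
        node = trie
        match, match_end = None, start
        for end in range(start, len(tokens)):
            node = node.get(tokens[end].lower())
            if not isinstance(node, dict):
                break
            if KEYWORD in node:
                match, match_end = node[KEYWORD], end
        if match is not None:
            out.append(match)          # emit the clean word
            start = match_end + 1      # skip every token the keyword covered
        else:
            out.append(tokens[start])  # no keyword here: copy the token as-is
            start += 1
    return "".join(out)


trie = build_trie({"py": "Python", "go": "Golang", "hello": "Hey"})
my_str = "Hello, I love learning Py, aka: Python, and I plan to learn about Go as well."
print(replace(trie, my_str))
# Hey, I love learning Python, aka: Python, and I plan to learn about Golang as well.
```

Since the tokens cover every character of the input, joining the untouched tokens back together reproduces the surrounding text exactly.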
Acknowledgements
Credit goes to the original FlashText package author, Vikash Singh, and to decorator-factory for optimizing the algorithm.
What's next
Stay tuned! In a future version I will implement the whole algorithm in Rust, which, besides being as fast as C, uses very little memory even on very large strings and makes it easy to parallelize code to take advantage of all cores.
- Optimize the extract_keywords() algorithm
- Experiment with Cython to speed up everything
- Add a selection of algorithms for extracting different things (words, substrings, sentences, etc.)
- Improve tests
- Experiment with multiprocessing to improve performance on very large strings
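One possible shape for the multiprocessing experiment is to cut a very large string into chunks at whitespace and hand each chunk to a worker process. The sketch below is an assumption about how that might look, not a planned design: `split_on_whitespace`, `count_words`, and `process_in_parallel` are hypothetical names, and `count_words` is only a stand-in for the real per-chunk work. Note that a multi-word keyword straddling a cut would be missed with this naive splitting.

```python
from multiprocessing import Pool
from typing import List


def split_on_whitespace(text: str, chunk_size: int) -> List[str]:
    """Cut text into pieces of roughly chunk_size characters, preferring spaces."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Move the cut back to the last space so no word is split in half;
            # fall back to a hard cut when the chunk holds no usable space
            space = text.rfind(" ", start, end)
            if space > start:
                end = space + 1
        chunks.append(text[start:end])
        start = end
    return chunks


def count_words(chunk: str) -> int:
    # Stand-in for the real per-chunk work (e.g. keyword extraction)
    return len(chunk.split())


def process_in_parallel(text: str, chunk_size: int = 1_000_000, workers: int = 4) -> List[int]:
    """Run the per-chunk function over all chunks in a pool of worker processes."""
    chunks = split_on_whitespace(text, chunk_size)
    with Pool(workers) as pool:
        return pool.map(count_words, chunks)
```

When trying this out, call `process_in_parallel(...)` from under an `if __name__ == "__main__":` guard, which `multiprocessing` requires on platforms that start workers by spawning a fresh interpreter.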