
A package for extracting keywords from large text very quickly (much faster than regex and the original flashtext package).

Project description

FlashText 2.0




flashtext is great, but wouldn't it be nice if the code was much simpler, so instead of this:

def extract_keywords(self, sentence, span_info=False):
    keywords_extracted = []
    if not sentence:
        # if sentence is empty or none just return empty list
        return keywords_extracted
    if not self.case_sensitive:
        sentence = sentence.lower()
    current_dict = self.keyword_trie_dict
    sequence_start_pos = 0
    sequence_end_pos = 0
    reset_current_dict = False
    idx = 0
    sentence_len = len(sentence)
    while idx < sentence_len:
        char = sentence[idx]
        # when we reach a character that might denote word end
        if char not in self.non_word_boundaries:

            # if end is present in current_dict
            if self._keyword in current_dict or char in current_dict:
                # update longest sequence found
                sequence_found = None
                longest_sequence_found = None
                is_longer_seq_found = False
                if self._keyword in current_dict:
                    sequence_found = current_dict[self._keyword]
                    longest_sequence_found = current_dict[self._keyword]
                    sequence_end_pos = idx
                    
    # and many more lines ... (89 lines in total)

We would have this:

def extract_keywords_iter(self, sentence: str) -> Iterator[tuple[str, int, int]]:
    if not self._case_sensitive:
        sentence = sentence.lower()

    words: list[str] = self.split_sentence(sentence) + ['']
    lst_len: list[int] = list(map(len, words))  # cache the len() of each word
    keyword = self.keyword
    trie = self.trie_dict
    node = trie

    last_kw_found: str | None = None
    last_kw_found_idx: tuple[int, int] | None = None
    last_start_span: tuple[int, int] | None = None
    n_words_covered = 0
    idx = 0
    while idx < len(words):
        word = words[idx]

        n_words_covered += 1
        node = node.get(word)
        if node:
            kw = node.get(keyword)
            if kw:
                last_kw_found = kw
                last_kw_found_idx = (idx, n_words_covered)
        else:
            if last_kw_found is not None:
                kw_end_idx, kw_n_covered = last_kw_found_idx
                start_span_idx = kw_end_idx - kw_n_covered + 1

                if last_start_span is None:
                    start_span = sum(lst_len[:start_span_idx])
                else:
                    start_span = last_start_span[1] + sum(lst_len[last_start_span[0]:start_span_idx])
                last_start_span = start_span_idx, start_span  # cache the len() for the given slice for next time

                yield last_kw_found, start_span, start_span + sum(
                    lst_len[start_span_idx:start_span_idx + kw_n_covered])
                last_kw_found = None
                idx -= 1
            else:
                idx -= n_words_covered - 1
            node = trie
            n_words_covered = 0
        idx += 1

Much more readable, right?
Also, other than rewriting all the functions with simpler, shorter, and more intuitive code, all the methods and functions are fully typed.
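The span arithmetic in the function above turns token indices into character offsets by summing token lengths. As a rough, standalone illustration of that idea (not the package's code; `token_offsets` and `span` are hypothetical helper names, and the `re.split` tokenizer is an assumption), prefix sums give the same offsets:

```python
import re
from itertools import accumulate


def token_offsets(sentence: str) -> tuple[list[str], list[int]]:
    # Assumed tokenizer: split on every non-word character and keep the separators.
    tokens = re.split(r'(\W)', sentence)
    # starts[i] is the character offset at which tokens[i] begins;
    # the final entry equals len(sentence).
    starts = [0, *accumulate(map(len, tokens))]
    return tokens, starts


def span(starts: list[int], first: int, n_tokens: int) -> tuple[int, int]:
    # Character span covered by n_tokens consecutive tokens starting at index `first`.
    return starts[first], starts[first + n_tokens]


sentence = 'I love Py'
tokens, starts = token_offsets(sentence)
start, end = span(starts, 4, 1)
print(sentence[start:end])
```

Slicing the original sentence with the returned span yields the matched text, which is what the `start_span` and end values yielded above are meant to allow.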

Performance

Simplicity is great, but how is the performance?

I created some benchmarks, which you can find here, and it turns out that FlashText 2.0 is faster than the original package for both extracting and replacing keywords:

Extracting keywords: Image

Replacing keywords: Image


Quick Start

Import and initialize the class:

>>> from flashtext2 import KeywordProcessor
>>> kp = KeywordProcessor()

Add a bunch of words:

>>> kp.add_keywords_from_dict({'py': 'Python', 'go': 'Golang', 'hello': 'hey'})

The dictionary keys represent the words that we want to search in the string, and the values are their corresponding 'clean word'.

Check how many words we added:

>>> len(kp)
3

We can see how the key/values are stored in the trie dict:

>>> kp.trie_dict
{'py': {'__keyword__': 'Python'},
 'go': {'__keyword__': 'Golang'},
 'hello': {'__keyword__': 'hey'}}

One major change in FlashText 2.0 is that keywords are split into groups of words and non-words instead of into single characters. For example, if you were to add the keyword/sentence "I love .NET" it would be stored like this:

>>> kp2 = KeywordProcessor()
>>> kp2.add_keyword("I love .NET")  # not actually :)
>>> kp2.trie_dict
{'i': {' ': {'love': {' ': {'': {'.': {'net': {'__keyword__': 'I love .NET'}}}}}}}}
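To make the storage layout concrete, here is a minimal sketch of building such a nested-dict trie. It assumes the keyword is lowercased and tokenized with `re.split(r'(\W)', ...)`; that tokenizer and the `build_trie` helper are assumptions for illustration, not the package's confirmed implementation, although they do produce the same shape as the output shown above.

```python
import re

KEYWORD = '__keyword__'


def build_trie(keywords: dict[str, str]) -> dict:
    # Hypothetical helper: maps each keyword to its clean word in a nested dict.
    trie: dict = {}
    for keyword, clean_word in keywords.items():
        node = trie
        # Assumed tokenizer: split on each non-word character, keeping separators.
        # Adjacent separators produce an empty-string token between them.
        for token in re.split(r'(\W)', keyword.lower()):
            node = node.setdefault(token, {})
        node[KEYWORD] = clean_word
    return trie


trie = build_trie({'I love .NET': 'I love .NET'})
print(trie)
```

Because each level of the trie is a whole token rather than a single character, a lookup takes one dictionary access per word or separator.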

Extracting Keywords

>>> from flashtext2 import KeywordProcessor
>>> kp = KeywordProcessor()
>>> kp.add_keywords_from_dict({'py': 'Python', 'go': 'Golang', 'hello': 'Hey'})

>>> my_str = 'Hello, I love learning Py, aka: Python, and I plan to learn about Go as well.'

>>> kp.extract_keywords(my_str)
['Hey', 'Python', 'Golang']
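Note that 'Python' appears only once in the result: the text is matched token by token, so the word "Python" in the sentence is not treated as containing the keyword "py". The sketch below shows a simplified longest-match walk over a token trie. It is a deliberately naive illustration, not the package's algorithm; the `extract` function and the `re.split` tokenizer are assumptions.

```python
import re

KEYWORD = '__keyword__'

# Same shape as the trie_dict shown in the Quick Start section.
trie = {
    'py': {KEYWORD: 'Python'},
    'go': {KEYWORD: 'Golang'},
    'hello': {KEYWORD: 'Hey'},
}


def extract(sentence: str, trie: dict) -> list[str]:
    # Assumed tokenizer: lowercase, then split on non-word characters, keeping them.
    tokens = re.split(r'(\W)', sentence.lower())
    found = []
    i = 0
    while i < len(tokens):
        node = trie
        match = None  # (clean word, index just past the last matched token)
        j = i
        # Walk the trie as far as the tokens allow, remembering the longest match.
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if KEYWORD in node:
                match = (node[KEYWORD], j)
        if match:
            found.append(match[0])
            i = match[1]  # resume after the matched keyword
        else:
            i += 1
    return found


my_str = 'Hello, I love learning Py, aka: Python, and I plan to learn about Go as well.'
result = extract(my_str, trie)
print(result)
```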

Replace Keywords

>>> kp.replace_keywords(my_str)
'Hey, I love learning Python, aka: Python, and I plan to learn about Golang as well.'
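Replacement can be thought of as extraction plus splicing: given each match as a (clean word, start, end) tuple, the output is rebuilt from the untouched text between matches. The following is a minimal sketch of that splicing step only; `replace_spans` is a hypothetical helper and the spans are written by hand, so this says nothing about how the package implements `replace_keywords()` internally.

```python
def replace_spans(text: str, spans: list[tuple[str, int, int]]) -> str:
    # spans must be sorted by start offset and must not overlap.
    parts = []
    prev_end = 0
    for clean_word, start, end in spans:
        parts.append(text[prev_end:start])  # untouched text before the match
        parts.append(clean_word)            # the replacement
        prev_end = end
    parts.append(text[prev_end:])           # trailing text after the last match
    return ''.join(parts)


text = 'Hello, I love Py'
replaced = replace_spans(text, [('Hey', 0, 5), ('Python', 14, 16)])
print(replaced)
```

Collecting the pieces in a list and joining once avoids repeatedly copying the string, which matters for large inputs.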

Acknowledgements

Credit goes to the original FlashText package author, Vikash Singh, and to decorator-factory for optimizing the algorithm.

What's next

  • Optimize the extract_keywords() algorithm
  • Experiment with Cython to speed up everything
  • Add a selection of algorithms for extracting different things (words, substrings, sentences, etc.)
  • Improve tests
