A package for extracting keywords from large text very quickly (much faster than regex and the original flashtext package).

Project description

flashtext 2.0

FlashText rewritten from scratch.


You can get the package directly from PyPI:

pip install flashtext2

Current state of the project: I'm implementing the core in Rust, which will make the benchmarks 3-10x faster (have a look at the pyo3 branch).

flashtext is great, but wouldn't it be nice if the code were much simpler? Instead of this:

def extract_keywords(self, sentence, span_info=False):
    keywords_extracted = []
    if not sentence:
        # if sentence is empty or none just return empty list
        return keywords_extracted
    if not self.case_sensitive:
        sentence = sentence.lower()
    current_dict = self.keyword_trie_dict
    sequence_start_pos = 0
    sequence_end_pos = 0
    reset_current_dict = False
    idx = 0
    sentence_len = len(sentence)
    while idx < sentence_len:
        char = sentence[idx]
        # when we reach a character that might denote word end
        if char not in self.non_word_boundaries:

            # if end is present in current_dict
            if self._keyword in current_dict or char in current_dict:
                # update longest sequence found
                sequence_found = None
                longest_sequence_found = None
                is_longer_seq_found = False
                if self._keyword in current_dict:
                    sequence_found = current_dict[self._keyword]
                    longest_sequence_found = current_dict[self._keyword]
                    sequence_end_pos = idx
                    
    # and many more lines ... (89 lines in total)

We would have this:

def _extract_keywords_iter(self, sentence: str) -> Iterator[str]:
    if not self._case_sensitive:
        sentence = sentence.lower()

    words: list[str] = list(itertools.chain(self.split_sentence(sentence), ('',)))
    n_words = len(words)
    keyword = self.keyword
    trie = self.trie_dict
    node = trie

    last_kw_found: str | None = None
    n_words_covered = 0
    idx = 0
    while idx < n_words:
        word = words[idx]

        n_words_covered += 1
        node = node.get(word)
        if node:
            kw = node.get(keyword)
            if kw:
                last_kw_found = kw
        else:
            if last_kw_found is not None:
                yield last_kw_found
                last_kw_found = None
                idx -= 1
            else:
                idx -= n_words_covered - 1
            node = trie
            n_words_covered = 0
        idx += 1

Much more readable, right? It is not only more readable and concise but also more performant (and more consistent); more on that later.
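The method above relies on attributes of the class, so it cannot be run on its own. Below is a standalone sketch of the same loop. The tokenizer regex, the hand-written trie, and the name `extract_keywords_sketch` are assumptions made for illustration; this is not the library's actual code.

```python
import re

KEYWORD = "__keyword__"  # marker key, as used in the trie shown in this README

# Assumption: a token is a run of word characters or a single non-word character.
_TOKEN_RE = re.compile(r"\w+|\W")


def extract_keywords_sketch(trie: dict, sentence: str) -> list:
    """Return the clean words of all keywords found, in order of appearance."""
    # The trailing '' is a sentinel: it forces a final miss so that a match
    # still pending at the end of the sentence gets emitted.
    words = _TOKEN_RE.findall(sentence.lower()) + [""]
    found = []
    node = trie
    last_kw_found = None
    n_words_covered = 0
    idx = 0
    while idx < len(words):
        word = words[idx]
        n_words_covered += 1
        # Guard: a token equal to the marker key must never be followed.
        node = node.get(word) if word != KEYWORD else None
        if node:
            kw = node.get(KEYWORD)
            if kw:
                last_kw_found = kw  # remember the longest match so far
        else:
            if last_kw_found is not None:
                found.append(last_kw_found)
                last_kw_found = None
                idx -= 1  # re-examine the token that broke the match
            else:
                idx -= n_words_covered - 1  # restart right after the start token
            node = trie
            n_words_covered = 0
        idx += 1
    return found


trie = {
    "py": {KEYWORD: "Python"},
    "go": {KEYWORD: "Golang"},
    "hello": {KEYWORD: "Hey"},
}
print(extract_keywords_sketch(trie, "Hello, I love Py and Go."))
# ['Hey', 'Python', 'Golang']
```

Because the loop walks whole tokens instead of characters, a keyword can only match on token boundaries, which is why no separate word-boundary check is needed in this sketch.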

Besides being rewritten with simpler, shorter, and more intuitive code, all the methods and functions are fully typed.

Performance

Simplicity is great, but how is the performance?

I created some benchmarks, which you can find here, and it turns out that flashtext2 is faster than the original package for both extracting and replacing keywords:

Extracting keywords: [benchmark chart]

Replacing keywords: [benchmark chart]


Quick Start

Import and initialize the class:

>>> from flashtext2 import KeywordProcessor
>>> kp = KeywordProcessor()

Add a bunch of words:

>>> kp.add_keywords_from_dict({'py': 'Python', 'go': 'Golang', 'hello': 'hey'})

The dictionary keys represent the words that we want to search for in the string, and the values are their corresponding 'clean words'.

Check how many words we added:

>>> len(kp)
3

We can see how the key/values are stored in the trie dict:

>>> kp.trie_dict
{'py': {'__keyword__': 'Python'},
 'go': {'__keyword__': 'Golang'},
 'hello': {'__keyword__': 'hey'}}

One major change in FlashText 2.0 is that keywords are split into groups of word and non-word characters instead of individual characters. For example, if you were to add the keyword/sentence "I love .NET", it would be stored like this:

>>> kp2 = KeywordProcessor()
>>> kp2.add_keyword("I love .NET")  # not actually :)
>>> kp2.trie_dict
{'i': {' ': {'love': {' ': {'': {'.': {'net': {'__keyword__': 'I love .NET'}}}}}}}}
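To make the word-level storage concrete, here is a minimal sketch of how such a trie could be built. The tokenizer regex and the `add_keyword_sketch` helper are assumptions for illustration, not flashtext2's actual implementation, so the resulting dictionary can differ in detail from the library output above (for instance, this sketch produces no empty-string key).

```python
import re
from typing import Optional

KEYWORD = "__keyword__"  # marker key, as used in the trie shown above

# Assumption: a token is a run of word characters or a single non-word character.
_TOKEN_RE = re.compile(r"\w+|\W")


def add_keyword_sketch(trie: dict, keyword: str, clean_word: Optional[str] = None) -> None:
    """Insert `keyword` into `trie`, one token per nesting level."""
    node = trie
    # Lowercasing mirrors case-insensitive matching (an assumption here).
    for token in _TOKEN_RE.findall(keyword.lower()):
        node = node.setdefault(token, {})
    # The terminal node stores the clean word under the marker key.
    node[KEYWORD] = keyword if clean_word is None else clean_word


trie = {}
add_keyword_sketch(trie, "I love .NET")
add_keyword_sketch(trie, "py", "Python")
print(trie)
# {'i': {' ': {'love': {' ': {'.': {'net': {'__keyword__': 'I love .NET'}}}}}}, 'py': {'__keyword__': 'Python'}}
```

Storing one token per level keeps the trie shallow: a lookup takes one dictionary access per word or separator rather than one per character.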

Extracting Keywords

from flashtext2 import KeywordProcessor
kp = KeywordProcessor()
kp.add_keywords_from_dict({'py': 'Python', 'go': 'Golang', 'hello': 'Hey'})

my_str = 'Hello, I love learning Py, aka: Python, and I plan to learn about Go as well.'

kp.extract_keywords(my_str)
['Hey', 'Python', 'Golang']

Replace Keywords

kp.replace_keywords(my_str)
'Hey, I love learning Python, aka: Python, and I plan to learn about Golang as well.'
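As a rough illustration of how replacement can work on a word-level trie, the sketch below takes the longest match starting at each token and copies unmatched tokens through unchanged. The tokenizer regex, the hand-written trie, and the name `replace_keywords_sketch` are assumptions; the library's own `replace_keywords()` may be implemented differently.

```python
import re

KEYWORD = "__keyword__"  # marker key, as used in the trie shown in this README

# Assumption: a token is a run of word characters or a single non-word character.
_TOKEN_RE = re.compile(r"\w+|\W")


def replace_keywords_sketch(trie: dict, sentence: str) -> str:
    """Replace every keyword occurrence with its clean word (longest match wins)."""
    tokens = _TOKEN_RE.findall(sentence)  # original casing is kept for output
    out = []
    i = 0
    while i < len(tokens):
        node = trie
        clean, end = None, i
        j = i
        # Walk the trie as far as the tokens allow, remembering the last full match.
        while j < len(tokens):
            token = tokens[j].lower()
            # Guard: a token equal to the marker key must never be followed.
            node = node.get(token) if token != KEYWORD else None
            if not node:
                break
            j += 1
            if KEYWORD in node:
                clean, end = node[KEYWORD], j
        if clean is None:
            out.append(tokens[i])  # no keyword starts here: copy the token
            i += 1
        else:
            out.append(clean)  # emit the clean word and skip the matched tokens
            i = end
    return "".join(out)


trie = {
    "py": {KEYWORD: "Python"},
    "go": {KEYWORD: "Golang"},
    "hello": {KEYWORD: "Hey"},
}
my_str = 'Hello, I love learning Py, aka: Python, and I plan to learn about Go as well.'
print(replace_keywords_sketch(trie, my_str))
# Hey, I love learning Python, aka: Python, and I plan to learn about Golang as well.
```

Since the tokens cover every character of the input, joining the output tokens reproduces the original text exactly wherever no keyword matched.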

Acknowledgements

Credit goes to the original FlashText package author, Vikash Singh, and to decorator-factory for optimizing the algorithm.

What's next

Stay tuned! In a future version I will implement the whole algorithm in Rust, which, besides being as fast as C, uses very little memory even on very large strings and makes it easy to parallelize code to take advantage of all cores.

  • Optimize the extract_keywords() algorithm
  • Experiment with Cython to speed up everything
  • Add a selection of algorithms for extracting different things (words, substrings, sentences, etc.)
  • Improve tests
  • Experiment with multiprocessing to improve performance on very large strings

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

  • flashtext2-1.0.0.tar.gz (144.7 kB): Source

Built Distributions

  • flashtext2-1.0.0-cp38-abi3-win_amd64.whl (156.3 kB): CPython 3.8+, Windows x86-64
  • flashtext2-1.0.0-cp38-abi3-win32.whl (150.2 kB): CPython 3.8+, Windows x86
  • flashtext2-1.0.0-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB): CPython 3.8+, manylinux: glibc 2.17+ x86-64
  • flashtext2-1.0.0-cp38-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl (1.3 MB): CPython 3.8+, manylinux: glibc 2.17+ s390x
  • flashtext2-1.0.0-cp38-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl (1.3 MB): CPython 3.8+, manylinux: glibc 2.17+ ppc64le
  • flashtext2-1.0.0-cp38-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl (1.2 MB): CPython 3.8+, manylinux: glibc 2.17+ ARMv7l
  • flashtext2-1.0.0-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.2 MB): CPython 3.8+, manylinux: glibc 2.17+ ARM64
  • flashtext2-1.0.0-cp38-abi3-manylinux_2_5_i686.manylinux1_i686.whl (1.2 MB): CPython 3.8+, manylinux: glibc 2.5+ i686
  • flashtext2-1.0.0-cp38-abi3-macosx_11_0_arm64.whl (285.4 kB): CPython 3.8+, macOS 11.0+ ARM64
  • flashtext2-1.0.0-cp38-abi3-macosx_10_12_x86_64.whl (286.9 kB): CPython 3.8+, macOS 10.12+ x86-64
