Trie-search is a package for text pattern search using marisa-trie

Installation

$ pip install trie-search

Usage

Create trie dictionary

Before using this package, create a trie dictionary or prepare a list of patterns.

The following example creates a marisa_trie.Trie dictionary from the list of article titles in the English Wikipedia and saves it to ./example/triedict.

$ cd ./example
$ curl https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-all-titles-in-ns0.gz | gzip -dc | python create_triedict.py

NOTE: This script consumes more than 2 GB of memory.

trie_search.TrieSearch

Create an instance, and load dictionary:

>>> import trie_search
>>> trie = trie_search.TrieSearch(filepath='./example/triedict')

If you have a list or tuple of patterns, you can create an instance as follows:

>>> patterns = ['pattern1', 'pattern2', 'pattern3']
>>> trie = trie_search.TrieSearch(patterns)

TrieSearch.search_all_patterns

Search all patterns in an input text:

>>> text = ('in computer science , a trie , also called digital tree and '
...         'sometimes radix tree or prefix tree ( as they can be searched '
...         'by prefixes ) , is a kind of search tree - an ordered tree data '
...         'structure that is used to store a dynamic set or associative array '
...         'where the keys are usually strings .')
>>> for pattern, start_idx in trie.search_all_patterns(text):
...     print(pattern, start_idx)
...
in 0
computer 3
computer science 3
science 12
, 20
a 22
trie 24
, 29
also 31
called 36
digital 43
... skipped ...
array 247
where 253
where the 253
the 259
the keys 259
keys 263
are 268
usually 272
strings 280
  • The text is the first sentence of https://en.wikipedia.org/wiki/Trie, normalized by lowercasing it and adding a single space before and after each symbol.

  • search_all_patterns returns an iterator. Each match is a tuple (pattern_string, pattern_start_index), and the results are sorted by start index. To get the results as a list, use the list function as follows:

    >>> patterns = list(trie.search_all_patterns(text))
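
The normalization and matching described in the notes above can be approximated in plain Python. This is only a sketch, not the package's implementation: normalize and search_all are hypothetical helpers, a plain set stands in for the marisa-trie dictionary, and matching is assumed to happen only at space-separated token boundaries, with a hypothetical max_tokens limit on pattern length.

```python
import re


def normalize(text):
    """Lowercase the text and put single spaces around symbols."""
    # Pad every character that is neither a word character nor whitespace.
    padded = re.sub(r'([^\w\s])', r' \1 ', text.lower())
    # Collapse runs of whitespace into single spaces.
    return ' '.join(padded.split())


def search_all(patterns, text, max_tokens=5):
    """Yield (pattern, start_index) pairs ordered by start index."""
    patterns = set(patterns)  # stand-in for the trie dictionary
    tokens = text.split(' ')
    # Character offset at which each token starts.
    starts, offset = [], 0
    for token in tokens:
        starts.append(offset)
        offset += len(token) + 1
    for i in range(len(tokens)):
        # Try every run of up to max_tokens tokens beginning at token i.
        for j in range(i + 1, min(i + max_tokens, len(tokens)) + 1):
            candidate = ' '.join(tokens[i:j])
            if candidate in patterns:
                yield candidate, starts[i]


sample = normalize('In computer science, a trie is a kind of search tree.')
wanted = ['computer', 'computer science', 'trie', 'search tree']
print(list(search_all(wanted, sample)))
# [('computer', 3), ('computer science', 3), ('trie', 24), ('search tree', 42)]
```

A real trie avoids building and testing every candidate string: it can stop extending a candidate as soon as no stored key begins with it.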

TrieSearch.search_longest_patterns

Search longest patterns in an input text:

>>> for pattern, start_idx in sorted(trie.search_longest_patterns(text), key=lambda x: x[1]):
...     print(pattern, start_idx)
...
in 0
computer science 3
, 20
a 22
trie 24
, 29
also 31
called 36
digital tree 43
and 56
sometimes 60
radix tree 70
or 81
prefix tree 84
( 96
as 98
they 101
can 106
be 110
by 122
prefixes 125
) 134
, 136
is a 138
kind 143
of 148
search tree 151
- 163
an 165
ordered tree data structure 168
that 196
is 201
used to 204
store 212
a 218
dynamic set 220
or 232
associative array 235
where the 253
the keys 259
are 268
usually 272
strings 280
  • search_longest_patterns also returns an iterator. The results are sorted by pattern length. In the example above, they are re-sorted by start index.
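
The sample output above contains overlapping matches such as "where the" and "the keys", but not "keys" on its own. One selection rule consistent with that output is to visit matches from longest to shortest and keep a match unless every character it spans is already covered by a kept match. The sketch below illustrates that rule. It is an assumption inferred from the output, not the package's confirmed implementation, and select_longest is a hypothetical name.

```python
def select_longest(matches, text_length):
    """Keep matches, longest first, unless they are already fully covered."""
    covered = [False] * text_length
    kept = []
    for pattern, start in sorted(matches, key=lambda m: len(m[0]), reverse=True):
        span = range(start, start + len(pattern))
        if all(covered[i] for i in span):
            continue  # every character already belongs to a kept match
        for i in span:
            covered[i] = True
        kept.append((pattern, start))
    return kept


matches = [('where', 0), ('where the', 0), ('the', 6), ('the keys', 6), ('keys', 10)]
result = select_longest(matches, len('where the keys'))
print(sorted(result, key=lambda m: m[1]))
# [('where the', 0), ('the keys', 6)]
```

Under this rule a partially overlapping match survives, which is why two kept patterns can share the word "the".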

trie_search.RecordTrieSearch

trie_search.RecordTrieSearch is a subclass of marisa_trie.RecordTrie, which maps unicode keys to data tuples.

The methods search_all_patterns and search_longest_patterns are also implemented in trie_search.RecordTrieSearch.
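
To make the idea of keys mapped to data tuples concrete, the sketch below mimics it with the standard library only. A dict of struct-packed records stands in for marisa_trie.RecordTrie, which also takes a struct format string. The record layout FMT and the helpers lookup and attach_records are hypothetical, and the source does not say whether the search methods themselves return the stored tuples, so the join is done by hand here.

```python
import struct

FMT = '<HH'  # hypothetical record layout: two unsigned 16-bit integers

# Stand-in for a record trie: each key maps to a list of packed records.
packed = {
    'trie': [struct.pack(FMT, 1, 10)],
    'search tree': [struct.pack(FMT, 2, 20), struct.pack(FMT, 3, 30)],
}


def lookup(key):
    """Return the unpacked data tuples stored under key (empty if absent)."""
    return [struct.unpack(FMT, raw) for raw in packed.get(key, [])]


def attach_records(matches):
    """Yield (pattern, start_index, records) for each (pattern, start_index)."""
    for pattern, start in matches:
        yield pattern, start, lookup(pattern)


found = [('trie', 24), ('search tree', 151)]
for pattern, start, records in attach_records(found):
    print(pattern, start, records)
```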
