Trie-search is a package for text pattern search using marisa-trie

## Project description

Trie-search is a package for text pattern search using marisa-trie.

## Installation

$ pip install trie-search

## Usage

### Create trie dictionary

Before using this package, you need to create trie dictionary, or prepare a list of patterns.

The following example simply creates trie dictionary of
`marisa_trie.Trie` from list of article titles in English version of
Wikipedia, and saves it to `./example/triedict`.

$ cd ./example $ curl https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-all-titles-in-ns0.gz | gzcat | python create_triedict.py

**NOTICE** : This script will consume more than 2GB memory.

### trie_search.TrieSearch

Create an instance, and load dictionary:

>>> import trie_search >>> trie = trie_search.TrieSearch(filepath='./example/triedict')

If you have `list` or `tuple` object of patterns, you can create an
instance as follows:

>>> patterns = ['pattern1', 'pattern2', 'pattern3'] >>> trie = trie_search.TrieSearch(patterns)

#### TrieSearch.search_all_patterns

Search all patterns in an input text:

>>> text = ('in computer science , a trie , also called digital tree and ' ... 'sometimes radix tree or prefix tree ( as they can be searched ' ... 'by prefixes ) , is a kind of search tree - an ordered tree data ' ... 'structure that is used to store a dynamic set or associative array ' ... 'where the keys are usually strings .') >>> for pattern, start_idx in trie.search_all_patterns(text): ... print pattern, start_idx ... in 0 computer 3 computer science 3 science 12 , 20 a 22 trie 24 , 29 also 31 called 36 digital 43 ... skipped ... array 246 where 252 where the 252 the 258 the keys 258 keys 262 are 267 usually 271 strings 279

The text is the 1st sentence of https://en.wikipedia.org/wiki/Trie. For normalization, remove capitalizations and add single white space before/after symbols.

`search_all_patterns`returns an iterator. Each searched pattern is represented as a tuple`(pattern_string, pattern_start_index)`. The results are sorted by the start index. If you want to get the result as a`list`object, use`list`function as follow:>>> patterns = list(trie.search_all_patterns(text))

#### TrieSearch.search_longest_patterns

Search longest patterns in an input text:

>>> for pattern, start_idx in sorted(trie.search_longest_patterns(text), key=lambda x: x[1]): ... print pattern, start_idx ... in 0 computer science 3 , 20 a 22 trie 24 , 29 also 31 called 36 digital tree 43 and 56 sometimes 60 radix tree 70 or 81 prefix tree 84 ( 96 as 98 they 101 can 106 be 110 by 122 prefixes 125 ) 134 , 136 is a 138 kind 143 of 148 search tree 151 - 163 an 165 ordered tree data structure 168 that 196 is 201 used to 204 store 212 a 218 dynamic set 220 or 232 associative array 235 where the 253 the keys 259 are 268 usually 272 strings 280

`search_all_patterns`also returns an iterator. The result sorted by the length of patterns. In the above example, the result is re-sorted by the start index.

### trie_search.RecordTrieSearch

`trie_search.RecordTrieSearch` is a sub class of
`marisa_trie.RecordTrie`, which maps unicode keys to data tuples.

The functions, `search_all_patterns` and `search_longest_patterns`,
are also implemented in `trie_search.RecordTrieSearch`.

## Project details

## Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Filename, size & hash SHA256 hash help | File type | Python version | Upload date |
---|---|---|---|

trie_search-0.3.0-py2.7.egg (6.7 kB) Copy SHA256 hash SHA256 | Egg | 2.7 | |

trie_search-0.3.0-py2-none-any.whl (6.6 kB) Copy SHA256 hash SHA256 | Wheel | py2 | |

trie_search-0.3.0-py3.5.egg (6.9 kB) Copy SHA256 hash SHA256 | Egg | 3.5 | |

trie_search-0.3.0-py3-none-any.whl (6.6 kB) Copy SHA256 hash SHA256 | Wheel | py3 | |

trie-search-0.3.0.tar.gz (4.7 kB) Copy SHA256 hash SHA256 | Source | None |