Trie-search is a package for text pattern search using marisa-trie
Installation
$ pip install trie-search
Usage
Create trie dictionary
Before using this package, you need to create a trie dictionary or prepare a list of patterns.
The following example builds a marisa_trie.Trie dictionary from the list of article titles in the English version of Wikipedia and saves it to ./example/triedict.
$ cd ./example
$ curl https://dumps.wikimedia.org/enwiki/20170101/enwiki-20170101-all-titles-in-ns0.gz | gzcat | python create_triedict.py
NOTICE: This script consumes more than 2 GB of memory.
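The contents of create_triedict.py are not shown here. As a rough sketch of what such a script might do, assuming titles arrive one per line with underscores in place of spaces, the following builds a marisa_trie.Trie and saves it. The helper names iter_titles and build_triedict, and the lowercasing step, are illustrative assumptions and not the script's actual code.

```python
def iter_titles(lines):
    """Yield cleaned-up titles from an iterable of raw lines.

    Assumption: one title per line, with underscores standing in for spaces.
    """
    for line in lines:
        title = line.strip().replace("_", " ").lower()
        if title:  # skip blank lines
            yield title


def build_triedict(titles, path):
    """Build a marisa_trie.Trie from the titles and save it to path."""
    # Third-party dependency, imported here so iter_titles works without it.
    import marisa_trie

    # Materialize the keys first; this is where most of the memory goes.
    trie = marisa_trie.Trie(list(titles))
    trie.save(path)
    return trie
```

A script built on these helpers could call build_triedict(iter_titles(sys.stdin), 'triedict') to read the piped title list from standard input.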
trie_search.TrieSearch
Create an instance and load the dictionary:
>>> import trie_search
>>> trie = trie_search.TrieSearch(filepath='./example/triedict')
If you have a list or tuple of patterns, you can create an instance as follows:
>>> patterns = [u'pattern1', u'pattern2', u'pattern3']
>>> trie = trie_search.TrieSearch(patterns)
TrieSearch.search_all_patterns
Search for all patterns in an input text:
>>> text = (u'in computer science , a trie , also called digital tree and '
... u'sometimes radix tree or prefix tree ( as they can be searched '
... u'by prefixes ) , is a kind of search tree - an ordered tree data '
... u'structure that is used to store a dynamic set or associative array '
... u'where the keys are usually strings .')
>>> for pattern, start_idx in trie.search_all_patterns(text):
... print(pattern, start_idx)
...
in 0
computer 3
computer science 3
science 12
, 20
a 22
trie 24
, 29
also 31
called 36
digital 43
... skipped ...
array 247
where 253
where the 253
the 259
the keys 259
keys 263
are 268
usually 272
strings 280
The text is the first sentence of https://en.wikipedia.org/wiki/Trie, normalized by lowercasing it and adding a single space before and after each symbol.
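The exact normalization rules used for the example are not given. A minimal sketch, assuming that a "symbol" means any character that is neither a word character nor whitespace, could look like this (the function name normalize is hypothetical and not part of trie_search):

```python
import re

# Assumption: a "symbol" is any character that is neither a word
# character nor whitespace.
_SYMBOL = re.compile(r"([^\w\s])")


def normalize(text):
    """Lowercase text and put a single space around every symbol."""
    spaced = _SYMBOL.sub(r" \1 ", text.lower())
    # Collapse the runs of whitespace the substitution may have created.
    return " ".join(spaced.split())
```

Under this rule hyphenated words are split as well, so the input may need extra handling if such words appear among the patterns.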
search_all_patterns returns an iterator. Each match is a tuple (pattern_string, pattern_start_index), and the results are sorted by start index. To get the results as a list, use the list function as follows:
>>> patterns = list(trie.search_all_patterns(text))
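To illustrate the idea behind this kind of search, here is a self-contained sketch that uses a plain nested-dict trie instead of marisa-trie. It is not the package's implementation: the names build_trie and search_all, and the rule that a match must start and end on a token boundary, are assumptions chosen to reproduce the kind of output shown above.

```python
END = ""  # marker key: a complete pattern ends at this node


def build_trie(patterns):
    """Build a character-level trie as nested dicts."""
    root = {}
    for pattern in patterns:
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def search_all(trie, text, splitter=" "):
    """Yield (pattern, start_index) for every pattern found in text.

    Matches must start and end on a token boundary, so 'tri' does not
    match inside 'trie'.
    """
    n = len(text)
    for start in range(n):
        # Only begin matching at the first character of a token.
        if text[start] == splitter or (start > 0 and text[start - 1] != splitter):
            continue
        node = trie
        for end in range(start, n):
            node = node.get(text[end])
            if node is None:
                break
            at_boundary = end + 1 == n or text[end + 1] == splitter
            if END in node and at_boundary:
                yield text[start:end + 1], start


demo_trie = build_trie(["computer", "computer science", "science", "trie", "tri"])
demo_matches = list(search_all(demo_trie, "in computer science , a trie"))
# demo_matches == [('computer', 3), ('computer science', 3), ('science', 12), ('trie', 24)]
```

Because the outer loop walks start positions from left to right and the inner loop extends each match one character at a time, the results come out ordered by start index and, within one start index, from shortest to longest.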
TrieSearch.search_longest_patterns
Search for the longest patterns in an input text:
>>> for pattern, start_idx in trie.search_longest_patterns(text):
... print(pattern, start_idx)
...
in 0
computer science 3
, 20
a 22
trie 24
, 29
also 31
called 36
digital tree 43
and 56
sometimes 60
radix tree 70
or 81
prefix tree 84
( 96
as 98
they 101
can 106
be 110
by 122
prefixes 125
) 134
, 136
is a 138
kind 143
of 148
search tree 151
- 163
an 165
ordered tree data structure 168
that 196
is 201
used to 204
store 212
a 218
dynamic set 220
or 232
associative array 235
where the 253
the keys 259
are 268
usually 272
strings 280
search_longest_patterns also returns an iterator. The results are sorted by pattern length; in the example above, the output is shown re-sorted by start index.
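The rule that decides which patterns count as "longest" is not spelled out. One reading that is consistent with the search_longest_patterns output above is that a match is dropped when its span lies entirely inside the span of a longer match, while merely overlapping matches (such as "where the" and "the keys") are both kept. The sketch below implements that reading; keep_longest is a hypothetical helper and not part of trie_search.

```python
def keep_longest(matches):
    """Drop every match whose span lies inside the span of a longer match.

    matches: iterable of (pattern, start_index) tuples.
    Returns the surviving matches, longest pattern first.
    """
    kept = []
    # Visit longer patterns first so any containing span is already kept.
    ordered = sorted(set(matches), key=lambda m: (-len(m[0]), m[1]))
    for pattern, start in ordered:
        end = start + len(pattern)
        contained = any(s <= start and end <= s + len(p) for p, s in kept)
        if not contained:
            kept.append((pattern, start))
    return kept


demo = [
    ("computer", 3), ("computer science", 3), ("science", 12),
    ("where", 253), ("where the", 253), ("the", 259),
    ("the keys", 259), ("keys", 263),
]
longest_by_start = sorted(keep_longest(demo), key=lambda m: m[1])
# longest_by_start == [('computer science', 3), ('where the', 253), ('the keys', 259)]
```

The helper returns its results ordered by pattern length, so the final sorted call is what puts them back in start-index order.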
trie_search.RecordTrieSearch
trie_search.RecordTrieSearch is a subclass of marisa_trie.RecordTrie, which maps unicode keys to data tuples.
The search_all_patterns and search_longest_patterns methods are also implemented in trie_search.RecordTrieSearch.