Fast text/token matching and replacement
Python matchtext
Python 3 package for fast text matching and replacing.
This library implements two fast approaches for matching keywords/gazetteer entries:
- TokenMatcher: keywords/gazetteer entries are sequences of tokens, optionally associated with some data, and the matcher tries to match any of them in a given sequence of tokens.
- StringMatcher: keywords/gazetteer entries are strings, optionally associated with some data, and the matcher tries to match any of them in a given string, optionally only at non-word boundaries.
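To make the token-based idea concrete, here is a minimal sketch of matching token sequences with a tree of nested dictionaries keyed by token. It does not use the matchtext package and is not its implementation; the names `build_token_tree` and `find_token_matches` are hypothetical and exist only for this illustration.

```python
from typing import Any, Dict, Hashable, List, Sequence, Tuple

# Sentinel key marking "a gazetteer entry ends at this node".
_END = object()


def build_token_tree(entries: Dict[Tuple[Hashable, ...], Any]) -> dict:
    """Build nested dicts keyed by token from {token-tuple: data}."""
    root: dict = {}
    for tokens, data in entries.items():
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node[_END] = data
    return root


def find_token_matches(
    tree: dict, tokens: Sequence[Hashable]
) -> List[Tuple[int, int, Any]]:
    """Return (start, end, data) for every entry found; end is exclusive."""
    matches = []
    for start in range(len(tokens)):
        node = tree
        for pos in range(start, len(tokens)):
            node = node.get(tokens[pos])
            if node is None:
                break  # no entry continues with this token
            if _END in node:
                matches.append((start, pos + 1, node[_END]))
    return matches


tree = build_token_tree({
    ("New", "York"): "city",
    ("New", "York", "Times"): "newspaper",
})
token_matches = find_token_matches(
    tree, ["The", "New", "York", "Times", "said"]
)
print(token_matches)
```

Each lookup step is a single dictionary access, so the cost of a match attempt depends on the length of the candidate entry rather than on the number of entries in the gazetteer.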
The matchers are implemented to be fast: TokenMatcher uses a hash tree, and StringMatcher uses a character trie underneath. Both matchers implement additional features often required in NLP:
- return the offsets in the original iterable where a match occurs
- mapfunc: tokens/characters can be mapped to some canonical form that is used for matching
- ignorefunc: some tokens/characters can be entirely ignored for matching
- match all/longest: return all matching entries versus only the longest one
- skip/noskip: if any match is found, continue matching after the end of the longest match versus at the next position
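The features listed above can be sketched with a small character trie. This is an illustration under stated assumptions, not the package's code or API: the class `TrieSketch` and the parameters `longest_only` and `skip` are hypothetical, and only the names `mapfunc` and `ignorefunc` are taken from the feature list.

```python
from typing import Any, Callable, List, Optional, Tuple

_END = object()  # sentinel key: an entry ends at this trie node


class TrieSketch:
    """Hypothetical character trie illustrating the listed features."""

    def __init__(
        self,
        mapfunc: Optional[Callable[[str], str]] = None,
        ignorefunc: Optional[Callable[[str], bool]] = None,
    ) -> None:
        self.root: dict = {}
        self.mapfunc = mapfunc or (lambda ch: ch)
        self.ignorefunc = ignorefunc or (lambda ch: False)

    def add(self, entry: str, data: Any = None) -> None:
        node = self.root
        for ch in entry:
            if self.ignorefunc(ch):
                continue
            node = node.setdefault(self.mapfunc(ch), {})
        node[_END] = (entry, data)

    def find(
        self, text: str, longest_only: bool = True, skip: bool = True
    ) -> List[Tuple[int, int, str, Any]]:
        """Return (start, end, entry, data); offsets index the original text."""
        results = []
        i = 0
        while i < len(text):
            found = []  # matches starting at i, shortest first
            node = self.root
            j = i
            # Never start a match on an ignored character.
            while not self.ignorefunc(text[i]) and j < len(text):
                ch = text[j]
                j += 1
                if self.ignorefunc(ch):
                    continue  # ignored characters do not advance the trie
                node = node.get(self.mapfunc(ch))
                if node is None:
                    break
                if _END in node:
                    entry, data = node[_END]
                    found.append((i, j, entry, data))
            if found:
                results.extend(found[-1:] if longest_only else found)
            # skip: resume after the longest match; noskip: next position.
            i = found[-1][1] if (found and skip) else i + 1
        return results


trie = TrieSketch(mapfunc=str.lower, ignorefunc=lambda ch: ch == "-")
trie.add("email", "noun")
trie.add("email address", "phrase")

text = "My E-mail address"
longest = trie.find(text, longest_only=True, skip=True)
all_matches = trie.find(text, longest_only=False, skip=False)
print(longest)
print(all_matches)
```

Because the reported offsets refer to the original string, `text[3:17]` yields `"E-mail address"` even though matching was done on the lowercased text with the hyphen ignored.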