Fast text/token matching and replacement
Python matchtext
NOTE: this package is still very much in development, and large parts are not implemented yet!
Python 3 package for fast text matching and replacing.
This library implements two fast approaches for matching keywords/gazetteer entries:
- TokenMatcher: keywords/gazetteer entries are sequences of tokens, optionally associated with some data, and the matcher tries to match any of those entries in a given sequence of tokens.
- StringMatcher: keywords/gazetteer entries are strings, optionally associated with some data, and the matcher tries to match any of those entries in a given string, optionally only at non-word boundaries.
The matchers are designed to be fast and memory-efficient: TokenMatcher is implemented as a hash tree, while StringMatcher uses an efficient character trie underneath. Both matchers implement additional features often required in NLP:
- mapfunc: tokens/characters can be mapped to some canonical form that is used for matching
- ignorefunc: some tokens/characters can be entirely ignored for matching
- match all/longest: return all entries that match at a position versus only the longest one
- skip/noskip: once a match is found, continue matching after the end of the longest match versus at the next position
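To make the four options above concrete, here is a minimal sketch of a token matcher built on a hash tree of nested dictionaries. It is not the matchtext API: the class `ToyTokenMatcher`, its methods, and the parameters `longest_only` and `skip` are hypothetical names chosen for illustration only, and the real package may behave differently in its details.

```python
# Illustrative sketch only: NOT the matchtext API. All names are hypothetical.
_END = object()  # sentinel key: "an entry ends at this node"


class ToyTokenMatcher:
    def __init__(self, mapfunc=None, ignorefunc=None):
        self.root = {}  # hash tree: nested dicts keyed by mapped tokens
        self.mapfunc = mapfunc or (lambda tok: tok)
        self.ignorefunc = ignorefunc or (lambda tok: False)

    def add(self, entry, data=None):
        """Store a token sequence, dropping ignored tokens and mapping the rest."""
        node = self.root
        for tok in entry:
            if self.ignorefunc(tok):
                continue
            node = node.setdefault(self.mapfunc(tok), {})
        node[_END] = data

    def _matches_at(self, tokens, start):
        """Return all (start, end, data) matches beginning at `start`, shortest first."""
        found = []
        node = self.root
        for i in range(start, len(tokens)):
            tok = tokens[i]
            if self.ignorefunc(tok):
                continue  # ignored tokens do not consume a tree level
            node = node.get(self.mapfunc(tok))
            if node is None:
                break
            if _END in node:
                found.append((start, i + 1, node[_END]))
        return found

    def find(self, tokens, longest_only=True, skip=True):
        results = []
        pos = 0
        while pos < len(tokens):
            if self.ignorefunc(tokens[pos]):
                pos += 1  # a match never starts on an ignored token
                continue
            found = self._matches_at(tokens, pos)
            if not found:
                pos += 1
                continue
            longest = found[-1]
            results.extend([longest] if longest_only else found)
            # skip: resume after the longest match; noskip: try the next position
            pos = longest[1] if skip else pos + 1
        return results


matcher = ToyTokenMatcher(mapfunc=str.lower, ignorefunc=lambda tok: tok == "-")
matcher.add(["New", "York"], "city")
matcher.add(["New", "York", "Times"], "newspaper")
matcher.add(["York", "Times"], "partial")

tokens = ["the", "New", "-", "York", "Times", "office"]
longest_skip = matcher.find(tokens)                      # [(1, 5, 'newspaper')]
all_skip = matcher.find(tokens, longest_only=False)      # adds (1, 4, 'city')
longest_noskip = matcher.find(tokens, skip=False)        # adds (3, 5, 'partial')
```

In this sketch the mapping function lower-cases tokens so "New" matches "new", the ignore function lets the hyphen be skipped inside a match, and the two flags control which of the overlapping entries are reported. Match spans are given as (start, end) token offsets with an exclusive end.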
Tests
To run the tests manually and get any print output on the console:
PYTHONPATH=`pwd` pytest -s