Fast text/token matching and replacement
Python matchtext
NOTE: this package is still very much in development, and large parts are not implemented yet!
Python 3 package for fast text matching and replacing.
This library implements two fast approaches for matching keywords/gazetteer entries:
- TokenMatcher: keywords/gazetteer entries are sequences of tokens, optionally associated with some data. The matcher tries to match any of those entries in a given sequence of tokens.
- StringMatcher: keywords/gazetteer entries are strings, optionally associated with some data. The matcher tries to match any of those entries in a given string, optionally only at non-word boundaries.
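To make the string case concrete, here is a minimal sketch of gazetteer matching over a character trie. It is not the matchtext API: `build_char_trie`, `find_matches`, `is_word_char` and the `boundaries` flag are hypothetical names chosen for illustration. The boundary check is one possible reading of the option, namely that a match is rejected when it is directly preceded or followed by a word character.

```python
# Illustrative sketch only: all names here are hypothetical, not the matchtext API.

END = "$data"  # multi-character key, so it can never collide with a single character


def build_char_trie(entries):
    """Build a nested-dict character trie from a {keyword: data} mapping."""
    root = {}
    for keyword, data in entries.items():
        node = root
        for ch in keyword:
            node = node.setdefault(ch, {})
        node[END] = (keyword, data)  # marks the end of a complete entry
    return root


def is_word_char(ch):
    """Assumed definition of a word character: alphanumeric or underscore."""
    return ch.isalnum() or ch == "_"


def find_matches(trie, text, boundaries=False):
    """Return (start, end, keyword, data) for every entry found in text.

    With boundaries=True, a match is kept only if it is not directly
    preceded or followed by a word character.
    """
    matches = []
    for start in range(len(text)):
        if boundaries and start > 0 and is_word_char(text[start - 1]):
            continue
        node = trie
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:
                break  # no entry continues with this character
            if END in node:
                after = end + 1
                if boundaries and after < len(text) and is_word_char(text[after]):
                    continue  # entry ends in the middle of a word
                keyword, data = node[END]
                matches.append((start, after, keyword, data))
    return matches


trie = build_char_trie({"cat": "animal", "catalog": "thing"})
print(find_matches(trie, "a cat in a catalog"))
print(find_matches(trie, "a cat in a catalog", boundaries=True))
```

Without the boundary check, "cat" is also reported inside "catalog". With it, only the standalone "cat" and the full "catalog" remain.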
The matchers are implemented to be fast and memory-efficient: TokenMatcher is implemented as a hash tree, and StringMatcher uses an efficient character trie implementation underneath. Both matchers implement additional features often required in NLP:
- mapfunc: tokens/characters can be mapped to some canonical form that is used for matching
- ignorefunc: some tokens/characters can be entirely ignored for matching
- match all/longest: report all matching entries, or only the longest one
- skip/noskip: if any match is found, continue matching after the end of the longest match (skip) or at the next position (noskip)
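The four options above can be sketched with a toy token matcher built on nested dicts, which is one simple way to realise a hash tree. This is an assumption-laden illustration, not the package's implementation: the class `SketchTokenMatcher`, its methods `add` and `find`, and the parameters `longest_only` and `skip` are hypothetical names.

```python
# Illustrative sketch only: all names here are hypothetical, not the matchtext API.

_END = object()  # sentinel key marking the end of a complete entry


class SketchTokenMatcher:
    """Toy token matcher backed by nested dicts (a simple hash tree)."""

    def __init__(self, mapfunc=None, ignorefunc=None):
        # mapfunc: maps a token to the canonical form used for matching
        # ignorefunc: returns True for tokens that matching should skip over
        self.mapfunc = mapfunc or (lambda tok: tok)
        self.ignorefunc = ignorefunc or (lambda tok: False)
        self.root = {}

    def add(self, tokens, data=None):
        """Register a token sequence, optionally with associated data."""
        node = self.root
        for tok in tokens:
            if self.ignorefunc(tok):
                continue
            node = node.setdefault(self.mapfunc(tok), {})
        node[_END] = (tuple(tokens), data)

    def _matches_at(self, tokens, start):
        """All entries matching from position start, shortest first."""
        node, found = self.root, []
        for pos in range(start, len(tokens)):
            tok = tokens[pos]
            if self.ignorefunc(tok):
                continue  # ignored tokens do not break a match
            node = node.get(self.mapfunc(tok))
            if node is None:
                break
            if _END in node:
                entry, data = node[_END]
                found.append((start, pos + 1, entry, data))
        return found

    def find(self, tokens, longest_only=False, skip=True):
        """Return (start, end, entry, data) tuples for matches in tokens."""
        results, pos = [], 0
        while pos < len(tokens):
            found = [] if self.ignorefunc(tokens[pos]) else self._matches_at(tokens, pos)
            if found:
                results.extend(found[-1:] if longest_only else found)
                # skip: resume after the longest match; noskip: at the next position
                pos = found[-1][1] if skip else pos + 1
            else:
                pos += 1
        return results


matcher = SketchTokenMatcher(mapfunc=str.lower, ignorefunc=lambda tok: tok == ",")
matcher.add(["new", "york"], "city")
matcher.add(["new", "york", "times"], "paper")
matcher.add(["york"], "town")

tokens = ["The", "New", ",", "York", "Times"]
print(matcher.find(tokens, longest_only=True, skip=True))
print(matcher.find(tokens, longest_only=False, skip=False))
```

In this sketch the first call reports only the "new york times" entry, while the second also reports "new york" and the "york" entry that starts inside the longer match.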
Tests
To run the tests manually and get any print output on the console:
PYTHONPATH=`pwd` pytest -s