10 projects
sentence-tk-checker
Checks the output of an English sentence tokenizer and modifies it according to default or user-defined rules.
ht-getter
Searches a document for hashtags. Supports multiple natural languages. Works in various contexts.
fz-word-finder
Fuzzy-match word finder. Supports multiple simultaneous target strings and fuzzy-match rules (no regex or normalization).
ko-ww-stopwords
Set of whole-word (independent) stop words in Korean.
mnl-punct-norm
Lightweight tool for removing punctuation. Supports multiple natural languages.
mnl-ws-norm
Lightweight tool for normalizing whitespace and accurately tokenizing words (no regex). Supports multiple natural languages.
back-cleaner
Server-side Python tool for escaping script tags and converting characters into their HTML entity equivalents (no regex).
kr-sentence
Lightweight sentence tokenizer for Korean.
zh-sentence
Lightweight sentence tokenizer for Chinese languages.
ja-sentence
Lightweight sentence tokenizer for Japanese.
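
Several of the descriptions above mention regex-free whitespace normalization and sentence tokenization. As a rough illustration of what such an approach can look like, here is a minimal sketch that uses only built-in string methods. The names `normalize_whitespace`, `split_sentences`, and `SENTENCE_ENDERS` are hypothetical and are not taken from any of the listed packages, whose actual APIs and rules may differ. The splitter is deliberately naive: it does not handle closing quotes or runs of consecutive terminators.

```python
# Hypothetical helpers for illustration only; not the API of any package listed above.

# Assumed terminator set: full-width marks common in Chinese and Japanese text.
SENTENCE_ENDERS = frozenset("。！？")


def normalize_whitespace(text: str) -> str:
    """Collapse every run of Unicode whitespace into a single ASCII space."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # including the ideographic space U+3000, and drops empty pieces.
    return " ".join(text.split())


def split_sentences(text: str) -> list:
    """Split text after each terminator, keeping the terminator attached."""
    sentences = []
    current = []
    for char in text:
        current.append(char)
        if char in SENTENCE_ENDERS:
            sentence = "".join(current).strip()
            if sentence:
                sentences.append(sentence)
            current = []
    # Keep any trailing text that has no terminator.
    tail = "".join(current).strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    print(normalize_whitespace("  a \t b\u3000c\n"))
    print(split_sentences("今日は晴れ。明日は雨？はい"))
```

Walking the string one character at a time keeps the behavior easy to audit and to extend with per-language terminator sets, at the cost of the compactness a regular expression would offer.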