English lengthened expression normalizer (e.g., coooolllll!!! -> cool!)
Udon is a text normalizer for lengthened English expressions, that is, words written with repeated letters.
(e.g., Udon converts “cooooooooooooooollllllllllllll” to “cool”)
This module is based on the following paper:
Samuel Brody and Nicholas Diakopoulos. Cooooooooooooooollllllllllllll!!!!!!!!!!!!!! using word lengthening to detect sentiment in microblogs. In EMNLP2011, pp. 562-570, 2011.
$ pip install udon
>>> import udon
>>> udon.normalize_sentence('you are coooolll!!!')
you are cool!
>>> udon.normalize_word('okayyyyy')
okay
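One way to picture table-backed normalization is sketched below. This is not udon's actual implementation: the table entries, the function names, and the two-step "collapse, then look up" approach are all assumptions made for illustration. Runs of three or more identical letters are first collapsed to two, and the result is then looked up in a small hypothetical table that maps collapsed forms to canonical words.

```python
import re

# Hypothetical table (illustrative entries only, not udon's data):
# collapsed form -> canonical word.
TABLE = {"cooll": "cool", "okayy": "okay"}


def normalize_word_sketch(word):
    """Collapse runs of 3+ identical characters to 2, then consult the table."""
    collapsed = re.sub(r"(.)\1{2,}", r"\1\1", word)
    # Fall back to the collapsed form when the table has no entry.
    return TABLE.get(collapsed, collapsed)


def normalize_sentence_sketch(sentence):
    """Normalize each ASCII word and squeeze runs of '!' or '?' to one character."""
    def fix(match):
        token = match.group(0)
        if token[0].isalpha():
            return normalize_word_sketch(token)
        # Punctuation token: reduce any repeated run to a single character.
        return re.sub(r"(.)\1+", r"\1", token)

    # Everything not matched here (spaces, digits, other symbols) is left untouched.
    return re.sub(r"[A-Za-z]+|[!?]+", fix, sentence)


print(normalize_word_sketch("okayyyyy"))
print(normalize_sentence_sketch("you are coooolll!!!"))
```

With the two table entries above, the sketch reproduces the outputs of the examples shown earlier; words missing from the table are only collapsed, not corrected.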
Shorten repeated substrings down to a threshold, without using a dictionary
>>> udon.cut_repeat('mamisaaaaaan', 1)
mamisan
>>> udon.cut_repeat('okayyyyy', 2)
okayy
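The behaviour in the examples above can be approximated with a single regular expression. The function below is a minimal sketch rather than udon's own code: the name `cut_repeat_sketch` is hypothetical, and it assumes that a "repeated substring" means a run of one repeated character, which is enough to reproduce both examples.

```python
import re


def cut_repeat_sketch(text, threshold):
    """Shorten every run of one repeated character to `threshold` characters."""
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    # (.) captures one character; \1{threshold,} requires at least `threshold`
    # further copies, so only runs longer than `threshold` are matched.
    pattern = re.compile(r"(.)\1{%d,}" % threshold)
    return pattern.sub(lambda m: m.group(1) * threshold, text)


print(cut_repeat_sketch("mamisaaaaaan", 1))  # mamisan
print(cut_repeat_sketch("okayyyyy", 2))      # okayy
```

Because no word list is involved, a threshold of 1 also shortens legitimate double letters (for instance "cool" becomes "col"), which is why a dictionary-backed step is needed for full normalization.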
- cut_repeat(str, threshould)
- Note that this method does not use the lengthened-expression normalization table (e.g., cooll -> cool). To normalize such expressions, use normalize_word() or normalize_sentence() instead.
- Support Japanese lengthened expressions
Contributions are welcome!
- This module is licensed under the MIT License.
Available on Python 3.x