4 projects
uroman
uroman is a universal romanizer. It converts text in any script to the standard Latin alphabet.
wildebeest-nlp
The wildebeest scripts investigate, repair, and normalize a wide range of text file problems at the character level. Examples include fixing encoding errors, normalizing characters into their canonical form, mapping digits and some punctuation to ASCII, and deleting some non-printable characters.
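To make the kind of character-level cleanup described above concrete, here is a minimal sketch that uses only Python's standard `unicodedata` module. It is not wildebeest's implementation or API; the function name `normalize_text` is hypothetical, and using NFKC as the "canonical form" is an assumption made for illustration.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Toy character-level cleanup (hypothetical helper, not wildebeest's code)."""
    # NFKC folds compatibility characters (e.g. fullwidth letters) into
    # their canonical equivalents. Choosing NFKC here is an assumption.
    text = unicodedata.normalize("NFKC", text)
    cleaned = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Nd":
            # Map any Unicode decimal digit (e.g. Arabic-Indic) to ASCII.
            cleaned.append(str(unicodedata.decimal(ch)))
        elif category in ("Cc", "Cf") and ch not in "\n\t":
            # Drop control and format characters, keeping newlines and tabs.
            continue
        else:
            cleaned.append(ch)
    return "".join(cleaned)
```

This sketch is deliberately blunt: dropping every format character also removes zero-width joiners and non-joiners, which carry meaning in some scripts, so a real tool would need to be more selective.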
greekroom
The Greek Room will be a suite of tools supporting Biblical natural language processing.
utoken
utoken is a universal tokenizer (multilingual word segmenter) that divides text into words, punctuation and special tokens such as numbers, URLs, XML tags, email addresses and hashtags. It comes with a companion detokenizer.
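As a rough illustration of what splitting text into words, punctuation and special tokens can look like, the following sketch uses a single regular expression from Python's standard `re` module. It does not use or reproduce utoken; the names `TOKEN_RE` and `tokenize` are hypothetical, and the patterns are simplified assumptions that a real multilingual tokenizer would go well beyond.

```python
import re

# Alternatives are ordered from most specific to most general, so special
# tokens are recognized before the generic word and punctuation rules.
TOKEN_RE = re.compile(
    r"""
    https?://\S+                    # URLs
  | [\w.+-]+@[\w-]+(?:\.[\w-]+)+    # email addresses
  | </?\w+[^>]*>                    # XML tags
  | \#\w+                           # hashtags
  | \d+(?:[.,]\d+)*                 # numbers such as 1,000.5
  | \w+(?:['’]\w+)*                 # words, allowing internal apostrophes
  | [^\w\s]                         # any other single punctuation mark
    """,
    re.VERBOSE,
)


def tokenize(text: str) -> list[str]:
    """Return the tokens of `text` in order (toy example, not utoken)."""
    return TOKEN_RE.findall(text)
```

For example, `tokenize("Email me@example.com about #nlp!")` keeps the address and the hashtag as single tokens and splits off the final exclamation mark. One known limitation of this sketch is that the URL pattern also swallows punctuation directly attached to the end of a URL.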