Reproducible Experimentation for Computational Linguistics Use
Recluse
Author: L. Amber Wilcox-O’Hearn
Contact: amber@cs.toronto.edu
Released under the GNU Affero General Public License; see the COPYING file for details.
Introduction
Recluse (Reproducible Experimentation for Computational Linguistics Use) is a set of tools for running computational linguistics experiments reproducibly.
This version contains:
utils, which provides a function for reading and writing Unicode text in plain or compressed files.
article_randomiser, which randomly but reproducibly divides a corpus into training, development, and test sets.
nltk_based_segmenter_tokeniser, which performs sentence segmentation and word tokenisation. It is optimised for Wikipedia-style text, and it has a mode that preserves the untokenised text (modulo extra whitespace).
vocabulary_generator and its helper class vocabulary_cutter, which wrap SRILM to produce unigram counts and then select the most frequent words.
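The idea behind a reproducible random split, as described for article_randomiser above, can be illustrated with a seeded random number generator. The following is a minimal sketch only: the function name split_corpus, its parameters, and the 60/20/20 proportions are hypothetical and are not taken from Recluse's actual interface.

```python
import random


def split_corpus(articles, seed, proportions=(0.6, 0.2, 0.2)):
    """Hypothetical helper (not Recluse's API): split a list of articles
    into training, development, and test sets, deterministically for a
    given seed."""
    # A private generator keeps the result independent of global random state.
    rng = random.Random(seed)
    shuffled = list(articles)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)

    n = len(shuffled)
    train_end = int(n * proportions[0])
    dev_end = train_end + int(n * proportions[1])
    # Whatever remains after the training and development slices is the test set.
    return shuffled[:train_end], shuffled[train_end:dev_end], shuffled[dev_end:]


articles = [f"article-{i}" for i in range(10)]
first = split_corpus(articles, seed=7)
second = split_corpus(articles, seed=7)
```

Because the shuffle depends only on the seed and the input order, rerunning the split with the same seed on the same corpus yields the same three sets, which is what makes an experiment built on them repeatable.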