Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transforms that generate clean_html and clean_visible, transforms that create labels from hyperlinks to particular sites (e.g. Wikipedia), and taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
Read more at [streamcorpus.org](http://streamcorpus.org/)
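To make the idea of chained transforms concrete, here is a minimal, self-contained sketch using only the Python standard library. It does not use or reproduce the streamcorpus_pipeline API: `SketchItem`, `to_clean_visible`, `label_wikipedia_links`, and `run_pipeline` are hypothetical names invented for illustration, and the tag stripping is a deliberately naive regex, not the package's clean_visible logic.

```python
import re
from dataclasses import dataclass, field
from html import unescape
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class SketchItem:
    """Hypothetical stand-in for a stream item; NOT streamcorpus.StreamItem."""
    raw_html: str
    clean_visible: Optional[str] = None
    labels: List[str] = field(default_factory=list)


# Naive patterns, assumed for this sketch only.
TAG_RE = re.compile(r"<[^>]+>")
WIKI_HREF_RE = re.compile(r'href="(https?://en\.wikipedia\.org/wiki/[^"]+)"')


def to_clean_visible(item: SketchItem) -> SketchItem:
    # Replace tags with spaces, decode entities, then collapse whitespace.
    text = unescape(TAG_RE.sub(" ", item.raw_html))
    item.clean_visible = " ".join(text.split())
    return item


def label_wikipedia_links(item: SketchItem) -> SketchItem:
    # Record every hyperlink target that points at English Wikipedia.
    item.labels.extend(WIKI_HREF_RE.findall(item.raw_html))
    return item


def run_pipeline(
    items: Iterable[SketchItem],
    transforms: List[Callable[[SketchItem], SketchItem]],
) -> Iterator[SketchItem]:
    # Apply each transform in order to every item, yielding the results.
    for item in items:
        for transform in transforms:
            item = transform(item)
        yield item


html = '<p>See <a href="https://en.wikipedia.org/wiki/TREC">TREC</a> &amp; more.</p>'
result = next(run_pipeline([SketchItem(html)], [to_clean_visible, label_wikipedia_links]))
print(result.clean_visible)
print(result.labels)
```

Each transform takes an item and returns it, so stages can be reordered or dropped without touching the others. That is a design choice of this sketch and is not documented behavior of the package.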
Download files
Source Distribution
Hashes for streamcorpus_pipeline-0.7.13.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 9506319a7c0dfb603e585086a8ab8112716ae691342fed827aa3d3b84c03dee7
MD5 | 86edf254a9f713dfad697ce8d12adfc3
BLAKE2b-256 | 51e1842005d9022466ddded73dc76773cb72434ae6a5a036ff26c60715b750dd
Built Distribution
Hashes for streamcorpus_pipeline-0.7.13-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | 2181cc628f096a3e960d4cf9c988b6388981a94f2864ebc5cffeb7dc6e3d6d28
MD5 | da3b16353a7b6d63ac16f9d766765982
BLAKE2b-256 | 0bdb33f4d928ae180be25717c7ba0a9ab56cbabf45735004b4042990e39336ad
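A downloaded file can be checked against the SHA256 digests listed above. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `sha256_of_file` and `matches_published_digest` are made up for this example, and the local file path in the usage note is an assumption about where the download was saved.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes at EOF.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, assuming the source distribution was saved in the current directory, `matches_published_digest("streamcorpus_pipeline-0.7.13.tar.gz", "9506319a7c0dfb603e585086a8ab8112716ae691342fed827aa3d3b84c03dee7")` should return `True` for an intact download.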