Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transform functions that produce clean_html and clean_visible, transforms that create labels from hyperlinks to particular sites (e.g. Wikipedia), and taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
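To make the idea of chained transforms concrete, here is a minimal sketch using only the Python standard library. It does not use the streamcorpus_pipeline API: `ToyItem`, `make_clean_visible`, `label_wikipedia_links`, and `run_pipeline` are hypothetical names invented for illustration, and the tag stripping shown here is a simplification rather than the package's actual clean_visible logic.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class ToyItem:
    """Hypothetical stand-in for a stream item; NOT streamcorpus.StreamItem."""
    raw: str
    clean_visible: Optional[str] = None
    labels: List[str] = field(default_factory=list)


class _TextAndLinks(HTMLParser):
    """Collects visible text fragments and the href of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.text: List[str] = []
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

    def handle_data(self, data):
        self.text.append(data)


def _parse(raw: str) -> _TextAndLinks:
    parser = _TextAndLinks()
    parser.feed(raw)
    parser.close()
    return parser


def make_clean_visible(item: ToyItem) -> ToyItem:
    """Toy transform: drop tags and collapse runs of whitespace."""
    item.clean_visible = " ".join("".join(_parse(item.raw).text).split())
    return item


def label_wikipedia_links(item: ToyItem) -> ToyItem:
    """Toy transform: record every hyperlink that points at Wikipedia."""
    item.labels = [h for h in _parse(item.raw).hrefs if "wikipedia.org" in h]
    return item


def run_pipeline(
    items: Iterable[ToyItem],
    transforms: List[Callable[[ToyItem], ToyItem]],
) -> Iterator[ToyItem]:
    """Apply each transform, in order, to every item."""
    for item in items:
        for transform in transforms:
            item = transform(item)
        yield item
```

Under these assumptions, passing a list of `ToyItem` objects and the two transforms to `run_pipeline` yields items with `clean_visible` and `labels` filled in. The real package operates on StreamItem objects read from Chunks, so treat this only as an outline of the transform-chain pattern.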
Read more at [streamcorpus.org](http://streamcorpus.org/)
Hashes for streamcorpus_pipeline-0.6.1.dev1.tar.gz
Algorithm | Hash digest |
---|---|
SHA256 | be59ae7305a6b4a10aaa972580a09c562151c4dbb5085be07c34f68c42d1e213 |
MD5 | 66dc0f6d0f6770606b0865496ebe9e76 |
BLAKE2b-256 | d3739e2f01722881f9a6c3a3d553b8d0f9f170bc5d3dae91d8178698f61c959c |
Hashes for streamcorpus_pipeline-0.6.1.dev1-py2.7.egg
Algorithm | Hash digest |
---|---|
SHA256 | 366ed337c3de846c746a9af1ba0814b0b2cfc0152f841b2b886b45c9cb71dc05 |
MD5 | 77d3b670fd11e7f7ca0eaae95033aaab |
BLAKE2b-256 | 246581d6a82fa333af7fc7deb7dff5541a4cd2f261fda88a4a0f61c5ef1077b2 |
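
The published digests can be checked against a downloaded file. The following is a minimal sketch using Python's standard `hashlib`; the helper names `file_digest` and `matches` are made up for this example, and it assumes that the BLAKE2b-256 value is a BLAKE2b hash with a 32-byte digest.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of the file at `path`.

    `algorithm` is "sha256", "md5", or "blake2b-256" (assumed here to mean
    BLAKE2b with a 32-byte digest).
    """
    if algorithm == "blake2b-256":
        hasher = hashlib.blake2b(digest_size=32)
    else:
        hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in fixed-size blocks so large archives are not loaded at once.
        for block in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(block)
    return hasher.hexdigest()


def matches(path, expected_hex, algorithm="sha256"):
    """True if the file's digest equals the published hex string."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

For example, assuming the source archive has been saved in the current directory, `matches("streamcorpus_pipeline-0.6.1.dev1.tar.gz", "be59ae7305a6b4a10aaa972580a09c562151c4dbb5085be07c34f68c42d1e213")` should return `True` for an intact download.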