Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transforms that produce clean_html and clean_visible text, transforms that create labels from hyperlinks to particular sites (e.g. Wikipedia), and taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
Read more at [streamcorpus.org](http://streamcorpus.org/)
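To make the idea of chained transforms concrete, here is a minimal sketch in plain Python. It does not use the streamcorpus_pipeline API: the `Item` class and the functions `strip_tags`, `label_wikipedia_links`, and `run_pipeline` are hypothetical names invented for illustration, and the tag stripping is a toy that only hints at what a clean_visible-style transform might produce.

```python
import re
from dataclasses import dataclass, field
from html import unescape


@dataclass
class Item:
    """Hypothetical stand-in for a stream item; NOT the real StreamItem."""
    raw_html: str
    clean_visible: str = ""
    labels: list = field(default_factory=list)


def strip_tags(item):
    # Toy transform: replace tags with spaces, decode entities,
    # and collapse runs of whitespace.
    text = re.sub(r"<[^>]+>", " ", item.raw_html)
    item.clean_visible = " ".join(unescape(text).split())
    return item


def label_wikipedia_links(item):
    # Toy transform: record every hyperlink that points at Wikipedia.
    pattern = r'href="(https?://[a-z]+\.wikipedia\.org/[^"]+)"'
    item.labels = re.findall(pattern, item.raw_html)
    return item


def run_pipeline(items, transforms):
    # Apply each transform, in order, to every item.
    for item in items:
        for transform in transforms:
            item = transform(item)
        yield item


html = '<p>See <a href="https://en.wikipedia.org/wiki/TREC">TREC</a> &amp; more</p>'
result = list(run_pipeline([Item(html)], [strip_tags, label_wikipedia_links]))[0]
print(result.clean_visible)
print(result.labels)
```

Each transform takes an item and returns it, so stages can be reordered or dropped without touching the others. How the real package configures and orders its stages is not shown here.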
Source Distribution
Built Distribution
Hashes for streamcorpus_pipeline-0.5.32.dev17.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 6b2381cfec11fb4fb5d865c948a5058010ff21f58c31f1656383ecf6cbc85808
MD5 | 7c18c8847af3710645a98b83f9a85c29
BLAKE2b-256 | cb1119c9b3ed4634e33bcb9c3c2f81a3728fb319364e74068ebcd055ff18abf4
Hashes for streamcorpus_pipeline-0.5.32.dev17-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | cb1a8feb7edbe54af704fea70b6cef87ac3db626bef1cee9ff142f9fc31cc4ae
MD5 | 5209de969af693477c728ae8a0129268
BLAKE2b-256 | 930501b274f4aa6e6c83170caed3d2825fc8290013279531904650d94b505713
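A downloaded file can be checked against the SHA256 digests listed above with the standard library's `hashlib`. The following is a small sketch; the helper names `sha256_of_file` and `matches_published_digest` are illustrative, and the local file path in the usage note is an assumption about where the download was saved.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    # Hash the file in fixed-size blocks so large archives
    # do not need to fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    # Compare against the published value, ignoring case and
    # surrounding whitespace.
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, assuming the source distribution was saved in the current directory, `matches_published_digest("streamcorpus_pipeline-0.5.32.dev17.tar.gz", "6b2381cfec11fb4fb5d865c948a5058010ff21f58c31f1656383ecf6cbc85808")` should return `True` for an intact download.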