Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transform functions that generate clean_html and clean_visible, transforms that create labels from hyperlinks to particular sites (e.g. Wikipedia), and taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
Read more at [streamcorpus.org](http://streamcorpus.org/)
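
To give a feel for what a "visible text" transform does, here is a minimal sketch using only the Python 3 standard library. It is not the package's clean_visible implementation and does not use the streamcorpus_pipeline API; the names `VisibleTextExtractor` and `visible_text` are hypothetical, chosen for illustration only.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Hypothetical helper: collects text outside <script> and <style>."""

    _HIDDEN = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._hidden_depth = 0  # > 0 while inside a hidden element
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self._HIDDEN:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self._HIDDEN and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        # Keep only text that a browser would render.
        if self._hidden_depth == 0:
            self._parts.append(data)

    def text(self):
        # Join fragments with a space, then collapse runs of whitespace.
        # This simplification can split a word that is broken by inline tags.
        return " ".join(" ".join(self._parts).split())


def visible_text(html):
    """Return the rendered text of an HTML string (illustrative only)."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `visible_text` on a small page with a `<style>` block, a paragraph, and a `<script>` block returns only the paragraph text. A production transform would also need to handle character encodings and malformed markup, which this sketch ignores.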
Hashes for streamcorpus_pipeline-0.5.23.dev7.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 7f0b07c53d244db38afef70e2d92407ed90265489a36a2d80ee1149101c4e98c
MD5 | 12a9e05c3c35bbe1c78633868fa3bd05
BLAKE2b-256 | 1449a6ac35e9a256f1773592ec31db08ea03df468f81f82b9d55231f852cf1ea
Hashes for streamcorpus_pipeline-0.5.23.dev7-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | 95a166b6a9af9dcb5eaa512ae1d1cfd1542d41dbee215a1f93aff4cdb0c604d5
MD5 | 8f2153d1e53c85dbbb4bb13352b08626
BLAKE2b-256 | fd41d334137311f50a6730e996da0c3153f2ae9032c84c50cd6f9edf9cf2034f
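
The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch using Python's standard `hashlib` module; the function names `sha256_of_file` and `matches_published_digest` are hypothetical, and the local file path is whatever location you saved the download to.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in fixed-size blocks so large archives are not loaded at once.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 against a digest copied from the table."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, passing the path of a downloaded `streamcorpus_pipeline-0.5.23.dev7.tar.gz` together with the SHA256 value listed for that file should return `True` if the download is intact. SHA256 is used here because MD5 is not considered collision-resistant.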