Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transforms that produce clean_html and clean_visible, transforms that create labels from hyperlinks to particular sites (e.g. Wikipedia), and taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
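To make the idea of chained transforms concrete, here is a minimal sketch in plain Python. It does not use the streamcorpus_pipeline API: `ToyItem`, `strip_tags`, `label_wikipedia_links`, and `run_pipeline` are hypothetical stand-ins invented for illustration, and the tag stripping is deliberately naive compared with what a real clean_visible transform would need to do.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class ToyItem:
    """Hypothetical stand-in for a stream item; NOT streamcorpus.StreamItem."""
    raw: str
    clean_visible: Optional[str] = None
    labels: List[str] = field(default_factory=list)


# A transform takes an item and returns it (possibly modified).
Transform = Callable[[ToyItem], ToyItem]

TAG_RE = re.compile(r"<[^>]*>")
WIKI_LINK_RE = re.compile(r'href="(https?://en\.wikipedia\.org/wiki/[^"]+)"')


def strip_tags(item: ToyItem) -> ToyItem:
    """Naive tag removal: replace tags with spaces, then collapse whitespace."""
    item.clean_visible = " ".join(TAG_RE.sub(" ", item.raw).split())
    return item


def label_wikipedia_links(item: ToyItem) -> ToyItem:
    """Record every hyperlink target that points at English Wikipedia."""
    item.labels = WIKI_LINK_RE.findall(item.raw)
    return item


def run_pipeline(items: Iterable[ToyItem],
                 transforms: List[Transform]) -> Iterator[ToyItem]:
    """Apply each transform, in order, to each item."""
    for item in items:
        for transform in transforms:
            item = transform(item)
        yield item


if __name__ == "__main__":
    doc = ToyItem(raw='<p>See <a href="http://en.wikipedia.org/wiki/TREC">TREC</a> now.</p>')
    for result in run_pipeline([doc], [strip_tags, label_wikipedia_links]):
        print(result.clean_visible)
        print(result.labels)
```

The design point is only that each stage reads and extends the same item, so stages can be reordered or swapped through configuration; the real module's transform names and signatures should be taken from its own documentation.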
Read more at [streamcorpus.org](http://streamcorpus.org/)
Download files
Source Distribution
Hashes for streamcorpus_pipeline-0.5.34.dev2.tar.gz
Algorithm | Hash digest
---|---
SHA256 | b82eecc55b9cd17f0be32b9fa8789dc06f36d5d5241fa2a92a94c15fcba0743e
MD5 | 33eca78ce7c030e2b0272d31e93731f0
BLAKE2b-256 | c3007ceff23d04aff660fe7fc84babc31cda8f36aa2a98e7626d2fbe2d42db0a
Built Distribution
Hashes for streamcorpus_pipeline-0.5.34.dev2-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | 6be8ec6b38be1d7b3e65b4ce309bbd6d2ab6098e850b1b8a61b7138a230bd0af
MD5 | 5fc153c7c6ec15d6a0e14a9377801d99
BLAKE2b-256 | 773f2b7b1b5a8a12e7d72a9b0e82ac147220f9b4644815f1306ba413505a734e
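The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch using only Python's standard `hashlib`; the helper names `file_digests` and `matches_published` are hypothetical, and the file path is assumed to point at a local copy of one of the files listed above. BLAKE2b-256 is computed here as BLAKE2b with a 32-byte digest.

```python
import hashlib


def file_digests(path, chunk_size=65536):
    """Return hex digests of a file for the three algorithms listed above."""
    hashers = {
        "SHA256": hashlib.sha256(),
        "MD5": hashlib.md5(),
        # BLAKE2b-256 is BLAKE2b with a 32-byte (256-bit) digest.
        "BLAKE2b-256": hashlib.blake2b(digest_size=32),
    }
    with open(path, "rb") as fh:
        # Read in fixed-size blocks so large archives are not loaded at once.
        for block in iter(lambda: fh.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(block)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def matches_published(path, published):
    """True if every published digest (name -> hex string) matches the file."""
    actual = file_digests(path)
    return all(actual[name] == digest.lower()
               for name, digest in published.items())
```

For example, calling `matches_published` with the local path of the downloaded `.tar.gz` and a dictionary mapping `"SHA256"` to the value from the table should return `True` for an intact download. SHA256 or BLAKE2b-256 is the more meaningful check; MD5 is included only because it is published.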