Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transform functions that generate clean_html and clean_visible, transforms that create labels from hyperlinks to particular sites (e.g. Wikipedia), and taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
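The idea of running a sequence of transforms over the items in a chunk can be illustrated with a small, self-contained sketch. It does not use the streamcorpus_pipeline API: `ToyItem`, `strip_tags`, `label_wikipedia_links`, and `run_pipeline` are hypothetical names invented for this illustration, and the real StreamItem schema and transform interfaces are considerably richer.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List, Optional


# Hypothetical stand-in for streamcorpus.StreamItem (assumption: the real
# class has a different, much richer schema).
@dataclass
class ToyItem:
    raw: str
    clean_visible: Optional[str] = None
    labels: List[str] = field(default_factory=list)


# A transform takes an item and returns it, or None to drop it.
Transform = Callable[[ToyItem], Optional[ToyItem]]


def strip_tags(item: ToyItem) -> ToyItem:
    """Fill clean_visible with the raw text minus anything tag-like."""
    text = re.sub(r"<[^>]+>", "", item.raw)
    item.clean_visible = " ".join(text.split())  # normalize whitespace
    return item


def label_wikipedia_links(item: ToyItem) -> ToyItem:
    """Record the target of every hyperlink that points at Wikipedia."""
    for href in re.findall(r'href="([^"]+)"', item.raw):
        if "wikipedia.org" in href:
            item.labels.append(href)
    return item


def run_pipeline(chunk: Iterable[ToyItem],
                 transforms: List[Transform]) -> Iterator[ToyItem]:
    """Apply each transform in order to every item in the chunk."""
    for item in chunk:
        current: Optional[ToyItem] = item
        for transform in transforms:
            current = transform(current)
            if current is None:  # a transform may filter an item out
                break
        if current is not None:
            yield current


chunk = [ToyItem('<p>See <a href="http://en.wikipedia.org/wiki/TREC">TREC</a>.</p>')]
results = list(run_pipeline(chunk, [strip_tags, label_wikipedia_links]))
```

Keeping each transform as a plain function that takes and returns one item makes the stages easy to reorder or test in isolation, which is the general shape the description above suggests.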
Read more at [streamcorpus.org](http://streamcorpus.org/)
Hashes for streamcorpus_pipeline-0.6.3.tar.gz (source distribution)
Algorithm | Hash digest
---|---
SHA256 | dfb4968fac2215200813696ae8ccd7723e3b7a3338219c981681a2d1db252cca
MD5 | 33a268486a133144ad5800cec9d06b66
BLAKE2b-256 | 663b049c450c3f7a8e9b0f647cc3af099372112c42cd53c736cd96c3a04eead7
Hashes for streamcorpus_pipeline-0.6.3-py2.7.egg (built distribution)
Algorithm | Hash digest
---|---
SHA256 | 55a21a2673eb8b4bd205d240ad47a4cffa5f8a898d35f450e7f26b563b1bd450
MD5 | ddbbe986855338528662e32b356ba25e
BLAKE2b-256 | e8b13537d10533b52f74d246964ae0c9a756125f44748e099a0695162502910b
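A downloaded file can be checked against the SHA256 digests listed above using only the Python standard library. The following is a minimal sketch; the helper names `sha256_of_file` and `matches_published_digest` are made up for this example, and the local path in the usage note is an assumption about where the file was saved.

```python
import hashlib


def sha256_of_file(path, block_size=65536):
    """Return the hex SHA256 digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's digest with a published hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, assuming the source distribution was saved in the current directory, `matches_published_digest("streamcorpus_pipeline-0.6.3.tar.gz", "dfb4968fac2215200813696ae8ccd7723e3b7a3338219c981681a2d1db252cca")` should return `True` for an intact download.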