Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transforms that generate clean_html and clean_visible, transforms that create labels from hyperlinks to particular sites (e.g. Wikipedia), and taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
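The idea of running transforms over a stream of items can be sketched in a few lines of Python. This is only an illustration of the pattern and does not use the streamcorpus_pipeline API: `Item`, `visible_text`, and `run_pipeline` are hypothetical names invented for this example, and the tag stripping shown here is far cruder than a real clean_visible transform.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class Item:
    """Hypothetical stand-in for a stream item; NOT streamcorpus.StreamItem."""
    raw_html: str
    clean_visible: Optional[str] = None


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def visible_text(item: Item) -> Item:
    """Hypothetical transform: fill clean_visible with tag-free text."""
    parser = _TextExtractor()
    parser.feed(item.raw_html)
    parser.close()
    # Join the text nodes and collapse runs of whitespace.
    item.clean_visible = " ".join(" ".join(parser.parts).split())
    return item


def run_pipeline(
    items: Iterable[Item],
    transforms: List[Callable[[Item], Item]],
) -> Iterator[Item]:
    """Apply each transform, in order, to every item in the stream."""
    for item in items:
        for transform in transforms:
            item = transform(item)
        yield item
```

Calling `list(run_pipeline([Item("<p>Hello <b>world</b></p>")], [visible_text]))` yields one item whose `clean_visible` field holds the text without markup. Keeping each transform as a plain function from item to item makes it easy to reorder stages or add a tagger-like stage later.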
Read more at [streamcorpus.org](http://streamcorpus.org/)
Hashes for streamcorpus_pipeline-0.5.37.tar.gz
Algorithm | Hash digest
---|---
SHA256 | be7a8a86e571c7999d6fbc760ecb91c2370964a1e26c35f9b373e8223cd4afd8
MD5 | 7ac500d8b3e2c0fb155bb42c42a1e8e7
BLAKE2b-256 | a93346b8879cdef9fde0c63e16064c9d2793940a8eeb55551ee725632ac99b99
Hashes for streamcorpus_pipeline-0.5.37-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | 1ab6b5b9e79d6906242be27fdc18345c54ddcd97c88147d94337eec07912a4e1
MD5 | 7330d65efa09eb3815549e424a2b64b3
BLAKE2b-256 | e79ac554c9af0feee2974be123f46935aa9d941149cf0c9935d6105c68f17b1b
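
A downloaded file can be checked against the published SHA256 digests with Python's standard `hashlib` module. The following is a minimal sketch: the helper names `sha256_of` and `matches_published_digest` are made up for this example, and the local file path is an assumption about where the download was saved.

```python
import hashlib

# SHA256 digest published above for streamcorpus_pipeline-0.5.37.tar.gz.
SDIST_SHA256 = "be7a8a86e571c7999d6fbc760ecb91c2370964a1e26c35f9b373e8223cd4afd8"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes.
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_digest(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_published_digest("streamcorpus_pipeline-0.5.37.tar.gz", SDIST_SHA256)` should return `True` for an intact copy of the source distribution saved under that name in the current directory.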