Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transforms that produce clean_html and clean_visible text, create labels from hyperlinks to particular sites (e.g. Wikipedia), and run taggers such as LingPipe, Serif, and Factorie, which generate Tokens and Sentences.
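As a rough illustration of the transform idea described above, the sketch below chains transform functions over a sequence of items and fills in a visible-text field from raw HTML. It does not use the streamcorpus_pipeline API: the `Item` class, `add_clean_visible`, and `run_transforms` are hypothetical stand-ins written with the Python standard library only, and the real clean_visible transform may behave differently.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class Item:
    """Hypothetical stand-in for a stream item (not streamcorpus.StreamItem)."""
    raw_html: str
    clean_visible: Optional[str] = None


class _VisibleText(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def add_clean_visible(item: Item) -> Item:
    """Hypothetical transform: derive whitespace-normalized visible text."""
    parser = _VisibleText()
    parser.feed(item.raw_html)
    parser.close()
    item.clean_visible = " ".join(" ".join(parser.parts).split())
    return item


def run_transforms(
    items: Iterable[Item],
    transforms: List[Callable[[Item], Item]],
) -> Iterator[Item]:
    """Apply each transform, in order, to every item."""
    for item in items:
        for transform in transforms:
            item = transform(item)
        yield item
```

Calling `list(run_transforms([Item("<p>Hello <b>world</b></p>")], [add_clean_visible]))` yields one item whose `clean_visible` field holds the text without markup. Modelling each stage as a function from item to item keeps stages independent, so further steps (labelling, tagging) could be appended to the list in the same way.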
Read more at [streamcorpus.org](http://streamcorpus.org/)
Hashes for streamcorpus_pipeline-0.7.2.tar.gz
Algorithm | Hash digest
---|---
SHA256 | ed52d6879b55b1904170d20e5e861155c57351fc7e66edc9589fb6ef2957735c
MD5 | 764eb51e9cf0d48bcb3da4210ffeebe0
BLAKE2b-256 | 832d1902fb4a5229a6c7b8ac1b7c09a5d986b357e5991a160ddfaaaab2d04094
Hashes for streamcorpus_pipeline-0.7.2-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | 697d35ae731b6dd229bd6a9e70ffbb4a922609fa0e26c5a69fb34e2fc90930fc
MD5 | cb63f77b1708157e811ea0518e134a95
BLAKE2b-256 | 2d9924a52d7f139d50011093a0bc29dc3fa5c723f1013f99de91a76740210a76
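The SHA256 digests listed above can be compared against a downloaded file before installing it. The following is a minimal sketch using only Python's standard `hashlib` module; the helper names `sha256_of_file` and `matches_published_digest` are illustrative and are not part of streamcorpus_pipeline.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_digest(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, assuming the source distribution was saved in the current directory, `matches_published_digest("streamcorpus_pipeline-0.7.2.tar.gz", "ed52d6879b55b1904170d20e5e861155c57351fc7e66edc9589fb6ef2957735c")` should return `True` for an unmodified download. Reading in blocks keeps memory use flat regardless of archive size.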