Tools for building streamcorpus objects, such as those used in TREC.
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transforms that generate clean_html and clean_visible text, transforms that create labels from hyperlinks to particular sites (e.g. Wikipedia), and taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
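The idea of running a sequence of transforms over items in a chunk can be sketched as follows. This is a minimal illustration using only the Python standard library, not the streamcorpus_pipeline API: `SketchItem`, `make_clean_visible`, `label_wikipedia_links`, and `run_pipeline` are hypothetical names invented for this example, and the tag stripping here is far simpler than what a real clean_visible transform would need to do.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class SketchItem:
    """Hypothetical stand-in for a stream item (NOT streamcorpus.StreamItem)."""
    raw: str
    clean_visible: Optional[str] = None
    labels: List[str] = field(default_factory=list)


class _TextAndLinks(HTMLParser):
    """Collects text nodes and <a href="..."> targets from an HTML string."""

    def __init__(self) -> None:
        super().__init__()
        self.text_parts: List[str] = []
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


def _parse(raw: str) -> _TextAndLinks:
    parser = _TextAndLinks()
    parser.feed(raw)
    parser.close()  # flush any buffered text
    return parser


def make_clean_visible(item: SketchItem) -> SketchItem:
    """Assumed transform: drop tags and collapse runs of whitespace."""
    text = "".join(_parse(item.raw).text_parts)
    item.clean_visible = " ".join(text.split())
    return item


def label_wikipedia_links(item: SketchItem) -> SketchItem:
    """Assumed transform: record hyperlinks that point at Wikipedia articles."""
    item.labels = [h for h in _parse(item.raw).hrefs if "wikipedia.org/wiki/" in h]
    return item


def run_pipeline(
    items: Iterable[SketchItem],
    transforms: List[Callable[[SketchItem], SketchItem]],
) -> Iterator[SketchItem]:
    """Apply each transform, in order, to every item in the chunk."""
    for item in items:
        for transform in transforms:
            item = transform(item)
        yield item


# A "chunk" is modelled here as a plain list of items.
chunk = [
    SketchItem(
        raw='<p>See <a href="https://en.wikipedia.org/wiki/TREC">TREC</a> '
            'and <a href="https://example.com/">this</a>.</p>'
    )
]
processed = list(run_pipeline(chunk, [make_clean_visible, label_wikipedia_links]))
```

Keeping each transform as a function from item to item means stages can be reordered or dropped by editing a list. A tagger stage would fit the same shape, taking the clean_visible text and attaching tokens and sentences to the item.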
Hashes for streamcorpus_pipeline-0.5.3.dev8.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 65b73c20bf7a0d29a938800e6a13ff357f21ec5a7427d57676551790d6d9f35a
MD5 | d7f2c2c6ccd957558d21ff29ddba3f53
BLAKE2b-256 | 9aa7054b189e7b594c0f64d8d86664b8505633b5d7d61ee7007496dc89ecf5f8
Hashes for streamcorpus_pipeline-0.5.3.dev8-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | 25869468afd9ba911461898e456a68d308f0a566e6c50036f424d9e718b060db
MD5 | 25add212f62dec6bbbce95446051d386
BLAKE2b-256 | 141c9407ef177fa21ffcb485f1e1087e8fca0b43b69c8d4b6ce32dcd1c4ddfb3
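A downloaded file can be checked against the published SHA256 digests listed above. The following is a small sketch using Python's standard `hashlib` module. The helper names `sha256_of_file` and `matches_published_sha256` are made up for this example, and the local path you pass in is whatever location you saved the download to.

```python
import hashlib
import hmac


def sha256_of_file(path, block_size=65536):
    """Return the hex SHA256 digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_sha256(path, published_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return hmac.compare_digest(sha256_of_file(path), published_hex.strip().lower())
```

For example, `matches_published_sha256("streamcorpus_pipeline-0.5.3.dev8.tar.gz", "65b73c20bf7a0d29a938800e6a13ff357f21ec5a7427d57676551790d6d9f35a")` should return `True` for an intact copy of the source distribution saved under that name in the current directory.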