Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transforms that generate clean_html and clean_visible, create labels from hyperlinks to particular sites (e.g. Wikipedia), and run taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
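To make the idea of chained transforms concrete, here is a minimal sketch of the pattern described above. It does not use the streamcorpus_pipeline API: the `Item` class, the transform functions, and `run_pipeline` are all hypothetical stand-ins written for illustration, and the "cleaning" steps are deliberately simplistic. It also targets Python 3, whereas the release listed on this page is built for Python 2.7.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class Item:
    """Hypothetical stand-in for a StreamItem; not the real streamcorpus class."""
    raw: str
    clean_html: Optional[str] = None
    clean_visible: Optional[str] = None
    labels: List[str] = field(default_factory=list)


class _TextAndLinks(HTMLParser):
    """Collects text nodes and <a href> targets while parsing."""

    def __init__(self) -> None:
        super().__init__()
        self.text: List[str] = []
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)

    def handle_data(self, data):
        self.text.append(data)


def make_clean_html(item: Item) -> Item:
    # Assumption: real cleaning would repair markup; here we only trim whitespace.
    item.clean_html = item.raw.strip()
    return item


def make_clean_visible(item: Item) -> Item:
    # Drop the tags and collapse runs of whitespace in the remaining text.
    parser = _TextAndLinks()
    parser.feed(item.clean_html or "")
    parser.close()
    item.clean_visible = " ".join("".join(parser.text).split())
    return item


def label_wikipedia_links(item: Item) -> Item:
    # Record every hyperlink that points at a Wikipedia host as a label.
    parser = _TextAndLinks()
    parser.feed(item.clean_html or "")
    parser.close()
    item.labels = [h for h in parser.hrefs if "wikipedia.org" in h]
    return item


def run_pipeline(items: Iterable[Item],
                 transforms: List[Callable[[Item], Item]]) -> Iterator[Item]:
    """Apply each transform, in order, to every item."""
    for item in items:
        for transform in transforms:
            item = transform(item)
        yield item


docs = [Item(raw=' <p>See <a href="http://en.wikipedia.org/wiki/TREC">TREC</a> now</p> ')]
stages = [make_clean_html, make_clean_visible, label_wikipedia_links]
for result in run_pipeline(docs, stages):
    print(result.clean_visible, result.labels)
```

Order matters in this sketch: the later stages read `clean_html`, so it has to be produced first. A tagger stage would slot into the same list as one more callable.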
Read more at [streamcorpus.org](http://streamcorpus.org/)
Hashes for streamcorpus_pipeline-0.5.42.dev20.tar.gz
Algorithm | Hash digest
---|---
SHA256 | ff89c089df695cc0eee42ec424e921a11c575a90c32ccc02d2741482e927b141
MD5 | 8ab4645432ac8668b4f715aa9d0ec215
BLAKE2b-256 | 22e079c5897586f04d99757525eaa21ed081a6b5ea37739d5aad70b374335b64
Hashes for streamcorpus_pipeline-0.5.42.dev20-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | 1e5590d9a44a5ff623d3d271bb005e3814adfca66aafc159c3b28318a3166bf9
MD5 | 623d22a56098dafa09f377b92135419b
BLAKE2b-256 | 869826fc7275bc130233802ccfe7737ef5293dab36b5c313cd2e94112ff2bd10
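A downloaded file can be checked against the SHA256 digests listed above. The following is a minimal sketch using only the standard library `hashlib` module; the helper names `sha256_of` and `matches`, and the local file path in the usage comment, are assumptions made for illustration.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops once read() returns empty bytes.
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare a file's digest with a published one, ignoring case and padding."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage, assuming the source distribution was saved to the current directory:
# matches("streamcorpus_pipeline-0.5.42.dev20.tar.gz",
#         "ff89c089df695cc0eee42ec424e921a11c575a90c32ccc02d2741482e927b141")
```

Reading in blocks keeps memory use flat regardless of file size. SHA256 or BLAKE2b-256 is the better choice for this check, since MD5 is no longer considered collision resistant.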