Tools for building streamcorpus objects, such as those used in TREC.
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transforms that generate clean_html and clean_visible text, transforms that create labels from hyperlinks to particular sites (e.g. Wikipedia), and taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
Read more at [streamcorpus.org](http://streamcorpus.org/)
Hashes for streamcorpus_pipeline-0.7.5.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 1d45b8e09c1349a324a468d0c5a696b615c23f86f1d8fbfd0da62ec9110432ef
MD5 | 91ffb754d418978d50d80487ff6e855e
BLAKE2b-256 | 91a13f036451656b27037057a02b9d55655ddce5e8dec292ad38b6f5d818806e
Hashes for streamcorpus_pipeline-0.7.5-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | 845a2a91ce298175b9e62c7f904276a1aa9f0f00981b9edf50cc8c7b4a84c7e5
MD5 | 007bb20ba3e1e8080498a61142ddc7f6
BLAKE2b-256 | 2115bba522db42c1f3e7e397ae0009a14091cbb930855eb4c06dd362b198cc90
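The digests listed above can be checked against a downloaded file before installing it. The following is a minimal sketch using only Python's standard `hashlib` module, not part of streamcorpus_pipeline itself. The helper names `file_digests` and `verify`, the `EXPECTED_SDIST` constant, and the local file path in the usage comment are assumptions made for illustration. BLAKE2b-256 is taken here to mean BLAKE2b with a 32-byte digest.

```python
import hashlib

# Digests copied from the table for streamcorpus_pipeline-0.7.5.tar.gz.
EXPECTED_SDIST = {
    "SHA256": "1d45b8e09c1349a324a468d0c5a696b615c23f86f1d8fbfd0da62ec9110432ef",
    "MD5": "91ffb754d418978d50d80487ff6e855e",
    "BLAKE2b-256": "91a13f036451656b27037057a02b9d55655ddce5e8dec292ad38b6f5d818806e",
}


def file_digests(path, chunk_size=1 << 20):
    """Return hex digests of the file at `path`, keyed by algorithm name."""
    hashers = {
        "SHA256": hashlib.sha256(),
        "MD5": hashlib.md5(),
        # Assumption: BLAKE2b-256 is BLAKE2b truncated to a 32-byte digest.
        "BLAKE2b-256": hashlib.blake2b(digest_size=32),
    }
    with open(path, "rb") as fh:
        # Read in chunks so large archives are never held fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def verify(path, expected):
    """Map each algorithm in `expected` to True if the file's digest matches."""
    actual = file_digests(path)
    return {
        name: actual[name] == digest.strip().lower()
        for name, digest in expected.items()
    }


# Usage (hypothetical local path to the downloaded archive):
#   results = verify("streamcorpus_pipeline-0.7.5.tar.gz", EXPECTED_SDIST)
#   if not all(results.values()):
#       raise SystemExit("hash mismatch: %r" % results)
```

Matching SHA256 alone is usually sufficient; MD5 is included only because the page lists it, and it should not be relied on for security on its own.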