Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transforms that generate clean_html and clean_visible, transforms that create labels from hyperlinks to particular sites (e.g. Wikipedia), and taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
Read more at [streamcorpus.org](http://streamcorpus.org/)
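
The idea of chaining transforms over the items in a chunk can be illustrated with a small, self-contained sketch. It does not use the streamcorpus or streamcorpus_pipeline APIs: the `Item` class, the three transform functions, and `run_pipeline` are hypothetical stand-ins written for illustration, and the regex-based HTML handling is a toy simplification rather than how the real transforms work.

```python
import html
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class Item:
    """Hypothetical stand-in for a stream item (NOT streamcorpus.StreamItem)."""
    raw: str
    clean_html: Optional[str] = None
    clean_visible: Optional[str] = None
    labels: List[str] = field(default_factory=list)


# Toy patterns; real HTML cleaning needs a proper parser.
_SCRIPT_OR_STYLE = re.compile(r"<(script|style)\b.*?</\1>", re.IGNORECASE | re.DOTALL)
_TAG = re.compile(r"<[^>]+>")
_WIKI_HREF = re.compile(r'href="(https?://[^"]*wikipedia\.org/[^"]*)"', re.IGNORECASE)


def to_clean_html(item: Item) -> Item:
    # Drop script and style blocks from the raw markup.
    item.clean_html = _SCRIPT_OR_STYLE.sub("", item.raw)
    return item


def to_clean_visible(item: Item) -> Item:
    # Strip tags, decode entities, and collapse runs of whitespace.
    source = item.clean_html if item.clean_html is not None else item.raw
    text = html.unescape(_TAG.sub(" ", source))
    item.clean_visible = " ".join(text.split())
    return item


def label_wikipedia_links(item: Item) -> Item:
    # Record every hyperlink target that points at Wikipedia as a label.
    source = item.clean_html if item.clean_html is not None else item.raw
    item.labels.extend(_WIKI_HREF.findall(source))
    return item


def run_pipeline(
    chunk: Iterable[Item], transforms: List[Callable[[Item], Item]]
) -> Iterator[Item]:
    # Apply each transform in order to every item in the chunk.
    for item in chunk:
        for transform in transforms:
            item = transform(item)
        yield item


if __name__ == "__main__":
    chunk = [Item(raw='<p>See <a href="http://en.wikipedia.org/wiki/TREC">TREC</a>.</p>')]
    for out in run_pipeline(chunk, [to_clean_html, to_clean_visible, label_wikipedia_links]):
        print(out.clean_visible, out.labels)
```

Each transform takes an item and returns it with one more field filled in, so the order of the list passed to `run_pipeline` determines which fields later stages can rely on.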
Hashes for streamcorpus_pipeline-0.5.45.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 468f6489e90515af6ee3de6a1576cc873be7413e1f4becf677815d6de15b41b5
MD5 | 57098ab1ac4cdb225240e5a96c67b7a1
BLAKE2b-256 | 9418c3b5c403411505a3dcfbc00a2a64633427af887e389189c2536328718a6a

Hashes for streamcorpus_pipeline-0.5.45-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | bf6cb87be1d4eecc737af34ab100de6fa3610932e78f33d11dd6fce24f9d84f3
MD5 | 490a71ae5fac4c33351affb5d97918c3
BLAKE2b-256 | 9edc0f4953f6cae07c944d0b6f5d6934bdebc6a748315edc0f2db931cbfbb6ba
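
A downloaded file can be checked against the SHA256 digests listed above using Python's standard `hashlib` module. The following is a minimal sketch: the helper names `file_sha256` and `verify_download` are illustrative, and the local path in the usage section assumes the source distribution was saved to the current directory under its published filename.

```python
import hashlib
import os


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_download(path, expected_hex):
    """Return True if the file's SHA256 matches the published digest."""
    return file_sha256(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Assumed local path; the digest is the SHA256 listed for the .tar.gz above.
    path = "streamcorpus_pipeline-0.5.45.tar.gz"
    expected = "468f6489e90515af6ee3de6a1576cc873be7413e1f4becf677815d6de15b41b5"
    if os.path.exists(path):
        print("digest matches" if verify_download(path, expected) else "digest MISMATCH")
    else:
        print("file not found:", path)
```

SHA256 is used here because MD5 is no longer considered collision resistant; the BLAKE2b-256 value could be checked the same way with `hashlib.blake2b(digest_size=32)`.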