Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transform functions that generate clean_html and clean_visible, transforms that create labels from hyperlinks to particular sites (e.g. Wikipedia), and taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
Read more at [streamcorpus.org](http://streamcorpus.org/)
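To make the idea of chained transforms concrete, here is a minimal sketch in plain Python. It does not use the streamcorpus_pipeline API: `Item`, `label_wikipedia_links`, `strip_tags`, and `run_pipeline` are hypothetical stand-ins invented for illustration, and the tag stripping is a deliberately naive regular expression rather than the package's own clean_visible logic.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class Item:
    """Hypothetical stand-in for a stream item (not streamcorpus.StreamItem)."""
    raw_html: str
    clean_visible: Optional[str] = None
    labels: List[str] = field(default_factory=list)


Transform = Callable[[Item], Item]

TAG_RE = re.compile(r"<[^>]*>")
WIKI_HREF_RE = re.compile(r'href="(https?://[^"]*wikipedia\.org/[^"]*)"')


def label_wikipedia_links(item: Item) -> Item:
    # Record every hyperlink target that points at a Wikipedia host.
    item.labels = WIKI_HREF_RE.findall(item.raw_html)
    return item


def strip_tags(item: Item) -> Item:
    # Naive visible-text extraction: drop tags, then collapse whitespace.
    text = TAG_RE.sub(" ", item.raw_html)
    item.clean_visible = " ".join(text.split())
    return item


def run_pipeline(items: Iterable[Item], transforms: List[Transform]) -> Iterator[Item]:
    # Apply each transform in order to every item, yielding the result.
    for item in items:
        for transform in transforms:
            item = transform(item)
        yield item


if __name__ == "__main__":
    doc = Item('<p>See <a href="http://en.wikipedia.org/wiki/TREC">TREC</a> docs.</p>')
    for result in run_pipeline([doc], [label_wikipedia_links, strip_tags]):
        print(result.labels, result.clean_visible)
```

Writing each stage as a function from item to item keeps the stages independent, so a tagger stage could be appended to the list without touching the others.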
Download files
Download the file for your platform.
Source Distribution
Hashes for streamcorpus_pipeline-0.5.52.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 6072ffb267d85b99eea1981f609ce3be2e8523ad98d2a87c2fb5cbe7b7f0f1db
MD5 | e04987364475c8b96d97ad53667bc3d0
BLAKE2b-256 | f88961fcd2d577b944f54f661c54daf6cfe9d59526aa6aca9916d41232609453
Built Distribution
Hashes for streamcorpus_pipeline-0.5.52-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | 16b4114b2df3b3e1c25d58568c9bef80c800b157861f93c432501e70d7c2b0e0
MD5 | d44f340f36aa21fd3798223a42d57108
BLAKE2b-256 | 95482bdb1ef6b0e239ecde648c633b67bd9047d4b6b97fb5bdd8e4ef0b15b6e0
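The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch using only Python's standard `hashlib` module; the helper names `sha256_of_file` and `matches_published_digest` are made up for this example, and the local file path is assumed to be wherever the download was saved.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 with a published hex digest, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For the source distribution, one would call `matches_published_digest("streamcorpus_pipeline-0.5.52.tar.gz", "6072ffb267d85b99eea1981f609ce3be2e8523ad98d2a87c2fb5cbe7b7f0f1db")` and treat a `False` result as a sign that the download is corrupt or not the listed file.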