Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transform functions that produce clean_html and clean_visible text and that create labels from hyperlinks to particular sites (e.g. Wikipedia), as well as taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
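The idea of chaining transforms over the items in a chunk can be illustrated with a small, self-contained sketch. It does not use the streamcorpus_pipeline API: the `Item` class, the three transform functions, and `run_pipeline` are hypothetical stand-ins written for this example, and the regex-based HTML cleanup is a toy substitute for whatever the real transforms do.

```python
import html
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class Item:
    """Hypothetical stand-in for a StreamItem; not the real streamcorpus class."""
    raw: str
    clean_html: Optional[str] = None
    clean_visible: Optional[str] = None
    labels: List[str] = field(default_factory=list)


Transform = Callable[[Item], Item]


def make_clean_html(item: Item) -> Item:
    # Toy cleanup: drop <script> and <style> elements, including their contents.
    item.clean_html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", item.raw)
    return item


def make_clean_visible(item: Item) -> Item:
    # Strip the remaining tags, decode entities, and collapse whitespace.
    text = re.sub(r"<[^>]+>", " ", item.clean_html or "")
    item.clean_visible = " ".join(html.unescape(text).split())
    return item


def label_wikipedia_links(item: Item) -> Item:
    # Record the target of every hyperlink that points at a Wikipedia article.
    for url in re.findall(r'href="([^"]+)"', item.clean_html or ""):
        if "wikipedia.org/wiki/" in url:
            item.labels.append(url)
    return item


def run_pipeline(items: Iterable[Item], transforms: List[Transform]) -> Iterator[Item]:
    # Apply each transform, in order, to every item of the in-memory "chunk".
    for item in items:
        for transform in transforms:
            item = transform(item)
        yield item


# A one-item "chunk" (a plain list here, assumed for illustration only).
chunk = [Item(raw='<p>See <a href="https://en.wikipedia.org/wiki/TREC">TREC</a> &amp; more.</p>'
                  '<script>var x = 1;</script>')]
stages = [make_clean_html, make_clean_visible, label_wikipedia_links]
results = list(run_pipeline(chunk, stages))
```

Order matters in this sketch: `make_clean_visible` and `label_wikipedia_links` both read the `clean_html` field, so `make_clean_html` has to run first.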
Hashes for streamcorpus_pipeline-0.5.6.dev9.tar.gz
Algorithm | Hash digest
---|---
SHA256 | f7b4bd2bee736b38f96d31c33e984cb92ffd3f1515bcf83d720aa5fc8de8a4fe
MD5 | bc51ccace815362c3cfc8e25ba1e427b
BLAKE2b-256 | 9c2bd4859bcd6e1d7051730801c87aef388c17511dad15b2acb04caca3d60e4d
Hashes for streamcorpus_pipeline-0.5.6.dev9-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | e666dfffa63e98dd8459122de00f4fb6280006ca8a8c8d04037f1352cf8a416d
MD5 | 9456e4ec03c7925fbcedcd3e534c5871
BLAKE2b-256 | 9a695ef40fa3474834e8ca77d62797971bd9eaca7ba130bc9459e22132df10fd