Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transforms that generate clean_html and clean_visible text, transforms that create labels from hyperlinks to particular sites (e.g. Wikipedia), and taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
Download files
Source Distribution
streamcorpus_pipeline-0.5.7.tar.gz (455.4 kB)
Built Distribution
streamcorpus_pipeline-0.5.7-py2.7.egg
Hashes for streamcorpus_pipeline-0.5.7.tar.gz
Algorithm | Hash digest
---|---
SHA256 | f1dda8e8c35ce631f021675e9eb9082b3c743879205852f07ed77b95010c6828
MD5 | 4986be743b46b4d149e7624d6d6e3db7
BLAKE2b-256 | 592db564e69d43db6947806def315ca9f1fa39beb91744b55337f40f81ce0249
Hashes for streamcorpus_pipeline-0.5.7-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | f31a4619b548e1dcd7f970ba4c9cbdc30e75dd01768348bd22d302ba1d5dc127
MD5 | 283cf7f0e577cbd49b72a64e6b96b510
BLAKE2b-256 | 98ece7378c989d56516242321c618b77312002c8089ef67ccaf2b0e31a2a2722
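
The digests listed for each file can be used to check a download before installing it. The following is a minimal sketch using only Python's standard hashlib module. It is not part of streamcorpus_pipeline: the helper names `file_digest` and `verify_download` are hypothetical, the local file path in the commented usage is an assumption, and the mapping of "BLAKE2b-256" to BLAKE2b with a 32-byte digest is assumed rather than stated on this page.

```python
import hashlib
import hmac

# SHA256 digest copied from the hash listing for streamcorpus_pipeline-0.5.7.tar.gz.
EXPECTED_SHA256 = "f1dda8e8c35ce631f021675e9eb9082b3c743879205852f07ed77b95010c6828"


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of the file at `path`, reading it in chunks.

    `algorithm` is any name accepted by hashlib.new (e.g. "sha256", "md5"),
    or the hypothetical label "blake2b-256", assumed here to mean BLAKE2b
    with a 32-byte (256-bit) digest.
    """
    if algorithm == "blake2b-256":
        hasher = hashlib.blake2b(digest_size=32)
    else:
        hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for block in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(block)
    return hasher.hexdigest()


def verify_download(path, expected_hex, algorithm="sha256"):
    """Return True if the file's digest equals `expected_hex` (case-insensitive)."""
    actual = file_digest(path, algorithm)
    return hmac.compare_digest(actual, expected_hex.strip().lower())


# Example usage, assuming the archive was saved to the current directory:
# ok = verify_download("streamcorpus_pipeline-0.5.7.tar.gz", EXPECTED_SHA256)
# print("digest matches" if ok else "digest mismatch, do not install")
```

Reading in fixed-size chunks keeps memory use constant regardless of archive size. SHA256 is the digest to prefer for verification; MD5 is listed for completeness but is not collision-resistant.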