Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transform functions that generate clean_html and clean_visible, create labels from hyperlinks to particular sites (e.g. Wikipedia), and run taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
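The idea of passing each item through an ordered series of transforms can be sketched in a few lines of plain Python. This is only an illustration of the pattern, not the streamcorpus_pipeline API: the dict-based items and the names `strip_tags`, `tokenize`, and `run_pipeline` are hypothetical stand-ins, and the real package operates on streamcorpus.StreamItem objects with its own transform stages.

```python
import re


def strip_tags(item):
    """Toy stand-in for a clean_visible-style step: replace markup with spaces."""
    out = dict(item)  # copy so the input item is left unmodified
    out["clean_visible"] = re.sub(r"<[^>]+>", " ", item["clean_html"])
    return out


def tokenize(item):
    """Toy stand-in for a tagger: split the visible text on whitespace."""
    out = dict(item)
    out["tokens"] = out["clean_visible"].split()
    return out


def run_pipeline(items, transforms):
    """Apply each transform, in order, to every item and yield the result."""
    for item in items:
        for transform in transforms:
            item = transform(item)
        yield item


# Hypothetical input: one item carrying already-cleaned HTML.
items = [{"clean_html": "<p>Hello <b>world</b></p>"}]
results = list(run_pipeline(items, [strip_tags, tokenize]))
```

Each step returns a new item rather than mutating its input, so stages can be reordered or dropped without affecting the others.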
Download files
Source Distribution
streamcorpus_pipeline-0.5.1.tar.gz (447.7 kB)
Built Distribution
streamcorpus_pipeline-0.5.1-py2.7.egg (806.5 kB)
Hashes for streamcorpus_pipeline-0.5.1.tar.gz
Algorithm | Hash digest
---|---
SHA256 | bba0a1cc641229c237f6f752e0d317e3cb82109e7e8acdc1cb59ed353625e895
MD5 | 2da0dbff3d5f1d284e2b4abbaadc3f87
BLAKE2b-256 | 14d53c6d52a3892dd966531bc708d4d261e1b8c9fb42b72dde14f30cf60c26f0
Hashes for streamcorpus_pipeline-0.5.1-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | aabf42beae3f5f31b0241f300ece7a95e3687d3201674e2340d4c2d61ef510b6
MD5 | f516d9875a2c72ad9fc930d31931f440
BLAKE2b-256 | f1be9bfffa90b353380a6a96386f339ecf8023b72249dc2f4576d30808343acd
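A downloaded file can be checked against the SHA256 digests listed above using only the Python standard library. The following is a minimal sketch: the helper names `sha256_of_file` and `matches_published_digest` are illustrative, and the local file path is whatever location you saved the download to.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_digest("streamcorpus_pipeline-0.5.1.tar.gz", "bba0a1cc641229c237f6f752e0d317e3cb82109e7e8acdc1cb59ed353625e895")` should return `True` for an intact copy of the source distribution, assuming the file sits in the current directory.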