Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transforms that generate clean_html and clean_visible, transforms that create labels from hyperlinks to particular sites (e.g. Wikipedia), and taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
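To make the idea of chained transforms concrete, here is a minimal sketch in plain Python (3.7+). It does not use the streamcorpus or streamcorpus_pipeline APIs: `ToyItem`, `make_clean_html`, `make_clean_visible`, and `run_pipeline` are hypothetical names invented for illustration, and the regex-based cleanup is a deliberately crude stand-in for whatever the real transforms do.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class ToyItem:
    """Hypothetical stand-in for a stream item; NOT streamcorpus.StreamItem."""
    raw: str
    clean_html: Optional[str] = None
    clean_visible: Optional[str] = None


def make_clean_html(item: ToyItem) -> ToyItem:
    # Toy cleanup (assumption): drop <script> blocks from the raw markup.
    item.clean_html = re.sub(r"(?is)<script.*?</script>", "", item.raw)
    return item


def make_clean_visible(item: ToyItem) -> ToyItem:
    # Toy cleanup (assumption): strip tags, then collapse runs of whitespace.
    text = re.sub(r"<[^>]+>", " ", item.clean_html or "")
    item.clean_visible = " ".join(text.split())
    return item


def run_pipeline(
    items: Iterable[ToyItem],
    transforms: List[Callable[[ToyItem], ToyItem]],
) -> Iterator[ToyItem]:
    # Apply each transform in order to every item, yielding the result.
    for item in items:
        for transform in transforms:
            item = transform(item)
        yield item


if __name__ == "__main__":
    items = [ToyItem(raw="<p>Hello <b>world</b></p><script>x=1</script>")]
    for out in run_pipeline(items, [make_clean_html, make_clean_visible]):
        print(out.clean_visible)
```

Each transform takes an item and returns it with one more field filled in, so later stages (for example a tagger) can rely on the output of earlier ones. The ordering of the list passed to `run_pipeline` is what encodes that dependency in this sketch.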
Read more at [streamcorpus.org](http://streamcorpus.org/)
Hashes for streamcorpus_pipeline-0.6.4.dev5.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 2e1c2a73ec218d5ef3e0128ef37bbcc596915a624bd6fc18aeac9a429ae12899
MD5 | db04bd95afd070537f257c4cc59e002c
BLAKE2b-256 | 00971aaa4e05af14ffef4596adf79b3aeefe16f77b3eb2f395f93591e059ce1e
Hashes for streamcorpus_pipeline-0.6.4.dev5-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | d2ea1ad4d36e9fdabd0de414cb351b5b4df6f1b689b6f784ce71e49bc9323e92
MD5 | 0b3d5f83c26a80b3755172a71c7fafce
BLAKE2b-256 | c4e1b286dfba81598f6bdbee746e554445709644662c52b4b522a57d2b3b5ded
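
The published digests can be checked against a downloaded file with Python's standard `hashlib` module. The sketch below assumes Python 3.6+ and that the file has already been downloaded to a local path; `file_digests` and `matches_published` are illustrative helper names, not part of streamcorpus_pipeline. BLAKE2b-256 is computed as BLAKE2b with a 32-byte digest.

```python
import hashlib

# Digests published for the source distribution listed on this page.
SDIST_DIGESTS = {
    "SHA256": "2e1c2a73ec218d5ef3e0128ef37bbcc596915a624bd6fc18aeac9a429ae12899",
    "MD5": "db04bd95afd070537f257c4cc59e002c",
    "BLAKE2b-256": "00971aaa4e05af14ffef4596adf79b3aeefe16f77b3eb2f395f93591e059ce1e",
}


def file_digests(path, chunk_size=65536):
    """Return hex digests of the file at `path` for the three listed algorithms."""
    hashers = {
        "SHA256": hashlib.sha256(),
        "MD5": hashlib.md5(),
        "BLAKE2b-256": hashlib.blake2b(digest_size=32),
    }
    with open(path, "rb") as f:
        # Read in fixed-size blocks so large archives are not loaded at once.
        for block in iter(lambda: f.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(block)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def matches_published(path, published):
    """True if every digest in `published` matches the file at `path`."""
    actual = file_digests(path)
    return all(actual[name] == digest.lower() for name, digest in published.items())
```

Calling `matches_published("streamcorpus_pipeline-0.6.4.dev5.tar.gz", SDIST_DIGESTS)` on a local copy of the archive returns `True` only when all three digests agree. In practice SHA256 alone is usually sufficient; MD5 is included here only because the page lists it.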