Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transform functions that generate clean_html and clean_visible, create labels from hyperlinks to particular sites (e.g. Wikipedia), and run taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
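The idea of chaining transforms over items can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `Item`, `make_clean_visible`, `label_wikipedia_links`, and `run_pipeline` are hypothetical stand-ins, not the streamcorpus_pipeline API, and the regex-based tag stripping is a simplification of what a real clean_visible transform would need to do.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List


@dataclass
class Item:
    """Hypothetical stand-in for a stream item (not streamcorpus.StreamItem)."""
    raw_html: str
    clean_visible: str = ""
    labels: List[str] = field(default_factory=list)


# A transform takes an item and returns the (possibly modified) item.
Transform = Callable[[Item], Item]

TAG_RE = re.compile(r"<[^>]+>")
WIKI_LINK_RE = re.compile(r'href="(https?://en\.wikipedia\.org/wiki/[^"]+)"')


def make_clean_visible(item: Item) -> Item:
    """Replace tags with spaces, then collapse runs of whitespace."""
    item.clean_visible = " ".join(TAG_RE.sub(" ", item.raw_html).split())
    return item


def label_wikipedia_links(item: Item) -> Item:
    """Record every Wikipedia hyperlink target found in the raw HTML."""
    item.labels.extend(WIKI_LINK_RE.findall(item.raw_html))
    return item


def run_pipeline(items: Iterable[Item], transforms: List[Transform]) -> Iterator[Item]:
    """Apply each transform, in order, to every item."""
    for item in items:
        for transform in transforms:
            item = transform(item)
        yield item
```

A usage example under the same assumptions: `list(run_pipeline([Item('<p>See <a href="https://en.wikipedia.org/wiki/TREC">TREC</a> here.</p>')], [make_clean_visible, label_wikipedia_links]))` yields one item whose `clean_visible` holds the tag-free text and whose `labels` holds the linked Wikipedia URL.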
Read more at [streamcorpus.org](http://streamcorpus.org/)
Download files
Source Distribution
Hashes for streamcorpus_pipeline-0.7.5.dev1.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 8f5f3a0e9867c369d218da48e53c2513efd000432fc6b9d46d2faf00e8e31897
MD5 | f790257c54d9ab95d92670f494bb892b
BLAKE2b-256 | 4d026e3bc15c58d98ed331ae894ebdd6b4b30711cb7a1d44ed9cb22dd3e241bb

Built Distribution
Hashes for streamcorpus_pipeline-0.7.5.dev1-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | d8f3f5faed26dfe288680ddee055b450cdd44e6c8c648feb7905b94c8141cf12
MD5 | 108fe697f7d687f61970239a659eb4f5
BLAKE2b-256 | ee40a6da656716eb8b1de44dfd9362c8f2deb8a23c60de1fa5580bc7a425789a
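
A downloaded file can be checked against the SHA256 digests listed above. The following is a minimal sketch using only Python's standard `hashlib` module; the helper names `sha256_of_file` and `matches_published_digest` are hypothetical, and the local file path is whatever location the archive was saved to.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in fixed-size blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_digest("streamcorpus_pipeline-0.7.5.dev1.tar.gz", "8f5f3a0e9867c369d218da48e53c2513efd000432fc6b9d46d2faf00e8e31897")` would return `True` only if the local copy is byte-for-byte identical to the published source distribution.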