Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transforms that produce clean_html and clean_visible text, transforms that create labels from hyperlinks to particular sites (e.g. Wikipedia), and taggers such as LingPipe, Serif, and Factorie, which generate Tokens and Sentences.
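To make the hyperlink-labeling idea concrete, here is a minimal standard-library sketch of how labels might be derived from links that point at one site. It does not use the streamcorpus_pipeline API: the names `LinkLabelExtractor` and `labels_from_links` are hypothetical, the (anchor text, URL) output shape is an assumption, and the code targets Python 3 rather than the Python 2.7 the package was built for.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkLabelExtractor(HTMLParser):
    """Collect (anchor_text, href) pairs for links to a target domain.

    Hypothetical helper for illustration; not part of streamcorpus_pipeline.
    """

    def __init__(self, target_domain):
        super().__init__()
        self.target_domain = target_domain.lower()
        self.labels = []
        self._href = None   # href of the matching <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        host = urlparse(href).netloc.lower()
        # Accept the domain itself and any subdomain (e.g. en.wikipedia.org).
        if host == self.target_domain or host.endswith("." + self.target_domain):
            self._href = href
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.labels.append(("".join(self._text).strip(), self._href))
            self._href = None


def labels_from_links(html, target_domain="wikipedia.org"):
    """Return a list of (anchor_text, href) for links to target_domain."""
    parser = LinkLabelExtractor(target_domain)
    parser.feed(html)
    parser.close()
    return parser.labels
```

Calling `labels_from_links(page_html)` on a page returns one pair per matching link. In the real pipeline such labels would be attached to a StreamItem rather than returned as plain tuples.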
Read more at [streamcorpus.org](http://streamcorpus.org/)
Hashes for streamcorpus_pipeline-0.5.42.dev29.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 10021b6eb8952de3fbe31a3415a77659ced463484c830f5225fa874c813abf2d
MD5 | b62b64619d04e8345e1d4107185e2d5c
BLAKE2b-256 | 4ab0c273793cba658a87ef90460e423f1f772047b1fc7dbba05f704910d86c9a
Hashes for streamcorpus_pipeline-0.5.42.dev29-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | 4afb22c5d3c7d05a2f2d39125a81f74ffbff3d6303d0ce201705c2f36827aff7
MD5 | 17080a0f2dbf3f431b9eb394bb6fcd7d
BLAKE2b-256 | 446ab4cdafb29bf5ab44311dab58a977fb35db27b66aa47be1427373f083aa23
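
A downloaded file can be checked against the SHA256 digests listed above. The following is a small sketch using only Python's standard `hashlib` module. The function name `file_matches_sha256` and the local file path are assumptions for illustration.

```python
import hashlib


def file_matches_sha256(path, expected_hex, chunk_size=65536):
    """Return True if the file at `path` has the given SHA256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in fixed-size blocks so large archives are not loaded at once.
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest() == expected_hex.strip().lower()
```

For example, if the source distribution was saved in the current directory, `file_matches_sha256("streamcorpus_pipeline-0.5.42.dev29.tar.gz", "10021b6eb8952de3fbe31a3415a77659ced463484c830f5225fa874c813abf2d")` should return `True` for an intact download.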