Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transforms that generate clean_html and clean_visible, transforms that create labels from hyperlinks to particular sites (e.g. Wikipedia), and taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
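The idea of chaining transforms over a stream of items can be sketched in plain Python. This is a toy illustration only and does not use the streamcorpus_pipeline API: `ToyItem`, `make_clean_html`, `make_clean_visible`, `label_wikipedia_links`, and `run_pipeline` are hypothetical names invented for this example, and the regex-based cleaning is far cruder than what a real HTML cleaner would do.

```python
import re
from dataclasses import dataclass, field
from html import unescape
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class ToyItem:
    """Hypothetical stand-in for a StreamItem; not the real streamcorpus class."""
    raw: str
    clean_html: Optional[str] = None
    clean_visible: Optional[str] = None
    labels: List[str] = field(default_factory=list)


def make_clean_html(item: ToyItem) -> ToyItem:
    # Toy cleaning step: drop <script> and <style> blocks from the raw markup.
    item.clean_html = re.sub(r"(?is)<(script|style)\b.*?</\1>", "", item.raw)
    return item


def make_clean_visible(item: ToyItem) -> ToyItem:
    # Toy step: replace tags with spaces, decode entities, collapse whitespace.
    text = re.sub(r"(?s)<[^>]+>", " ", item.clean_html or "")
    item.clean_visible = " ".join(unescape(text).split())
    return item


def label_wikipedia_links(item: ToyItem) -> ToyItem:
    # Toy labeling step: record every hyperlink that points at Wikipedia.
    item.labels = re.findall(
        r'href="(https?://[a-z]+\.wikipedia\.org/[^"]+)"', item.clean_html or ""
    )
    return item


def run_pipeline(
    items: Iterable[ToyItem], transforms: List[Callable[[ToyItem], ToyItem]]
) -> Iterator[ToyItem]:
    # Apply each transform in order to every item, yielding items lazily.
    for item in items:
        for transform in transforms:
            item = transform(item)
        yield item


raw = (
    '<html><script>var x=1;</script><p>See '
    '<a href="http://en.wikipedia.org/wiki/TREC">TREC</a> &amp; more</p></html>'
)
stages = [make_clean_html, make_clean_visible, label_wikipedia_links]
for result in run_pipeline([ToyItem(raw=raw)], stages):
    print(result.clean_visible)  # See TREC & more
    print(result.labels)         # ['http://en.wikipedia.org/wiki/TREC']
```

Order matters in this sketch: the later stages read `clean_html`, so the cleaning stage has to run first.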
Read more at [streamcorpus.org](http://streamcorpus.org/)
Hashes for streamcorpus_pipeline-0.5.32.dev16.tar.gz (source distribution)
Algorithm | Hash digest
---|---
SHA256 | 4ebae2935557a4d4ffca1ebe6b8873bd0ed001488596ff74b5d4a55dcf479e94
MD5 | 09b0b4968fdf265b20c33293a50916d5
BLAKE2b-256 | 236afa1845aea61e242b0dc69aebf640529fc18bcbe91a54c86f681794660958
Hashes for streamcorpus_pipeline-0.5.32.dev16-py2.7.egg (built distribution)
Algorithm | Hash digest
---|---
SHA256 | 389dce2b105d182f62fb37cb0227a9cfd2be248cf1775a3da2221c2e85a6cd95
MD5 | 2c80cf81b7833137ca1d5a540df85eb4
BLAKE2b-256 | 8f1a67c16f6fed9fd3e16e14bf00f6fd9498b48acfae6f44101a183c9eeaebc0
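A downloaded file can be checked against the digests listed above with Python's standard `hashlib` module. The following is a minimal sketch: the helper names `file_digest` and `matches_published` are made up for this example, and it assumes that the "BLAKE2b-256" row refers to BLAKE2b with a 32-byte digest.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of the file at `path`, read in chunks."""
    if algorithm == "blake2b-256":
        # Assumption: "BLAKE2b-256" means BLAKE2b with a 32-byte (256-bit) digest.
        hasher = hashlib.blake2b(digest_size=32)
    else:
        # Handles names such as "sha256" and "md5".
        hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read fixed-size blocks so large archives need not fit in memory.
        for block in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(block)
    return hasher.hexdigest()


def matches_published(path, expected_hex, algorithm="sha256"):
    """Compare a file's digest with a published hex digest, ignoring case."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

For example, `matches_published("streamcorpus_pipeline-0.5.32.dev16.tar.gz", "4ebae2935557a4d4ffca1ebe6b8873bd0ed001488596ff74b5d4a55dcf479e94")` should return `True` for an intact copy of the source distribution. SHA256 is the better check of the three for integrity purposes, since MD5 is no longer considered collision resistant.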