Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transforms that generate clean_html and clean_visible, transforms that create labels from hyperlinks to particular sites (e.g. Wikipedia), and taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
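The idea of chaining transforms over the items in a chunk can be illustrated with a small, self-contained sketch. It does not use the streamcorpus_pipeline API: `ToyItem`, `make_clean_visible`, `split_sentences`, and `run_pipeline` are all hypothetical names invented for this example, and the regex-based tag stripping and sentence splitting are simplifications standing in for the real transforms and taggers.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class ToyItem:
    """Hypothetical stand-in for a stream item (NOT streamcorpus.StreamItem)."""
    raw: str
    clean_visible: Optional[str] = None
    sentences: Optional[List[str]] = None


# A transform takes an item and returns it, possibly with new fields filled in.
Transform = Callable[[ToyItem], ToyItem]


def make_clean_visible(item: ToyItem) -> ToyItem:
    # Assumption for illustration: replace each tag with spaces of equal
    # length, so character offsets in clean_visible line up with raw.
    item.clean_visible = re.sub(
        r"<[^>]*>", lambda m: " " * len(m.group(0)), item.raw
    )
    return item


def split_sentences(item: ToyItem) -> ToyItem:
    # Naive splitter standing in for a real tagger: break after ., !, or ?
    text = item.clean_visible or ""
    parts = re.split(r"(?<=[.!?])\s+", text)
    item.sentences = [p.strip() for p in parts if p.strip()]
    return item


def run_pipeline(
    chunk: Iterable[ToyItem], transforms: List[Transform]
) -> Iterator[ToyItem]:
    # Apply every transform, in order, to every item in the chunk.
    for item in chunk:
        for transform in transforms:
            item = transform(item)
        yield item


chunk = [ToyItem(raw="<p>Hello world. <b>Second</b> sentence.</p>")]
results = list(run_pipeline(chunk, [make_clean_visible, split_sentences]))
```

Ordering matters in a chain like this: the sentence splitter reads `clean_visible`, so the transform that fills it in has to run first.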
Read more at [streamcorpus.org](http://streamcorpus.org/)
Hashes for streamcorpus_pipeline-0.5.42.dev11.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 124fe7dec3c8a323d89d0f4de1e2aa7d17b0cb71562a96159db1941412708e1e
MD5 | be9b2793c8a7c91455097a648396407f
BLAKE2b-256 | 425633d6930404c2535933dd66cfdc293d6be196164f2931d5cf1afd1643e8d3
Hashes for streamcorpus_pipeline-0.5.42.dev11-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | e993707fef13213c7050c4a07b99eff92b41a58b712681115a4c34da0c2b9605
MD5 | 35b2dd61cc71ede8d28ef1a586721592
BLAKE2b-256 | 12e9ba551cd48782ec5228e58d1be5aa3df5d361e0fd0a2d781c8e28a5a36403
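A downloaded file can be checked against the published SHA256 digests with Python's standard `hashlib` module. The following is a minimal sketch; the helper names `sha256_of_file` and `matches_published_digest` are made up for this example and are not part of streamcorpus_pipeline.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in fixed-size blocks so large archives need not fit in memory.
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, assuming the source distribution has been saved locally as `streamcorpus_pipeline-0.5.42.dev11.tar.gz`, calling `matches_published_digest` with that path and the SHA256 value listed for it should return `True` for an intact download.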