Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transforms that generate clean_html and clean_visible, transforms that create labels from hyperlinks to particular sites (e.g. Wikipedia), and taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
Read more at [streamcorpus.org](http://streamcorpus.org/)
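
To make the idea of chaining transforms over the items in a chunk concrete, here is a minimal, self-contained sketch. It does not use the streamcorpus_pipeline API: `ToyItem`, `make_clean_html`, `make_clean_visible`, `label_wikipedia_links`, and `run_pipeline` are all hypothetical names invented for illustration, and the regex-based cleaning is a toy stand-in for what the real transforms do.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class ToyItem:
    """Hypothetical stand-in for streamcorpus.StreamItem (the real schema differs)."""
    raw: str
    clean_html: Optional[str] = None
    clean_visible: Optional[str] = None
    labels: List[str] = field(default_factory=list)


Transform = Callable[[ToyItem], ToyItem]


def make_clean_html(item: ToyItem) -> ToyItem:
    # Toy normalization: collapse runs of whitespace into single spaces.
    item.clean_html = re.sub(r"\s+", " ", item.raw).strip()
    return item


def make_clean_visible(item: ToyItem) -> ToyItem:
    # Toy tag stripping: drop anything that looks like an HTML tag.
    item.clean_visible = re.sub(r"<[^>]+>", "", item.clean_html or "")
    return item


def label_wikipedia_links(item: ToyItem) -> ToyItem:
    # Record every hyperlink target that points at a Wikipedia article.
    pattern = r'href="(https?://[a-z]+\.wikipedia\.org/wiki/[^"]+)"'
    item.labels = re.findall(pattern, item.clean_html or "")
    return item


def run_pipeline(chunk: Iterable[ToyItem],
                 transforms: List[Transform]) -> Iterator[ToyItem]:
    # Apply each transform, in order, to every item in the chunk.
    for item in chunk:
        for transform in transforms:
            item = transform(item)
        yield item


chunk = [ToyItem(raw='<p>See  <a href="http://en.wikipedia.org/wiki/TREC">TREC</a>.</p>')]
stages = [make_clean_html, make_clean_visible, label_wikipedia_links]
processed = list(run_pipeline(chunk, stages))
```

The order of the stages matters in this sketch: the later transforms read the `clean_html` field that the first one fills in.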

Hashes for streamcorpus_pipeline-0.6.8.dev23.tar.gz

Algorithm | Hash digest
---|---
SHA256 | 15078a80deeca602fecefb545f8b65c986eb2bc3acb3f7d5ee200dd8631951aa
MD5 | 6d1deb3cbc561eb55a1dbff7b9fd8931
BLAKE2b-256 | 7a857c6ddfef56e146718f8bd06683abf21c86313657815e5789ccde6810fa7b

Hashes for streamcorpus_pipeline-0.6.8.dev23-py2.7.egg

Algorithm | Hash digest
---|---
SHA256 | dc9a2b01addcc9e17a29f5850db42c91aa1f2440b87e12c123b84d769b18506b
MD5 | ad48fa47928c368996f2ab6166733203
BLAKE2b-256 | 0d799f5392d103deceeb90951ad2fda6da9821d6242cd2ae6638effcb99401d0
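
The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch using only Python's standard `hashlib` module; the helper names `sha256_of_file` and `matches_published_digest` are illustrative and not part of any packaging tool.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size blocks so large archives need not fit in memory.
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, assuming the source distribution was saved in the current directory under its original file name, `matches_published_digest("streamcorpus_pipeline-0.6.8.dev23.tar.gz", "15078a80deeca602fecefb545f8b65c986eb2bc3acb3f7d5ee200dd8631951aa")` should return `True` for an intact download.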