Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transforms that generate clean_html and clean_visible, transforms that create labels from hyperlinks to particular sites (e.g. Wikipedia), and taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
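To give a rough feel for what a clean_visible-style transform does, here is a minimal standalone sketch. It does not use the streamcorpus_pipeline API: the function name `blank_out_tags` is hypothetical, and the choice to overwrite tags with spaces so that character offsets stay aligned with the original markup is an assumption for illustration, not something this page confirms about the real transform.

```python
import re

# Hypothetical helper for illustration only; not part of streamcorpus_pipeline.
# Matches anything that looks like an HTML tag, e.g. "<p>" or "</b>".
_TAG_RE = re.compile(r"<[^>]*>")


def blank_out_tags(html):
    """Replace each tag with spaces of the same length.

    The output has the same length as the input, so an offset into the
    visible text still points at the same position in the original markup.
    """
    return _TAG_RE.sub(lambda match: " " * len(match.group(0)), html)


if __name__ == "__main__":
    raw = "<p>Hi <b>you</b></p>"
    visible = blank_out_tags(raw)
    print(repr(visible))
    print(len(raw) == len(visible))
```

A regex is a crude stand-in for real HTML parsing (it ignores scripts, comments, and entities), so treat this only as a picture of the idea.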
Read more at [streamcorpus.org](http://streamcorpus.org/)
Hashes for streamcorpus_pipeline-0.7.9.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 4c0eec0db69c97888a3c25b7da3f8d96f0471bbcf6679a8ea6cf4e6c6955e782
MD5 | 7ecd70f5ab713f1f83b1e5a4d50e1007
BLAKE2b-256 | 8334dc6254cc9a8e3e9767bef985014924704ca208bc0369172264d9f0474f94
Hashes for streamcorpus_pipeline-0.7.9-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | b47637c1f4dae732ae53ed7e5db364c5b16b376e5c1e59e58eb6fec136b48bdf
MD5 | 2d17c6b93f3615111584766f5c84960c
BLAKE2b-256 | 950c3e2229d2d60ff230b3993a423c199f39a5f219e4f5b6c7f34865eb6ee2f2
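The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch using Python's standard `hashlib`; the helper names `sha256_of_file` and `matches_expected` and the local file path are assumptions for illustration, and the expected value is the SHA256 digest listed for streamcorpus_pipeline-0.7.9.tar.gz.

```python
import hashlib

# SHA256 digest listed for streamcorpus_pipeline-0.7.9.tar.gz.
EXPECTED_SHA256 = "4c0eec0db69c97888a3c25b7da3f8d96f0471bbcf6679a8ea6cf4e6c6955e782"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected.lower()
```

For example, assuming the archive was saved in the current directory, `matches_expected("streamcorpus_pipeline-0.7.9.tar.gz")` should return `True` for an intact download. SHA256 is the better choice of the listed algorithms for this check, since MD5 is no longer considered collision resistant.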