Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transforms that generate clean_html and clean_visible, transforms that create labels from hyperlinks to particular sites (e.g. Wikipedia), and taggers such as LingPipe, Serif, and Factorie, which generate Tokens and Sentences.
Read more at [streamcorpus.org](http://streamcorpus.org/)
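To illustrate the idea of chaining transforms over items, here is a minimal, self-contained sketch. It does not use or reproduce the streamcorpus_pipeline API: `Item`, `strip_tags`, `label_wikipedia_links`, and `run_pipeline` are hypothetical names invented for this example, and the tag stripping is a deliberately naive regular expression rather than the module's actual clean_visible logic.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List


@dataclass
class Item:
    """Hypothetical stand-in for a stream item; NOT streamcorpus.StreamItem."""
    raw: str
    clean_visible: str = ""
    labels: List[str] = field(default_factory=list)


# A transform takes an item and returns the (possibly modified) item.
Transform = Callable[[Item], Item]

TAG_RE = re.compile(r"<[^>]*>")
WIKI_LINK_RE = re.compile(r'href="(https?://en\.wikipedia\.org/wiki/[^"]+)"')


def strip_tags(item: Item) -> Item:
    # Naive assumption: replace each tag with a space, then collapse whitespace.
    text = TAG_RE.sub(" ", item.raw)
    item.clean_visible = " ".join(text.split())
    return item


def label_wikipedia_links(item: Item) -> Item:
    # Record every hyperlink target that points at English Wikipedia.
    item.labels = WIKI_LINK_RE.findall(item.raw)
    return item


def run_pipeline(items: Iterable[Item], transforms: List[Transform]) -> Iterator[Item]:
    # Apply each transform in order to every item and yield the result.
    for item in items:
        for transform in transforms:
            item = transform(item)
        yield item
```

A usage example under the same assumptions: `list(run_pipeline([Item(raw=html)], [strip_tags, label_wikipedia_links]))` returns the items with `clean_visible` and `labels` filled in. Keeping each transform as a plain function makes the stages easy to reorder or test in isolation.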
Hashes for streamcorpus_pipeline-0.5.39.dev10.tar.gz
Algorithm | Hash digest
---|---
SHA256 | b59a2e2785a16d19ace75ebfb3e1cdcadfd74be20c86b7df17c7d718a0f94d3c
MD5 | a9a8916c3b137982a278bafaf6729ade
BLAKE2b-256 | 471b247280d3ea158ca50cc6d9e1ed67308d90d6e5289d1694031f46b577c75c
Hashes for streamcorpus_pipeline-0.5.39.dev10-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | 0c58937f32b72e640ee560970f4f78e5c63693c0fce7f91eea5cc21268c4d452
MD5 | fec354a2ab75603508550be741f89502
BLAKE2b-256 | 90515815fa8a000500942f32de9965df93d13849ce50621b50b3647dfe6f0a44
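The SHA256 digests listed above can be checked against a downloaded file. The following is a minimal sketch using only Python's standard `hashlib` module; the helper names `sha256_of_file` and `matches_digest` are hypothetical and are not part of streamcorpus_pipeline or any packaging tool.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size blocks so large archives are not loaded into memory at once.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Compare a file's SHA256 digest with an expected hex string (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, assuming the source distribution was saved in the current directory, `matches_digest("streamcorpus_pipeline-0.5.39.dev10.tar.gz", "b59a2e2785a16d19ace75ebfb3e1cdcadfd74be20c86b7df17c7d718a0f94d3c")` should return `True` for an intact download.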