Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transforms that produce clean_html and clean_visible text, transforms that create labels from hyperlinks to particular sites (e.g. Wikipedia), and taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
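To make the hyperlink-labeling idea concrete, here is a minimal standalone sketch that collects anchor text for links pointing at one site. It uses only the Python 3 standard library and does not call streamcorpus_pipeline; the names `SiteLinkCollector` and `collect_site_links` are hypothetical, and the package's own transform may work quite differently.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class SiteLinkCollector(HTMLParser):
    """Collect (anchor text, href) pairs for links pointing at one site.

    Hypothetical helper for illustration; not part of streamcorpus_pipeline.
    """

    def __init__(self, domain):
        super().__init__()
        self.domain = domain
        self.links = []      # finished (anchor_text, href) pairs
        self._href = None    # href of the matching <a> currently open
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        host = urlparse(href).hostname or ""
        # Accept the domain itself and any subdomain of it.
        if host == self.domain or host.endswith("." + self.domain):
            self._href = href
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def collect_site_links(html, domain="wikipedia.org"):
    """Return (anchor_text, href) pairs for links into `domain`."""
    parser = SiteLinkCollector(domain)
    parser.feed(html)
    parser.close()
    return parser.links
```

Each returned pair could then serve as a candidate label: the anchor text marks the mention, and the link target identifies the entity it refers to.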
Read more at [streamcorpus.org](http://streamcorpus.org/)
Hashes for streamcorpus_pipeline-0.5.53.dev1.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 02dfed351b77ca1d78fc73cdf090984f6dbe7e5d8b828c13e7b863e9a2917aba
MD5 | feb7259a0e2549dad341acdbca0e663c
BLAKE2b-256 | e36c25e36ddc157a64e6c682ddee0fca6101a595dbb586c1fa8842987b70887e
Hashes for streamcorpus_pipeline-0.5.53.dev1-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | 3bf40cdf76bb08a60f902aa5fcae8e6751bbb5fa8b1e148cd73a286b32e8d1a4
MD5 | 2530848a1b8f6c188a31b6ae34cedac5
BLAKE2b-256 | b4fc60693d7cc219f3c6a8bfdf9f445fc89b49f4221ec048345caa16bd4e261c
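The digests listed above can be used to check a downloaded file before installing it. The following is a minimal sketch using Python's standard hashlib module; the helper names `sha256_of_file` and `matches_published_digest` are hypothetical, and the local file path is whatever you saved the download as.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops once read() returns empty bytes.
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, calling `matches_published_digest` with the path of the downloaded streamcorpus_pipeline-0.5.53.dev1.tar.gz and the SHA256 value from its table should return True for an intact download.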