Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transforms that produce clean_html and clean_visible, transforms that create labels from hyperlinks to particular sites (e.g. Wikipedia), and taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
Read more at [streamcorpus.org](http://streamcorpus.org/)
Hashes for streamcorpus_pipeline-0.7.0.tar.gz (source distribution)

Algorithm | Hash digest
---|---
SHA256 | 4ac09b00beb9d816ee2d23af2c8c853add0f07711e1b53454fdff821fc781a99
MD5 | 23aec2d7fa6b5ed83a82b31d89e2c6f0
BLAKE2b-256 | 3cae7874894083966abc0bd9862da9fb957b7a4d57a7749ea191e89cd6b61f1d
Hashes for streamcorpus_pipeline-0.7.0-py2.7.egg (built distribution)

Algorithm | Hash digest
---|---
SHA256 | 1047443349b3b5e1a61fc7dc972d6d7079c8f1cd75b435f188aca460e8ac1c74
MD5 | b8aaa28d1e60820e7e57d7c047dd31a8
BLAKE2b-256 | 0b31f31b94bd7974a0b385eb903d98adcd324dc1045c013f365808d19cce77f3
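
A downloaded file can be checked against the SHA256 digests listed above before installing it. The following is a minimal sketch using only the Python standard library; the helper names `sha256_of` and `matches_published_hash`, and the local path passed to them, are assumptions for illustration and are not part of streamcorpus_pipeline.

```python
import hashlib

# SHA256 digests copied from the hash listings above for the 0.7.0 files.
EXPECTED_SHA256 = {
    "streamcorpus_pipeline-0.7.0.tar.gz":
        "4ac09b00beb9d816ee2d23af2c8c853add0f07711e1b53454fdff821fc781a99",
    "streamcorpus_pipeline-0.7.0-py2.7.egg":
        "1047443349b3b5e1a61fc7dc972d6d7079c8f1cd75b435f188aca460e8ac1c74",
}


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path` (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in fixed-size blocks so large archives are not loaded at once.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_hash(path, filename):
    """True if the file at `path` has the digest listed for `filename`."""
    return sha256_of(path) == EXPECTED_SHA256[filename]
```

For example, `matches_published_hash("downloads/streamcorpus_pipeline-0.7.0.tar.gz", "streamcorpus_pipeline-0.7.0.tar.gz")` would return `True` only if the local copy (the `downloads/` path is assumed) matches the published digest.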