Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transforms that generate clean_html and clean_visible text, transforms that create labels from hyperlinks to particular sites (e.g. Wikipedia), and taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
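The idea of running a chain of transforms over each item can be sketched as follows. This is a minimal illustration only: `Item`, `make_clean_html`, `make_clean_visible`, `label_wikipedia_links`, and `run_pipeline` are hypothetical stand-ins written for this example, not the actual streamcorpus or streamcorpus_pipeline API, and the regex-based cleaning is a toy substitute for the real transforms.

```python
import re
from dataclasses import dataclass, field
from html import unescape


@dataclass
class Item:
    """Hypothetical stand-in for a StreamItem (not the real streamcorpus class)."""
    raw: str
    clean_html: str = ""
    clean_visible: str = ""
    labels: list = field(default_factory=list)


def make_clean_html(item):
    # Toy cleanup (assumption): drop <script> blocks from the raw markup.
    item.clean_html = re.sub(r"(?is)<script.*?</script>", "", item.raw)
    return item


def make_clean_visible(item):
    # Toy cleanup (assumption): strip tags and collapse whitespace.
    text = re.sub(r"<[^>]+>", " ", item.clean_html)
    item.clean_visible = " ".join(unescape(text).split())
    return item


def label_wikipedia_links(item):
    # Record every hyperlink that points at English Wikipedia as a label.
    pattern = r'href="(https?://en\.wikipedia\.org/wiki/[^"]+)"'
    item.labels = re.findall(pattern, item.clean_html)
    return item


def run_pipeline(items, transforms):
    # Apply each transform, in order, to every item.
    for item in items:
        for transform in transforms:
            item = transform(item)
        yield item
```

Each transform takes an item and returns it with one more field filled in, so the order of the list passed to `run_pipeline` matters: in this sketch `make_clean_visible` and `label_wikipedia_links` both read `clean_html`, so `make_clean_html` has to run first.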
Read more at [streamcorpus.org](http://streamcorpus.org/)
Hashes for streamcorpus_pipeline-0.5.31.tar.gz
Algorithm | Hash digest
---|---
SHA256 | f2dad6895b4692a3a8cbf3d32cc77e3f0b02e4b4959c88a5861f30a75497da9a
MD5 | 00c2b313451e3d07117fefcc925d4d40
BLAKE2b-256 | 0ab8b5f0545436633f104aba7539021cc8d3bd58930541e86894dfc7350b2fb6
Hashes for streamcorpus_pipeline-0.5.31-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | 105aadc79695c91aedcd3ca149fe1a97a5682f5f094e3d5633ff8e8c26c51a40
MD5 | 0ba802641217c5da9afc45eb1909734d
BLAKE2b-256 | 1b8725a7361d28e400304171eb48892b90ef03895ab47f3bbd549c24d2f83e6f
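
A downloaded file can be checked against the digests listed above with Python's standard `hashlib` module. The sketch below is one way to do it; the helper names `file_digest` and `matches` are illustrative, and the local path passed in is whatever location the file was saved to.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of the file at `path`, read in chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read fixed-size blocks until read() returns an empty bytes object.
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def matches(path, expected_hex, algorithm="sha256"):
    """True if the file's digest equals the published hex digest."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

For example, `matches("streamcorpus_pipeline-0.5.31.tar.gz", "f2dad689...")` with the full SHA256 value from the table should return `True` for an intact download. The same helpers work with `algorithm="md5"`; the BLAKE2b-256 value would instead need `hashlib.blake2b(digest_size=32)`, since `hashlib.new("blake2b")` produces a 512-bit digest by default.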