Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transforms that generate clean_html and clean_visible, transforms that create labels from hyperlinks to particular sites (e.g. Wikipedia), and taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
Read more at [streamcorpus.org](http://streamcorpus.org/)
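To illustrate the general idea of chaining transforms over a stream of items, here is a minimal sketch in Python 3. It does not use the streamcorpus_pipeline API: `Item`, `make_clean_html`, `make_clean_visible`, and `run_pipeline` are hypothetical names invented for this example, and the cleaning logic is deliberately simplistic compared with what a real transform would do.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class Item:
    """Hypothetical stand-in for a stream item (NOT streamcorpus.StreamItem)."""
    raw: str
    clean_html: Optional[str] = None
    clean_visible: Optional[str] = None


# A transform takes an item and returns it (possibly modified),
# or returns None to drop the item from the stream.
Transform = Callable[[Item], Optional[Item]]


def make_clean_html(item: Item) -> Item:
    # Illustrative only: collapse runs of whitespace in the raw markup.
    item.clean_html = re.sub(r"\s+", " ", item.raw).strip()
    return item


def make_clean_visible(item: Item) -> Item:
    # Illustrative only: remove anything that looks like a tag.
    source = item.clean_html if item.clean_html is not None else item.raw
    item.clean_visible = re.sub(r"<[^>]*>", "", source).strip()
    return item


def run_pipeline(items: Iterable[Item],
                 transforms: List[Transform]) -> Iterator[Item]:
    """Apply each transform in order; skip items that a transform drops."""
    for item in items:
        current: Optional[Item] = item
        for transform in transforms:
            current = transform(current)
            if current is None:
                break
        if current is not None:
            yield current


if __name__ == "__main__":
    docs = [Item(raw="<p>Hello   <b>world</b></p>")]
    for result in run_pipeline(docs, [make_clean_html, make_clean_visible]):
        print(result.clean_visible)
```

The ordering matters in this sketch: the tag-stripping step reads the output of the earlier step when it is available, which mirrors the way later stages of a pipeline build on fields filled in by earlier ones.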
Download files
Hashes for streamcorpus_pipeline-0.6.1.tar.gz
Algorithm | Hash digest
---|---
SHA256 | e8cf4c0c3a9005a257d6d48389f42a3f6837a3fb555a528835d7ca450542fcea
MD5 | 0f0e0bcf6fc4006542b3b462f12407d7
BLAKE2b-256 | b848b5cdefc4e79fc0407c4a6beccdcd885f46b0c0b94432088de4b9ae93f04c
Hashes for streamcorpus_pipeline-0.6.1-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | 9da66791057abe7d3728d1181bd67ce0a0deec71ed1e13637e1958a8cedbcba3
MD5 | 42fcb182aa239e223433660be643bd81
BLAKE2b-256 | 0a83873713f5b979761edeefbaa476d2aac02839aacf982149ad5a41bfb935ae
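A downloaded file can be checked against the digests listed above. The following is a minimal sketch using only Python's standard `hashlib` module; the helper names `file_digests` and `matches` are made up for this example, and it assumes that "BLAKE2b-256" refers to BLAKE2b with a 32-byte digest.

```python
import hashlib


def file_digests(path, chunk_size=65536):
    """Return hex digests of the file at `path`, keyed by algorithm name."""
    hashers = {
        "SHA256": hashlib.sha256(),
        "MD5": hashlib.md5(),
        # Assumption: BLAKE2b-256 means BLAKE2b with a 32-byte (256-bit) digest.
        "BLAKE2b-256": hashlib.blake2b(digest_size=32),
    }
    with open(path, "rb") as fh:
        # Read in blocks so large archives are not loaded into memory at once.
        for block in iter(lambda: fh.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(block)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def matches(path, algorithm, expected):
    """True if the file's digest for `algorithm` equals the expected hex string."""
    return file_digests(path)[algorithm] == expected.strip().lower()
```

For instance, `matches("streamcorpus_pipeline-0.6.1.tar.gz", "SHA256", "e8cf4c0c3a9005a257d6d48389f42a3f6837a3fb555a528835d7ca450542fcea")` should return `True` for an intact copy of the source distribution, assuming the file sits in the current directory.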