Tools for building streamcorpus objects, such as those used in TREC.
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transform functions that generate clean_html and clean_visible, create labels from hyperlinks to particular sites (e.g. Wikipedia), and run taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
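To make the idea of chained transforms concrete, here is a minimal sketch in plain Python. It does not use the streamcorpus_pipeline API: `Item`, `strip_tags`, `split_sentences`, and `run_pipeline` are hypothetical names invented for illustration, and the tag stripping and sentence splitting are deliberately naive stand-ins for the real clean_visible transform and the taggers.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class Item:
    """Hypothetical stand-in for a stream item; NOT streamcorpus.StreamItem."""
    raw: str
    clean_visible: Optional[str] = None
    sentences: List[List[str]] = field(default_factory=list)


# A transform takes an item and returns it (possibly modified).
Transform = Callable[[Item], Item]


def strip_tags(item: Item) -> Item:
    # Crude illustration only: replace each markup tag with a space,
    # then collapse runs of whitespace.
    text = re.sub(r"<[^>]+>", " ", item.raw)
    item.clean_visible = " ".join(text.split())
    return item


def split_sentences(item: Item) -> Item:
    # Naive tagger stand-in: split after ., !, or ? and then on whitespace.
    text = item.clean_visible or ""
    parts = re.split(r"(?<=[.!?])\s+", text)
    item.sentences = [part.split() for part in parts if part]
    return item


def run_pipeline(items: Iterable[Item], transforms: List[Transform]) -> Iterator[Item]:
    # Apply every transform, in order, to each item.
    for item in items:
        for transform in transforms:
            item = transform(item)
        yield item
```

A caller would pass an iterable of items and an ordered list of transforms, for example `run_pipeline(items, [strip_tags, split_sentences])`. Order matters in this sketch because the sentence splitter reads the text produced by the earlier step.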
Read more at [streamcorpus.org](http://streamcorpus.org/)
Hashes for streamcorpus_pipeline-0.5.42.dev21.tar.gz
Algorithm | Hash digest
---|---
SHA256 | dfd5377b910986372f0ee67e5b054895044590cbcbc295b9887f1729a82d6411
MD5 | 2362739b67a818d0e5b24ee32298d777
BLAKE2b-256 | 5874f8d8d8de56f9b8aa832ed89f6f8d9878557a236effdc1d45930f9e41d72b
Hashes for streamcorpus_pipeline-0.5.42.dev21-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | d96e06f69159ed6e3d8c90878a4be4930bf7a49cffeb021429fc4a188c0c9441
MD5 | e44ade3d971ae4ead51165a982aaec77
BLAKE2b-256 | 43a91d8121d0de60bfbf7edfdb8a7adf9fa9d9294b904fcd1a656ef683f370c3
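The SHA256 digests listed above can be compared against a downloaded file. The following is a minimal sketch using Python's standard `hashlib`; the helper name `sha256_of` and the local file path in the usage note are assumptions for illustration, not part of this package.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path` (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in fixed-size blocks so large archives are not loaded into memory at once.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

Assuming the source distribution was saved locally as `streamcorpus_pipeline-0.5.42.dev21.tar.gz`, the download matches the published digest when `sha256_of("streamcorpus_pipeline-0.5.42.dev21.tar.gz")` equals `"dfd5377b910986372f0ee67e5b054895044590cbcbc295b9887f1729a82d6411"`.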