Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transform functions that generate clean_html and clean_visible, create labels from hyperlinks to particular sites (e.g. Wikipedia), and run taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
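To illustrate the idea of creating labels from hyperlinks to a particular site, here is a minimal standalone sketch that uses only the Python standard library. It is not the streamcorpus_pipeline implementation and does not use its API: the names `AnchorCollector` and `labels_for_site` are hypothetical, and a "label" here is simplified to a (target URL, anchor text) pair.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class AnchorCollector(HTMLParser):
    """Hypothetical helper: collect (href, anchor text) pairs from HTML."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the <a> element currently open, if any
        self._text = []    # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def labels_for_site(html, domain):
    """Return (target URL, anchor text) pairs for links that point at `domain`.

    Assumption for this sketch: a link "points at" the domain when its host
    equals the domain or is a subdomain of it.
    """
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    labels = []
    for href, text in collector.links:
        host = urlparse(href).hostname or ""
        if host == domain or host.endswith("." + domain):
            labels.append((href, text))
    return labels
```

Calling `labels_for_site(page_html, "wikipedia.org")` on a page would keep only the anchors whose targets are on Wikipedia, which is the kind of signal a hyperlink-based labeling transform can draw on.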
Read more at [streamcorpus.org](http://streamcorpus.org/)
Download files
Hashes for streamcorpus_pipeline-0.5.32.dev5.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 247f0da341030d9bcc60ad604ab6ed74a9b40d4ea5669482126cd9934bbb03b7
MD5 | 95a3843827091b5cf7a657de01fe0416
BLAKE2b-256 | 7b3b19b980afc96827b78e2cd4848dbb4c25ad838dda295fe5bd87e6c12169ef
Hashes for streamcorpus_pipeline-0.5.32.dev5-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | 7d6745ea63e4f6f668e09af8bd2e65e64324d107d69c442eee65a2159a8c7495
MD5 | ea5728976370258620a02a1bb549e161
BLAKE2b-256 | 75c258651161018f5ae660747b74b7c84c6cd5575f3b6688aa1cb8d280cd2b70
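The digests listed above can be used to check that a downloaded file is intact. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `sha256_of_file` and `matches_published_digest` are hypothetical, and the local file path is whatever you saved the download as.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, after downloading streamcorpus_pipeline-0.5.32.dev5.tar.gz, `matches_published_digest("streamcorpus_pipeline-0.5.32.dev5.tar.gz", "247f0da341030d9bcc60ad604ab6ed74a9b40d4ea5669482126cd9934bbb03b7")` should return `True` if the file matches the SHA256 value listed for the source distribution.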