Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transform functions that generate clean_html and clean_visible and that create labels from hyperlinks to particular sites (e.g. Wikipedia), as well as taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
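To make the idea of chained transform functions concrete, here is a minimal, self-contained sketch. It does not use the streamcorpus or streamcorpus_pipeline APIs: the `Item` class, the three transform functions, and `run_pipeline` are hypothetical names invented for illustration, and the regex-based cleaning is a toy stand-in for the real transforms.

```python
import html
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class Item:
    """Hypothetical stand-in for a stream item (NOT streamcorpus.StreamItem)."""
    raw: str
    clean_html: Optional[str] = None
    clean_visible: Optional[str] = None
    labels: List[str] = field(default_factory=list)


def make_clean_html(item: Item) -> Item:
    # Toy cleanup: drop <script> and <style> blocks from the raw markup.
    item.clean_html = re.sub(r"(?is)<(script|style)\b.*?</\1>", "", item.raw)
    return item


def make_clean_visible(item: Item) -> Item:
    # Toy cleanup: remove tags, decode entities, collapse whitespace.
    text = re.sub(r"<[^>]+>", "", item.clean_html or "")
    item.clean_visible = " ".join(html.unescape(text).split())
    return item


def label_wikipedia_links(item: Item) -> Item:
    # Record every hyperlink that points at a Wikipedia article.
    for href in re.findall(r'href="([^"]+)"', item.raw):
        if "wikipedia.org/wiki/" in href:
            item.labels.append(href)
    return item


def run_pipeline(items: Iterable[Item],
                 transforms: List[Callable[[Item], Item]]) -> Iterator[Item]:
    # Apply each transform in order to every item, yielding the results lazily.
    for item in items:
        for transform in transforms:
            item = transform(item)
        yield item


if __name__ == "__main__":
    raw = ('<p>See <a href="http://en.wikipedia.org/wiki/TREC">TREC</a>.</p>'
           '<script>x=1</script>')
    steps = [make_clean_html, make_clean_visible, label_wikipedia_links]
    for result in run_pipeline([Item(raw=raw)], steps):
        print(result.clean_visible)
        print(result.labels)
```

Order matters in this sketch: `make_clean_visible` reads the output of `make_clean_html`, so a tagger stage would naturally come last, after the text has been cleaned.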
Read more at [streamcorpus.org](http://streamcorpus.org/)
Download files
Source Distribution
Built Distribution
Hashes for streamcorpus_pipeline-0.5.43.dev7.tar.gz
Algorithm | Hash digest
---|---
SHA256 | fa2283f6a8259c4f774ed274707da5e14fd827bf9fd86ca0a8b02ffd7e27f0bb
MD5 | a14e49993b25140ee7963f1fe2894668
BLAKE2b-256 | 4aac1ff564a87c62dd87796487374b400ae2e4f0b66a7458afeb375132d885b0
Hashes for streamcorpus_pipeline-0.5.43.dev7-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | b6fde53441e186982b015fa8375206dc32a36151eb73e781e0df05c46164576b
MD5 | c46fb44c3b552dad2e407ddca8599355
BLAKE2b-256 | 7acd08df86b19cc5957fe66209cfe03d50fd5b6d3bbe595370898e9d71fe0335
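The digests listed above can be compared against a downloaded file before installing it. The following is a minimal sketch using only Python's standard `hashlib` module; the helper names `sha256_of_file` and `verify_sha256` are illustrative, and the local file path in the usage comment is an assumption about where the download was saved.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size blocks until read() returns an empty bytes object.
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_sha256(path, expected_hex):
    """Return True if the file's SHA256 digest matches the published one."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage (assumes the source distribution was saved in the current directory):
# verify_sha256(
#     "streamcorpus_pipeline-0.5.43.dev7.tar.gz",
#     "fa2283f6a8259c4f774ed274707da5e14fd827bf9fd86ca0a8b02ffd7e27f0bb",
# )
```

SHA256 is used here because MD5 is no longer considered collision-resistant; the MD5 value is mainly useful for detecting accidental corruption.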