Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transforms that generate clean_html and clean_visible, transforms that create labels from hyperlinks to particular sites (e.g. Wikipedia), and taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
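The general shape of such a pipeline, a sequence of transforms applied to each item in turn, can be illustrated with a small self-contained sketch. It does not use the streamcorpus or streamcorpus_pipeline APIs: `ToyItem`, `make_clean_html`, `make_clean_visible`, and `run_pipeline` are hypothetical names invented for this illustration, and the cleaning rules are deliberately simplistic.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class ToyItem:
    """Hypothetical stand-in for a stream item (not the real StreamItem schema)."""
    raw: str
    clean_html: Optional[str] = None
    clean_visible: Optional[str] = None


def make_clean_html(item: ToyItem) -> ToyItem:
    # Toy normalization: collapse runs of whitespace in the raw markup.
    item.clean_html = re.sub(r"\s+", " ", item.raw).strip()
    return item


def make_clean_visible(item: ToyItem) -> ToyItem:
    # Toy tag stripping: replace each tag with a space, then tidy whitespace.
    # Assumes make_clean_html has already run on this item.
    text = re.sub(r"<[^>]+>", " ", item.clean_html or "")
    item.clean_visible = re.sub(r"\s+", " ", text).strip()
    return item


def run_pipeline(
    items: Iterable[ToyItem],
    transforms: List[Callable[[ToyItem], ToyItem]],
) -> Iterator[ToyItem]:
    # Apply every transform, in order, to every item.
    for item in items:
        for transform in transforms:
            item = transform(item)
        yield item


items = [ToyItem(raw="<p>Hello   <b>world</b></p>")]
results = list(run_pipeline(items, [make_clean_html, make_clean_visible]))
print(results[0].clean_visible)
```

Keeping each transform as a plain function from item to item means stages can be reordered or omitted, which is the property a configurable pipeline relies on. The real module's transforms and configuration are documented at streamcorpus.org.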
Read more at [streamcorpus.org](http://streamcorpus.org/)
Hashes for streamcorpus_pipeline-0.7.17.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 060f1495a86569d2ae888ec925657836831b2660b74bec03549040dea5eae086
MD5 | d606705e4de912277432b746885b59e2
BLAKE2b-256 | 58b09a41016044e4ede62a291a6dbc38d12b1393939faf4e044199b351e2cffc
Hashes for streamcorpus_pipeline-0.7.17-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | 8429c149653d3c63a2b00cdde93d4f92e9eeb60026bad5344ddf1510bfc9a979
MD5 | d043b4ec4a19bf200830b7d5684a2db2
BLAKE2b-256 | 9d4e39a392305650eccac55cbf47859ba5b1995a01d1b6befc786cb14efe12f4
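A downloaded file can be checked against a published digest before installing it. The following is a minimal sketch using only Python's standard `hashlib` module. The helper names `sha256_of_file` and `matches_published_digest` are made up for this example, and the local file path is whatever location the archive was saved to.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# SHA256 digest listed for streamcorpus_pipeline-0.7.17.tar.gz.
EXPECTED_SHA256 = "060f1495a86569d2ae888ec925657836831b2660b74bec03549040dea5eae086"
```

Calling `matches_published_digest("streamcorpus_pipeline-0.7.17.tar.gz", EXPECTED_SHA256)` on a local copy should return `True` for an intact download. SHA256 is used here because MD5 is no longer considered collision-resistant.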