Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transform functions that produce clean_html and clean_visible text, create labels from hyperlinks to particular sites (e.g. Wikipedia), and run taggers such as LingPipe, Serif, and Factorie, which generate Tokens and Sentences.
Read more at [streamcorpus.org](http://streamcorpus.org/)
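To make the idea of chained transforms concrete, here is a minimal sketch in plain Python. It does not use the streamcorpus_pipeline API: `ToyItem`, `make_clean_html`, `make_clean_visible`, `label_wikipedia_links`, and `run_pipeline` are all hypothetical names invented for illustration, and the regex-based cleaning is a toy stand-in for the real transforms.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional


@dataclass
class ToyItem:
    """Hypothetical stand-in for a stream item; NOT streamcorpus.StreamItem."""
    raw: str
    clean_html: Optional[str] = None
    clean_visible: Optional[str] = None
    labels: List[str] = field(default_factory=list)


def make_clean_html(item: ToyItem) -> ToyItem:
    # Toy normalization (assumption): collapse runs of whitespace.
    item.clean_html = re.sub(r"\s+", " ", item.raw).strip()
    return item


def make_clean_visible(item: ToyItem) -> ToyItem:
    # Toy tag stripping (assumption): drop anything that looks like a tag.
    item.clean_visible = re.sub(r"<[^>]+>", "", item.clean_html or "")
    return item


def label_wikipedia_links(item: ToyItem) -> ToyItem:
    # Record every hyperlink target that points at a Wikipedia host.
    pattern = r'href="(https?://[a-z]+\.wikipedia\.org/[^"]+)"'
    item.labels = re.findall(pattern, item.clean_html or "")
    return item


def run_pipeline(
    chunk: Iterable[ToyItem],
    transforms: List[Callable[[ToyItem], ToyItem]],
) -> List[ToyItem]:
    # Apply each transform, in order, to every item in the chunk.
    processed = []
    for item in chunk:
        for transform in transforms:
            item = transform(item)
        processed.append(item)
    return processed


chunk = [ToyItem(raw='<p>See  <a href="https://en.wikipedia.org/wiki/TREC">TREC</a>.</p>')]
result = run_pipeline(chunk, [make_clean_html, make_clean_visible, label_wikipedia_links])
print(result[0].clean_visible)  # See TREC.
print(result[0].labels)         # ['https://en.wikipedia.org/wiki/TREC']
```

Order matters in this sketch: the visible-text and labeling steps both read the `clean_html` field, so the step that fills it has to run first.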
Download files
Source Distribution
Hashes for streamcorpus_pipeline-0.5.54.tar.gz
Algorithm | Hash digest
---|---
SHA256 | a5e62d2e5423ccfdd81a2c0b91e7c659e505d078a9345935317dcca1814d2f46
MD5 | 1b31cbc8a65afb49819f578f28a4a700
BLAKE2b-256 | f0a85d67d1a7e8785d5a1872fe0f10af78b95ce3cca6c33da6714b4743eac2be

Built Distribution
Hashes for streamcorpus_pipeline-0.5.54-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | a7e4cd57d66791b2dfeb0e5278c3327056f5efacc87824b45ea7aeebddc00e65
MD5 | 181d6f58b4fb5a0014b2b49bf3e5305f
BLAKE2b-256 | 3c5a036a03f7d24c1eda2b784d6bff09d8dd7d98128057857659ecd7caae7a44
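
A downloaded archive can be checked against the published digests before installing. The following is a minimal sketch using only Python's standard `hashlib` module; the helper names `sha256_of_file` and `matches_expected` and the local file path are assumptions for illustration, while the expected digest is the SHA256 value listed for streamcorpus_pipeline-0.5.54.tar.gz.

```python
import hashlib

# SHA256 digest published for streamcorpus_pipeline-0.5.54.tar.gz.
EXPECTED_SHA256 = "a5e62d2e5423ccfdd81a2c0b91e7c659e505d078a9345935317dcca1814d2f46"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Compare a file's digest to the expected one, ignoring hex case."""
    return sha256_of_file(path) == expected.lower()
```

For example, `matches_expected("streamcorpus_pipeline-0.5.54.tar.gz")` would return `True` only if a file at that (assumed) local path hashes to the published value.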