Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transform functions that generate clean_html and clean_visible text, create labels from hyperlinks to particular sites (e.g. Wikipedia), and run taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
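To illustrate the idea of chaining transform functions over a stream of items, here is a minimal, self-contained sketch. It does not use the streamcorpus_pipeline API: `Item`, `visible_text`, `wikipedia_labels`, and `run_pipeline` are hypothetical stand-ins written with the Python standard library only, and the real transforms are considerably more thorough.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class Item:
    """Hypothetical stand-in for a StreamItem; NOT the streamcorpus class."""
    raw: str
    clean_visible: Optional[str] = None
    labels: List[str] = field(default_factory=list)


class _TextAndLinks(HTMLParser):
    """Collects text nodes and <a href> targets from an HTML string."""

    def __init__(self) -> None:
        super().__init__()
        self.text_parts: List[str] = []
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)

    def handle_data(self, data):
        # Crude: script and style contents are not filtered out.
        self.text_parts.append(data)


def _parse(raw: str) -> _TextAndLinks:
    parser = _TextAndLinks()
    parser.feed(raw)
    parser.close()
    return parser


def visible_text(item: Item) -> Item:
    """Assumed transform: fill clean_visible with whitespace-normalized text."""
    item.clean_visible = " ".join("".join(_parse(item.raw).text_parts).split())
    return item


def wikipedia_labels(item: Item) -> Item:
    """Assumed transform: record hyperlinks that point at Wikipedia articles."""
    item.labels = [h for h in _parse(item.raw).hrefs if "wikipedia.org/wiki/" in h]
    return item


def run_pipeline(items: Iterable[Item],
                 transforms: List[Callable[[Item], Item]]) -> Iterator[Item]:
    """Apply each transform, in order, to every item in the stream."""
    for item in items:
        for transform in transforms:
            item = transform(item)
        yield item
```

Calling `list(run_pipeline(items, [visible_text, wikipedia_labels]))` applies the transforms in the order given; keeping each stage as a plain function over one item makes the stages easy to reorder or test in isolation.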
Read more at [streamcorpus.org](http://streamcorpus.org/)
Hashes for streamcorpus_pipeline-0.7.11.tar.gz (source distribution)
Algorithm | Hash digest
---|---
SHA256 | e6e941d35343a4bb5f86a3b689c0e3682e9a914d8ffe1452339685f821b7511d
MD5 | 765350b76986a5e0f726fd2c3132d928
BLAKE2b-256 | 5ebf9627f17609f418d77e6c8736355b6edde623d4feeec0d3ab9f4a44f26b94
Hashes for streamcorpus_pipeline-0.7.11-py2.7.egg (built distribution)
Algorithm | Hash digest
---|---
SHA256 | 59ceaea860bca089f5fb39fc55367d89ffb6c5c936199042f8a7dedb86690728
MD5 | dc096872982a0bc64d48018085f41eae
BLAKE2b-256 | a6b0ccd90a4dec8796d2f8eaef2a84dd3a19d8f333119775416868bb0130938e
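
The digests listed above can be checked against a downloaded file with Python's standard `hashlib` module. The following is a minimal sketch; the helper names `file_digests` and `matches` are illustrative and not part of streamcorpus_pipeline, and it assumes "BLAKE2b-256" means BLAKE2b with a 32-byte digest.

```python
import hashlib


def file_digests(path, chunk_size=1 << 20):
    """Return hex digests of the file at `path`, keyed by algorithm name."""
    hashers = {
        "SHA256": hashlib.sha256(),
        "MD5": hashlib.md5(),
        # Assumption: BLAKE2b-256 is BLAKE2b truncated to a 32-byte digest.
        "BLAKE2b-256": hashlib.blake2b(digest_size=32),
    }
    with open(path, "rb") as handle:
        # Read in chunks so large archives are never loaded into memory whole.
        for block in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(block)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def matches(path, algorithm, expected_hex):
    """True if the file's digest for `algorithm` equals the published value."""
    return file_digests(path)[algorithm] == expected_hex.strip().lower()
```

For example, `matches("streamcorpus_pipeline-0.7.11.tar.gz", "SHA256", "e6e941d3...")`, given the full digest from the table, should return `True` for an intact download. SHA256 is the digest to rely on; MD5 is useful only as a check against accidental corruption.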