Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transforms that generate clean_html and clean_visible, transforms that create labels from hyperlinks to particular sites (e.g. Wikipedia), and taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
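To make the idea of chained transforms concrete, here is a minimal, self-contained sketch. It does not use the streamcorpus_pipeline API. `SketchItem`, `make_clean_html`, `make_clean_visible`, `label_wikipedia_links`, and `run_pipeline` are hypothetical stand-ins written for illustration, and the regex-based cleaning is a deliberate simplification of what a real HTML cleaner would do.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class SketchItem:
    """Hypothetical stand-in for a StreamItem; NOT the real streamcorpus class."""
    raw: str
    clean_html: Optional[str] = None
    clean_visible: Optional[str] = None
    labels: List[str] = field(default_factory=list)


Transform = Callable[[SketchItem], SketchItem]


def make_clean_html(item: SketchItem) -> SketchItem:
    # Toy cleanup (assumption): drop <script> and <style> blocks only.
    item.clean_html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", item.raw)
    return item


def make_clean_visible(item: SketchItem) -> SketchItem:
    # Blank out each tag with spaces of equal length, so character
    # offsets in clean_visible line up with those in clean_html.
    source = item.clean_html if item.clean_html is not None else item.raw
    item.clean_visible = re.sub(
        r"(?s)<[^>]*>", lambda m: " " * len(m.group(0)), source
    )
    return item


def label_wikipedia_links(item: SketchItem) -> SketchItem:
    # Record every hyperlink that targets a Wikipedia host as a label.
    source = item.clean_html if item.clean_html is not None else item.raw
    item.labels = re.findall(
        r'href="(https?://[a-z]+\.wikipedia\.org/[^"]*)"', source
    )
    return item


def run_pipeline(
    items: Iterable[SketchItem], transforms: List[Transform]
) -> Iterator[SketchItem]:
    # Apply the transforms to each item in order, one item at a time.
    for item in items:
        for transform in transforms:
            item = transform(item)
        yield item
```

Each transform takes an item and returns it with one more field filled in, so stages can be reordered or omitted by editing the list passed to `run_pipeline`.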
Read more at [streamcorpus.org](http://streamcorpus.org/)
Hashes for streamcorpus_pipeline-0.5.32.dev3.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 417add3e9da08333ec7765cd078b79e73d0a5ab523b5d11a75b409627276be0a
MD5 | 76b85cffd9f4ad2d2fd7ae08b60568da
BLAKE2b-256 | 81f3658d8cb4e4478018756464dcb4fbc76d352f58ad254e03b88c28775739d5
Hashes for streamcorpus_pipeline-0.5.32.dev3-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | a14f8ccab37df54594c688482b0f4aa19a691818d758e954ac777eca441b61bd
MD5 | 584b83b489bfaf84dbb678a85ef7788f
BLAKE2b-256 | e0094394fa2185a5c73c874d6490da961c43936c1b1418db0e838c4a0dd4a0a0
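A downloaded file can be checked against the SHA256 digests listed above with Python's standard `hashlib` module. The following is a minimal sketch; the helper names `sha256_of_file` and `matches_published_digest` are illustrative, and the local file path is whatever location the file was saved to.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size blocks until read() returns an empty bytes object.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare the file's digest with a published hex digest, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_digest("streamcorpus_pipeline-0.5.32.dev3.tar.gz", "417add3e9da08333ec7765cd078b79e73d0a5ab523b5d11a75b409627276be0a")` should return `True` for an intact copy of the source distribution, assuming the file sits in the current directory.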