Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transform functions that generate clean_html and clean_visible, transforms that create labels from hyperlinks to particular sites (e.g. Wikipedia), and taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
Read more at [streamcorpus.org](http://streamcorpus.org/)
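The idea of passing items through a sequence of transforms can be sketched in plain Python. Everything in this sketch is a hypothetical stand-in: `Item`, `strip_tags`, `collapse_whitespace`, and `run_transforms` are illustrative names and are not part of the streamcorpus_pipeline or streamcorpus API, and the tag stripping is a naive regular expression, not the library's clean_visible logic.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class Item:
    """Hypothetical stand-in for a stream item; NOT streamcorpus.StreamItem."""
    raw: str
    visible: Optional[str] = None


def strip_tags(item: Item) -> Item:
    # Naive illustration: replace anything that looks like a tag with a space.
    item.visible = re.sub(r"<[^>]*>", " ", item.raw)
    return item


def collapse_whitespace(item: Item) -> Item:
    # Normalize runs of whitespace in the text left by the previous stage.
    if item.visible is not None:
        item.visible = " ".join(item.visible.split())
    return item


def run_transforms(
    items: Iterable[Item],
    transforms: Iterable[Callable[[Item], Item]],
) -> Iterator[Item]:
    """Apply each transform, in order, to every item (assumed pipeline shape)."""
    stages = list(transforms)
    for item in items:
        for stage in stages:
            item = stage(item)
        yield item
```

A call such as `list(run_transforms([Item("<p>some text</p>")], [strip_tags, collapse_whitespace]))` runs both stages over each item in order. Keeping every stage as a function from item to item is one simple way to make stages composable and reorderable; the real pipeline's configuration and stage interfaces may differ.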
Hashes for streamcorpus_pipeline-0.5.29.dev2.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 2bcaab88d88338868368c6c938d254c6724e9b7667412f9a2eb6e8b387e190a4
MD5 | 4c56fe790af399d65a2de2450a7c4493
BLAKE2b-256 | d903017a4d72421c2d98374eeb52cacf020c8cf0341ada6d6270be4b6aada036
Hashes for streamcorpus_pipeline-0.5.29.dev2-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | fdb04f9967873d73e5a29b35e914c7f6b39aee53f806c484ede9be74d7769341
MD5 | 289202723eb04fdff656939360d793b4
BLAKE2b-256 | 2d9d0d7a77fb677d243324d334489410452a677d3cc7f07ac15e64850f694e19
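To check a downloaded file against the digests listed above, one option is to hash it locally with Python's standard `hashlib` module. This is a minimal sketch: the helper names `sha256_of` and `matches_published_digest` are assumptions made for illustration, and the path passed in is wherever the archive was saved locally.

```python
import hashlib

# SHA256 digest listed above for streamcorpus_pipeline-0.5.29.dev2.tar.gz
EXPECTED_SHA256 = "2bcaab88d88338868368c6c938d254c6724e9b7667412f9a2eb6e8b387e190a4"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path` (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in fixed-size blocks so large archives are not loaded at once.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_digest(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA256 equals the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

For the egg file, pass its SHA256 digest from the second table as `expected`. SHA256 is used here in preference to MD5, which is not collision resistant.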