Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transform functions that produce clean_html and clean_visible text, create labels from hyperlinks to particular sites (e.g. Wikipedia), and run taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
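To make the idea of chaining transforms concrete, here is a minimal sketch of the pattern described above. It does not use the streamcorpus_pipeline API: `ItemSketch`, `strip_tags`, `label_wikipedia_links`, and `run_pipeline` are hypothetical names invented for illustration, and the regex-based tag stripping is only a stand-in for the real clean_visible transform.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class ItemSketch:
    """Hypothetical stand-in for a stream item (NOT streamcorpus.StreamItem)."""
    raw: str
    clean_visible: Optional[str] = None
    labels: List[str] = field(default_factory=list)


# A transform takes an item and returns the (possibly modified) item.
Transform = Callable[[ItemSketch], ItemSketch]


def strip_tags(item: ItemSketch) -> ItemSketch:
    """Illustrative only: replace markup with spaces and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", item.raw)
    item.clean_visible = " ".join(text.split())
    return item


def label_wikipedia_links(item: ItemSketch) -> ItemSketch:
    """Illustrative only: record hyperlinks that point at Wikipedia."""
    pattern = r'href="(https?://[a-z]+\.wikipedia\.org/[^"]+)"'
    item.labels = re.findall(pattern, item.raw)
    return item


def run_pipeline(items: Iterable[ItemSketch],
                 transforms: List[Transform]) -> Iterator[ItemSketch]:
    """Apply each transform, in order, to every item in a chunk-like iterable."""
    for item in items:
        for transform in transforms:
            item = transform(item)
        yield item


if __name__ == "__main__":
    chunk = [ItemSketch(raw='<p>See <a href="http://en.wikipedia.org/wiki/TREC">TREC</a>.</p>')]
    for result in run_pipeline(chunk, [strip_tags, label_wikipedia_links]):
        print(result.clean_visible, result.labels)
```

Keeping each transform as a plain function from item to item means the order of stages is just the order of a list, which is one simple way to model a configurable pipeline.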
Read more at [streamcorpus.org](http://streamcorpus.org/)
Download files
Source Distribution
Hashes for streamcorpus_pipeline-0.6.8.dev22.tar.gz
Algorithm | Hash digest
---|---
SHA256 | b4cc62a175efbb4a6fd204400cd22434ca9ff2a0e3b328886eaf41544da23964
MD5 | 8516395165783aa00d1867937e54cb6d
BLAKE2b-256 | b8c44ddae6dc7baaa5d3319bb53b9646412fc88b5bcbbb636d0f962115aa31c3

Built Distribution
Hashes for streamcorpus_pipeline-0.6.8.dev22-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | bbc79d5ad1532c097119c1026899ca481cb00a4a4208a2e796105f90b8466c81
MD5 | f7f0f004375f22b86a4968208d1c3cf6
BLAKE2b-256 | 4479ff0152ce6ad8fe3c5332fcc621320801f7c4c8e989f8e1518d7d72a854c5
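
The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch using only Python's standard `hashlib`; the helper names `sha256_of_file` and `matches_published_digest` are made up for this example, and the local file path is an assumption about where the download was saved.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, assuming the source distribution was saved in the current directory, `matches_published_digest("streamcorpus_pipeline-0.6.8.dev22.tar.gz", "b4cc62a175efbb4a6fd204400cd22434ca9ff2a0e3b328886eaf41544da23964")` should return `True` for an intact download.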