Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transform functions that generate clean_html and clean_visible, create labels from hyperlinks to particular sites (e.g. Wikipedia), and run taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
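To make the idea of chained transforms concrete, here is a minimal sketch in plain Python 3. It does not use the streamcorpus_pipeline API: the `Item` class, the two transform functions, and `run_pipeline` are hypothetical stand-ins invented for illustration, and the regex-based tag stripping is far cruder than a real clean_visible transform would need to be.

```python
import re
from dataclasses import dataclass, field
from html import unescape


@dataclass
class Item:
    """Hypothetical stand-in for a stream item; NOT streamcorpus.StreamItem."""
    raw_html: str
    clean_visible: str = ""
    labels: list = field(default_factory=list)


def make_clean_visible(item):
    # Replace every tag with a space, decode entities, collapse whitespace.
    text = re.sub(r"<[^>]+>", " ", item.raw_html)
    item.clean_visible = " ".join(unescape(text).split())
    return item


def label_wikipedia_links(item):
    # Collect hyperlink targets that point at a Wikipedia site.
    pattern = r'href="(https?://[a-z]+\.wikipedia\.org/[^"]+)"'
    item.labels = re.findall(pattern, item.raw_html)
    return item


def run_pipeline(items, transforms):
    # Apply each transform in order to every item, yielding the result.
    for item in items:
        for transform in transforms:
            item = transform(item)
        yield item
```

Each transform takes an item and returns it with one more field filled in, so the set of stages can be changed by editing the list passed to `run_pipeline`.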
Read more at [streamcorpus.org](http://streamcorpus.org/)
Download files
Hashes for streamcorpus_pipeline-0.5.42.dev27.tar.gz
Algorithm | Hash digest
---|---
SHA256 | a7350415ba5b762ca861aced5ad5b039ca46b752da51c24e3ec53af6ee2db3c1
MD5 | f97e8ed8a50139c4f5979b444d3fda5f
BLAKE2b-256 | b3d1b9faecf8c436ebb043b7117bb89ff8ab846fada4aa6aca0debaf946665e5
Hashes for streamcorpus_pipeline-0.5.42.dev27-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | 394bbaf0b0ffeed0b1cbbc72449051648459da54bd52f6c1326af59e4ea60e92
MD5 | 71383aac14290ea1172e6abaffb570fe
BLAKE2b-256 | c8cf719f389b11d02fe68ede48dcd24ca1f677ead33f8b4dfa7c281bd566a6e9
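The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch using only the standard-library `hashlib` module; the helper names `sha256_of_file` and `matches_published_digest` are illustrative and not part of streamcorpus_pipeline, and the local file path in the usage comment is an assumption about where the download was saved.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage, assuming the source distribution was saved to the current directory:
# expected = "a7350415ba5b762ca861aced5ad5b039ca46b752da51c24e3ec53af6ee2db3c1"
# matches_published_digest("streamcorpus_pipeline-0.5.42.dev27.tar.gz", expected)
```

SHA256 is used here rather than MD5 because MD5 is no longer considered collision resistant; the same pattern works for the other listed algorithms by swapping the `hashlib` constructor.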