Tools for building streamcorpus objects, such as those used in TREC.
Project description
streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.
The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transform functions that generate clean_html and clean_visible, create labels from hyperlinks to particular sites (e.g. Wikipedia), and run taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
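As a rough illustration of the idea of chaining transforms over items, here is a minimal sketch in plain Python. It does not use the streamcorpus_pipeline API: `ToyItem`, `make_clean_html`, `make_clean_visible`, and `run_pipeline` are hypothetical names invented for this example, and the regex-based cleaning is a toy stand-in for the real transforms.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class ToyItem:
    """Hypothetical stand-in for a stream item; NOT streamcorpus.StreamItem."""
    raw: str
    clean_html: Optional[str] = None
    clean_visible: Optional[str] = None


# A transform takes an item and returns it, possibly with new fields filled in.
Transform = Callable[[ToyItem], ToyItem]


def make_clean_html(item: ToyItem) -> ToyItem:
    # Toy normalization (assumption): collapse runs of whitespace.
    item.clean_html = re.sub(r"\s+", " ", item.raw).strip()
    return item


def make_clean_visible(item: ToyItem) -> ToyItem:
    # Toy tag stripping (assumption): drop anything that looks like <...>.
    source = item.clean_html if item.clean_html is not None else item.raw
    item.clean_visible = re.sub(r"<[^>]+>", "", source)
    return item


def run_pipeline(items: Iterable[ToyItem],
                 transforms: Iterable[Transform]) -> Iterator[ToyItem]:
    # Apply every transform, in order, to every item.
    transforms = list(transforms)
    for item in items:
        for transform in transforms:
            item = transform(item)
        yield item


if __name__ == "__main__":
    docs = [ToyItem(raw="<p>Hello   <b>world</b></p>")]
    for doc in run_pipeline(docs, [make_clean_html, make_clean_visible]):
        print(doc.clean_visible)
```

Ordering matters in this sketch: `make_clean_visible` reads the output of `make_clean_html` when it is available, which mirrors the general pattern of later stages building on earlier ones.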
Read more at [streamcorpus.org](http://streamcorpus.org/)
Hashes for streamcorpus_pipeline-0.5.32.dev7.tar.gz
Algorithm | Hash digest
---|---
SHA256 | c3153f2e1783f70dddc07a69fb4b5508e22a0274e1be3a064e87efebfa7fe55c
MD5 | 3ef99233842fbfd32c42588eb0ef582f
BLAKE2b-256 | 34a1c9c165a819b85f96ba89f024e0809fba37a63a59c9eefb997d610013a4bc

Hashes for streamcorpus_pipeline-0.5.32.dev7-py2.7.egg
Algorithm | Hash digest
---|---
SHA256 | ab07c64b8103fea48dffa93a09fbfeeba29d7d174ae9ce25840e8d450218a384
MD5 | aabb94e86443a90db4e6f40bfd85f661
BLAKE2b-256 | 4f743a980ff197920549a29bf637086768ca124788752b3afef8df4d18b4bb71
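
The digests listed above can be used to check a downloaded file before installing it. The following is a minimal sketch using only the Python standard library; the helper names `sha256_of_file` and `matches_expected` are hypothetical, and the local file path in the usage lines is an assumption about where the download was saved.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size blocks until read() returns an empty bytes object.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare the file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Assumed local path; adjust to wherever the archive was downloaded.
    archive = "streamcorpus_pipeline-0.5.32.dev7.tar.gz"
    published = "c3153f2e1783f70dddc07a69fb4b5508e22a0274e1be3a064e87efebfa7fe55c"
    print("OK" if matches_expected(archive, published) else "MISMATCH")
```

SHA256 is used here rather than MD5 because MD5 is no longer considered collision resistant; the same pattern works with `hashlib.md5` if only that digest is available.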