Tools for building streamcorpus objects, such as those used in TREC.

Project description

streamcorpus_pipeline is a document processing pipeline that assembles streamcorpus objects from raw data sets.

The streamcorpus_pipeline Python module contains tools for processing streamcorpus.StreamItem objects stored in Chunks. It includes transforms that produce clean_html and clean_visible, transforms that create labels from hyperlinks to particular sites (e.g. Wikipedia), and taggers such as LingPipe, Serif, and Factorie, which produce Tokens and Sentences.
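
To make the idea of chaining transforms over the items in a chunk concrete, here is a minimal, self-contained sketch. It does not use the streamcorpus_pipeline API: the dict-based items, the stage functions `visible_text_stage` and `wikipedia_label_stage`, and `run_pipeline` are all hypothetical stand-ins built on the Python 3 standard library, and the real clean_visible and hyperlink-labeling transforms may behave differently.

```python
from html.parser import HTMLParser


class _TextAndLinks(HTMLParser):
    """Collects visible text and anchor hrefs; skips <script>/<style> bodies."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.links = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.text.append(data)


def _parse(html):
    parser = _TextAndLinks()
    parser.feed(html)
    parser.close()
    return parser


def visible_text_stage(item):
    """Hypothetical stand-in for a 'visible text' transform."""
    parsed = _parse(item["html"])
    # Collapse runs of whitespace so the result is a single clean line.
    item["visible"] = " ".join("".join(parsed.text).split())
    return item


def wikipedia_label_stage(item):
    """Hypothetical stand-in: record hyperlinks that point at Wikipedia."""
    parsed = _parse(item["html"])
    item["labels"] = [h for h in parsed.links if "wikipedia.org/wiki/" in h]
    return item


def run_pipeline(chunk, stages):
    """Apply each stage, in order, to every item in an iterable 'chunk'."""
    for item in chunk:
        for stage in stages:
            item = stage(item)
        yield item


chunk = [{
    "html": '<p>See <a href="http://en.wikipedia.org/wiki/TREC">TREC</a> '
            'and <a href="http://example.com/">this</a>.</p>'
            "<script>var x = 1;</script>"
}]
results = list(run_pipeline(chunk, [visible_text_stage, wikipedia_label_stage]))
```

Each stage takes an item and returns it with one more field filled in, so stages can be reordered or dropped without touching the driver loop.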

Read more at [streamcorpus.org](http://streamcorpus.org/)

Download files

Source Distribution

streamcorpus_pipeline-0.7.11.tar.gz (9.8 MB)

Built Distribution

streamcorpus_pipeline-0.7.11-py2.7.egg (10.3 MB)

File details

Details for the file streamcorpus_pipeline-0.7.11.tar.gz.

File hashes

Hashes for streamcorpus_pipeline-0.7.11.tar.gz
Algorithm Hash digest
SHA256 e6e941d35343a4bb5f86a3b689c0e3682e9a914d8ffe1452339685f821b7511d
MD5 765350b76986a5e0f726fd2c3132d928
BLAKE2b-256 5ebf9627f17609f418d77e6c8736355b6edde623d4feeec0d3ab9f4a44f26b94

File details

Details for the file streamcorpus_pipeline-0.7.11-py2.7.egg.

File hashes

Hashes for streamcorpus_pipeline-0.7.11-py2.7.egg
Algorithm Hash digest
SHA256 59ceaea860bca089f5fb39fc55367d89ffb6c5c936199042f8a7dedb86690728
MD5 dc096872982a0bc64d48018085f41eae
BLAKE2b-256 a6b0ccd90a4dec8796d2f8eaef2a84dd3a19d8f333119775416868bb0130938e
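
The digests listed above can be checked against a downloaded file with a few lines of Python 3. This is a minimal sketch using only the standard `hashlib` module; the helper names `file_digests` and `matches` are made up for illustration, and it assumes that "BLAKE2b-256" means BLAKE2b with a 32-byte digest.

```python
import hashlib


def file_digests(path, chunk_size=1 << 20):
    """Return hex digests of a file for the three algorithms listed above."""
    hashers = {
        "SHA256": hashlib.sha256(),
        "MD5": hashlib.md5(),
        # Assumption: BLAKE2b-256 is BLAKE2b truncated to a 32-byte digest.
        "BLAKE2b-256": hashlib.blake2b(digest_size=32),
    }
    with open(path, "rb") as f:
        # Read in blocks so large archives are not loaded into memory at once.
        for block in iter(lambda: f.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(block)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def matches(path, algorithm, expected_hex):
    """True if the file's digest for `algorithm` equals the published value."""
    return file_digests(path)[algorithm] == expected_hex.strip().lower()
```

For example, `matches("streamcorpus_pipeline-0.7.11.tar.gz", "SHA256", "e6e941d35343a4bb5f86a3b689c0e3682e9a914d8ffe1452339685f821b7511d")` should return `True` for an intact copy of the source distribution, assuming the file sits in the current directory.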
