Tools for organizing collections of text for entity-centric stream processing.

Project description

Discussion forum: https://groups.google.com/forum/#!forum/streamcorpus

streamcorpus provides a common data interchange format for document processing pipelines that apply natural language processing tools to large streams of text. It offers these benefits:

  • Based on Thrift, so it is fast to serialize and deserialize and has easy-to-use bindings for many languages.

  • Convenience methods for serializing batches of documents into flat files, which we call Chunks. For example, the TREC KBA corpus is stored in streamcorpus.Chunk files; see http://trec-kba.org/

  • Unifies NLP data structures so that one pipeline can use different taggers in the same way. For example, tokenization, sentence chunking, entity typing, human-generated annotation, and offsets are all defined such that output from most tagging tools can be easily transformed into streamcorpus structures. It is currently in use with LingPipe and Stanford CoreNLP, and we are working towards testing with more tools.

  • Once a StreamItem has one or more sets of tokenized Sentence arrays, one can easily run downstream analytics that leverage the attributes on the token stream.

  • Makes timestamping a central part of corpus organization, because every corpus is inherently anchored in history. Streaming data is increasingly important in many applications.

  • Has basic versioning and builds on Thrift’s extensibility.

See if/streamcorpus.thrift for details.
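
The ideas in the list above (timestamped items, one Sentence array per tagger, and downstream analytics over token attributes) can be illustrated with a minimal, self-contained Python sketch. The dataclasses and field names below (Token, Sentence, StreamItem, epoch_ticks, entity_type) are hypothetical stand-ins chosen for illustration only; they are not the definitions in if/streamcorpus.thrift and not the API of the streamcorpus Python module.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class Token:
    # Hypothetical token record; the field names are assumptions.
    text: str
    entity_type: Optional[str] = None  # e.g. "PER" or "ORG"; None if untyped


@dataclass
class Sentence:
    tokens: List[Token] = field(default_factory=list)


@dataclass
class StreamItem:
    # Each item is anchored in time by an epoch timestamp (seconds, UTC).
    epoch_ticks: float
    # One Sentence array per tagger, keyed by a tagger id such as "lingpipe".
    sentences: Dict[str, List[Sentence]] = field(default_factory=dict)


def entity_type_counts(item: StreamItem, tagger_id: str) -> Counter:
    """Count the typed tokens that one tagger produced for one item."""
    counts = Counter()
    for sentence in item.sentences.get(tagger_id, []):
        for token in sentence.tokens:
            if token.entity_type is not None:
                counts[token.entity_type] += 1
    return counts


def in_stream_order(items: List[StreamItem]) -> List[StreamItem]:
    """Return the items sorted by timestamp, oldest first."""
    return sorted(items, key=lambda item: item.epoch_ticks)


first = StreamItem(
    epoch_ticks=datetime(2012, 1, 1, tzinfo=timezone.utc).timestamp(),
    sentences={
        "lingpipe": [
            Sentence([Token("Alice", "PER"), Token("joined"), Token("Acme", "ORG")])
        ]
    },
)
second = StreamItem(
    epoch_ticks=datetime(2011, 6, 1, tzinfo=timezone.utc).timestamp(),
    sentences={"lingpipe": [Sentence([Token("Bob", "PER")])]},
)

ordered = in_stream_order([first, second])
counts = entity_type_counts(first, "lingpipe")
```

Because the analytic only reads the per-tagger Sentence arrays, the same function works for output from any tagger stored under a different key; a tagger id with no output simply yields an empty count.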

See py/ for a Python module built around the results of running thrift --gen py streamcorpus.thrift, which is done by py/Makefile.

If you are interested in building a streamcorpus package around the Thrift generated code for another language, please post to the discussion forum: https://groups.google.com/forum/#!forum/streamcorpus


Download files


Source Distribution

streamcorpus-0.2.18.tar.gz (28.8 kB view details)

Uploaded Source

File details

Details for the file streamcorpus-0.2.18.tar.gz.

File metadata

  • Download URL: streamcorpus-0.2.18.tar.gz
  • Upload date:
  • Size: 28.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for streamcorpus-0.2.18.tar.gz
Algorithm Hash digest
SHA256 819f3cc71fd3b9de3a969f60bc4d3c40b1f83a07c0b122fa7378fa68dc973ed8
MD5 826e50aab9cdc357f7e52c088970bf7e
BLAKE2b-256 5e5e33064b3ad5d0e94205e3effbca361e86ae1108b00268bd2266e43da73326
