
Semi-structured Document Model

Project description

Semi-structured strongly typed document storage model for Python 3+


Overview

The document model consists of the following concepts:

  • Document: The overall container for everything (all nodes, layers, and texts must be contained within it)

  • Document fields: A single dictionary per document for storing metadata

  • Text: The basic text representation, a wrapped string that tracks spans

  • Text Span: A subsequence of a text, which can always be converted into a plain string with str(span)

  • Layer: A collection of nodes

  • Layer Schema: The definition of field names and types used when the document is serialized

  • Node: A single node with zero or more fields holding values

  • Node fields: Key-value pairs

All parts of the document are accessible through three properties:

from docria.model import Document

doc = Document()
doc.props  # The document metadata dictionary
doc.layers # The layer dictionary, mapping layer name to node collection
doc.texts  # The text dictionary, mapping text name to Text object
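
To illustrate Text and Text Spans concretely, here is a minimal sketch continuing the snippet above (it only uses add_text, slicing, and str(), the same operations that appear in the larger example below; the text name "greeting" is arbitrary):

greeting = doc.add_text("greeting", "Hello Lund")

span = greeting[0:5]  # a Text Span covering "Hello"
print(str(span))      # convert the span to a plain string -> "Hello"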

Example usage

How to create a document and insert nodes

from docria.model import Document, Node, DataTypes as T
import re
# Naive tokenizer: matches letter runs, digit runs, or single non-whitespace characters
tokenizer = re.compile(r"[a-zA-Z]+|[0-9]+|[^\s]")

doc = Document()

# Create a new text context called 'main' with the text 'This code was written in Lund, Sweden.'
main_text = doc.add_text("main", "This code was written in Lund, Sweden.")
#                                 01234567890123456789012345678901234567
#                                 0         1         2         3

# Create a new layer with fields: id, text and head.
#
# Fields:
#   id is an int32
#   text is a span from context 'main'
#   head is a node reference into the token layer (the layer we are creating)
#
tokens = doc.add_layer("token", id=T.int32, text=main_text.spantype, head=T.noderef("token"))

# Adding nodes: Solution 1
i = 0
token_zero = None
token_two = None
for m in tokenizer.finditer(str(main_text)):
    token_node = tokens.add(id=i, text=main_text[m.start():m.end()])
    if i == 0:
        token_zero = token_node
    elif i == 2:
        token_two = token_node

    i += 1

token_two["head"] = token_zero

# Adding nodes: Solution 2, if adding many nodes at once
token_list = []

i = 0
for m in tokenizer.finditer(str(main_text)):
    # This node is dangling; it is not attached to the layer until add_many is called
    token = Node({"id": i, "text": main_text[m.start():m.end()]})
    token_list.append(token)
    i += 1

token_list[2]["head"] = token_list[0]
tokens.add_many(token_list)
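
Attached nodes can be read back with the same bracket syntax (a small sketch; reading a field with node["field"] is assumed to mirror the assignments shown above, and str() on a span follows the overview):

third_token = token_list[2]
print(str(third_token["text"]))          # the covered text: "was"
print(str(third_token["head"]["text"]))  # text of the referenced head token: "This"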

Document I/O

The docria.storage module provides a DocumentIO class with factory methods for creating readers and writers.

How to create a file writer and reader

from docria.storage import DocumentIO

with DocumentIO.write("output-file.docria") as docria_writer:
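    # 'documents' is assumed here to be any iterable of Document objects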
    for doc in documents:
        docria_writer.write(doc)


with DocumentIO.read("output-file.docria") as docria_reader:
    for doc in docria_reader:
        # Do something with doc, which is a Document instance
        pass

Raw reading and writing of documents:

Using the MsgpackCodec

from docria.codec import MsgpackCodec

# To encode a document into binary data
binarydata = MsgpackCodec.encode(doc)

# To decode binary data (from any location) back into a document
doc = MsgpackCodec.decode(binarydata)
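
The encoded bytes are an ordinary binary blob and can be stored or transmitted like any other bytes object, for example with plain Python file I/O (a sketch; the file name is arbitrary):

# Write the encoded document to a regular file ...
with open("single-doc.bin", "wb") as f:
    f.write(binarydata)

# ... and decode it again later
with open("single-doc.bin", "rb") as f:
    doc_again = MsgpackCodec.decode(f.read())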

Notes

Use regular object references when referring to a node.

The settings used for pretty printing are controlled by docria.printout.options.

By convention, pretty printing outputs [layer name]#[internal id], where the internal id can be used to retrieve the node. However, this id is only guaranteed to be stable as long as the layer is unchanged; once the layer is modified, the id is invalid.
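
As a rough illustration of the pretty-printing convention (a sketch: only the module path docria.printout.options and the "[layer name]#[internal id]" format come from the note above; that a layer renders this way when printed is an assumption):

from docria.printout import options  # pretty-printing settings live here

# Assumption: printing a layer uses the pretty-printing convention and
# renders node references as "<layer name>#<internal id>", e.g. "token#0".
print(tokens)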

