TEI (Text Encoding Initiative) parser to extract information and store it in a Neo4j database

TEI parser

This is a parser written in Python 3 that takes TEI-XML documents as input and writes them to a Neo4j graph database.

It makes use of the following existing libraries:

  • Beautiful Soup 4, an easy-to-use XML parser.
  • spaCy. Currently we use the German language package de_core_news_sm to parse the text.
  • Py2neo v4, a library for working with the Neo4j database.

Installation

$ pip install tei2neo
$ pip install "de_core_news_sm @ https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-3.6.0/de_core_news_sm-3.6.0.tar.gz"

Synopsis

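from py2neo import Graph
from tei2neo import parse, GraphUtils

# path to the TEI-XML document to import (here the file queried further below)
file = '20_MS_221_1.xml'

graph = Graph(host="localhost", user="neo4j", password="password")
doc, status, soup = parse(
    filename=file,
    start_with_tag='TEI',
    idno='20-MS-221'
)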
tx = graph.begin()
doc.save(tx)
tx.commit()

ut = GraphUtils(graph)
paras = ut.paragraphs_for_filename('20_MS_221_1.xml')

# create unhyphenated tokens
for para in paras:
    tokens = ut.tokens_in_paragraph(para)
    ut.create_unhyphenated(tokens)

# show hyphenated text
for token in ut.tokens_in_paragraph(paras[5], concatenated=0):
    if 'lb' in token.labels:
        print(' | ', end='')
    print(token.get('string',''), end='')
    print(token.get('whitespace', ''), end='')

# show concatenated (unhyphenated) version of the text
for token in ut.tokens_in_paragraph(paras[5], concatenated=1):
    if 'lb' in token.labels:
        print(' ', end='')

    print(token.get('string',''), end='')
    print(token.get('whitespace', ''), end='')

How the parser works

A TEI document can be constructed in various ways, and many elements behave very similarly. This parser therefore expects certain elements and treats them in a specific manner.

Elements that affect all following elements

handShift

A handShift element affects all elements that follow it, until another handShift element is encountered.

Example

From now on everything is written in «Latein» and a pencil is being used (medium=Blei):

<handShift new="#hWH" medium="Blei" script="Latein"/>

Now we switch to «Kurrent» script and use black ink (STinte):

<handShift new="#hGS" medium="STinte" script="Kurrent"/>

Appearance in Neo4j

As we have seen, a handShift element contains three attributes:

  • new="#hWH"
  • medium="Blei"
  • script="Latein"

These attributes are passed to all Token elements that follow a handShift. Previous attributes are not deleted: if only the medium changes from «Blei» to «STinte», all other attributes stay the same. The handShift element itself does not appear as a node in Neo4j.
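
The propagation rule described above can be illustrated with a small, self-contained sketch. It is not the tei2neo implementation: it uses the standard library's xml.etree.ElementTree instead of Beautiful Soup, and the sample document, the use of w elements as tokens and the function name propagate_hand_shift are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second handShift changes only the medium.
SAMPLE = """<p>
  <handShift new="#hWH" medium="Blei" script="Latein"/>
  <w>alpha</w>
  <w>beta</w>
  <handShift medium="STinte"/>
  <w>gamma</w>
</p>"""


def propagate_hand_shift(xml_text):
    """Return (text, attributes) pairs for every <w> element, in document order."""
    current = {}  # attributes inherited from the handShift elements seen so far
    tokens = []
    for element in ET.fromstring(xml_text).iter():
        if element.tag == "handShift":
            # Only attributes present on this handShift are overwritten;
            # all others carry over unchanged.
            current.update(element.attrib)
        elif element.tag == "w":
            # Copy the state so later handShifts do not alter earlier tokens.
            tokens.append((element.text, dict(current)))
    return tokens


tokens = propagate_hand_shift(SAMPLE)
for text, attrs in tokens:
    print(text, attrs)
```

In this sketch the handShift elements produce no output of their own, mirroring the statement that they do not become nodes; only the tokens carry the inherited attributes.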

delSpan

A delSpan element works much like a handShift element, as it alters the appearance of all the following text until it reaches its spanTo target:

<delSpan spanTo="#A20_MS_215_12_3"/>
... (a lot of XML code here)
<anchor xml:id="A20_MS_215_12_3"/>

Appearance in Neo4j

  • Both the delSpan and the anchor appear as additional nodes.
  • All elements between the delSpan and the anchor element receive an additional delSpan label.
  • A delSpan attribute is added to each of these elements; its value is equal to the xml:id attribute of the anchor.
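
The three rules above can be sketched in plain Python. This is only an illustration under stated assumptions, not the parser's actual code: it uses xml.etree.ElementTree from the standard library, treats w elements as the content between delSpan and anchor, and represents each future node as a dictionary. The function name mark_del_span and the record layout are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample reusing the ids from the example above.
SAMPLE = """<p>
  <w>kept</w>
  <delSpan spanTo="#A20_MS_215_12_3"/>
  <w>deleted</w>
  <w>also</w>
  <anchor xml:id="A20_MS_215_12_3"/>
  <w>after</w>
</p>"""


def mark_del_span(xml_text):
    """Return one record per element (excluding the root), in document order."""
    root = ET.fromstring(xml_text)
    open_target = None  # xml:id of the anchor that ends the current delSpan
    records = []
    for element in root.iter():
        if element is root:
            continue
        labels, props = [element.tag], {}
        if element.tag == "delSpan":
            # spanTo is a "#id" reference; strip the leading "#".
            open_target = element.get("spanTo", "").lstrip("#")
        elif (element.tag == "anchor" and open_target is not None
              and element.get(XML_ID) == open_target):
            open_target = None  # the span ends at its anchor
        elif open_target is not None:
            # Everything inside the span gets the extra label and attribute.
            labels.append("delSpan")
            props["delSpan"] = open_target
        records.append({"tag": element.tag, "text": element.text,
                        "labels": labels, "props": props})
    return records


records = mark_del_span(SAMPLE)
for record in records:
    print(record["tag"], record["text"], record["labels"], record["props"])
```

The delSpan and anchor elements stay in the record list as entries of their own, while only the elements between them receive the additional label and the delSpan property.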
