TEI (Text Encoding Initiative) parser that extracts information and stores it in a Neo4j database

TEI parser

This is a parser written in Python 3 that takes TEI-XML documents as input and writes them to a Neo4j graph database.

It makes use of the following existing libraries:

  • Beautiful Soup 4, an easy-to-use XML parser.
  • spaCy. Currently we use the German language package de_core_news_sm to parse the text.
  • Py2neo v4, a library for working with the Neo4j database.

Installation

$ virtualenv venv
$ source venv/bin/activate
$ pip install -e TEIParse
$ python -m spacy download de_core_news_sm
$ pip install ../semper-backend   # for the GraphUtils class

Synopsis

from py2neo import Graph
from tei2neo import parse
from semper_backend.utils import GraphUtils

graph = Graph(host="localhost", user="neo4j", password="password")
filename = '20_MS_221_1.xml'  # assumed path to the TEI-XML file to import
doc, status, soup = parse(
    filename=filename,
    start_with_tag='TEI',
    idno='20-MS-221'
)
tx = graph.begin()
doc.save(tx)
tx.commit()

ut = GraphUtils(graph)
paras = ut.paragraphs_for_filename('20_MS_221_1.xml')

# create unhyphenated tokens
for para in paras:
    tokens = ut.tokens_in_paragraph(para)
    ut.create_unhyphenated(tokens)

# show hyphenated text
for token in ut.tokens_in_paragraph(paras[5], concatenated=0):
    if 'lb' in token.labels:
        print(' | ', end='')
    print(token.get('string',''), end='')
    print(token.get('whitespace', ''), end='')

# show concatenated (non-hyphenated) version of the text
for token in ut.tokens_in_paragraph(paras[5], concatenated=1):
    if 'lb' in token.labels:
        print(' ', end='')

    print(token.get('string',''), end='')
    print(token.get('whitespace', ''), end='')

How the parser works

A TEI document can be constructed in various ways, and many elements work very similarly. For this reason, the parser expects certain elements and treats each of them in a specific manner.

Elements that affect all following elements

handShift

A handShift element affects all elements that follow it in the document, until another handShift element is encountered.

Example

From this point on, everything is written in «Latein» script with a pencil (medium="Blei"):

<handShift new="#hWH" medium="Blei" script="Latein"/>

Now we switch to «Kurrent» script and use black ink (STinte):

<handShift new="#hGS" medium="STinte" script="Kurrent"/>

Appearance in Neo4j

As we have seen, a handShift element contains three attributes:

  • new="#hWH"
  • medium="Blei"
  • script="Latein"

These attributes are passed on to every Token element that follows the handShift. Attributes set by an earlier handShift are not deleted: if only the medium changes from «Blei» to «STinte», all other attributes keep their previous values. The handShift element itself does not appear as a node in Neo4j.
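
The carry-over behaviour described above can be sketched in a few lines. This is only an illustration, not the code tei2neo uses: it relies on the standard library's xml.etree.ElementTree instead of Beautiful Soup, it omits the TEI namespace, and the <w> elements and the propagate_handshift function are hypothetical stand-ins for the parser's Token handling.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: <w> stands in for whatever becomes a Token.
SAMPLE = """<p>
<handShift new="#hWH" medium="Blei" script="Latein"/>
<w>alpha</w>
<handShift medium="STinte"/>
<w>beta</w>
</p>"""


def propagate_handshift(xml_text):
    """Return (text, attributes) pairs with the handShift state applied."""
    current = {}  # running handShift state
    tokens = []
    # iter() walks the tree in document order
    for elem in ET.fromstring(xml_text).iter():
        if elem.tag == "handShift":
            # merge: attributes not named here keep their previous value
            current.update(elem.attrib)
        elif elem.tag == "w":
            # copy the state so later handShifts do not alter this token
            tokens.append((elem.text, dict(current)))
    return tokens


tokens = propagate_handshift(SAMPLE)
for text, attrs in tokens:
    print(text, attrs)
```

In this sketch the second token keeps new and script from the first handShift and only its medium changes, which mirrors the «Blei» to «STinte» case described above.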

delSpan

A delSpan element works much like a handShift element: it alters the appearance of all the text that follows, up to the anchor named by its spanTo attribute:

<delSpan spanTo="#A20_MS_215_12_3"/>
... (a lot of XML code here)
<anchor xml:id="A20_MS_215_12_3"/>

Appearance in Neo4j

  • Both the delSpan and the anchor appear as additional nodes.
  • All elements between the delSpan and the anchor element receive an additional delSpan label.
  • A delSpan attribute is added to each of these elements; its value is the xml:id attribute of the anchor.
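
The span tracking can be sketched the same way. Again, this is a minimal illustration under stated assumptions rather than the parser's actual implementation: it uses the standard library's xml.etree.ElementTree, omits the TEI namespace, and the <w> elements and the mark_delspan function are hypothetical. It only computes which anchor id each token would carry and does not create any Neo4j nodes.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample: <w> stands in for whatever becomes a Token.
SAMPLE = """<p>
<w>kept</w>
<delSpan spanTo="#A20_MS_215_12_3"/>
<w>struck</w>
<w>out</w>
<anchor xml:id="A20_MS_215_12_3"/>
<w>after</w>
</p>"""


def mark_delspan(xml_text):
    """Return (text, delSpan) pairs; delSpan is the anchor id or None."""
    active = None  # xml:id of the anchor that closes the open delSpan
    marked = []
    for elem in ET.fromstring(xml_text).iter():
        if elem.tag == "delSpan":
            # spanTo is a reference such as "#A20_MS_215_12_3"
            active = elem.get("spanTo", "").lstrip("#")
        elif elem.tag == "anchor":
            if active is not None and elem.get(XML_ID) == active:
                active = None  # target reached, the span ends here
        elif elem.tag == "w":
            marked.append((elem.text, active))
    return marked


marked = mark_delspan(SAMPLE)
for text, del_span in marked:
    print(text, del_span)
```

Tokens with a value other than None in the second position are the ones that would receive the delSpan label and attribute.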
