TEI (Text Encoding Initiative) parser that extracts information and stores it in a Neo4j database

TEI parser

This is a parser written in Python 3 that takes TEI-XML documents as input and writes them to a Neo4j graph database.

It makes use of the following existing libraries:

  • Beautiful Soup 4, an easy-to-use XML parser.
  • spaCy. Currently we use the German language package de_core_news_sm to parse the text.
  • Py2neo v4, a library for working with the Neo4j database.

Installation

$ pip install tei2neo
$ pip install "de_core_news_sm @https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-3.6.0/de_core_news_sm-3.6.0.tar.gz"

Synopsis

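from py2neo import Graph
from tei2neo import parse, GraphUtils

# connect to a running Neo4j instance
graph = Graph(host="localhost", user="neo4j", password="password")

# path to the TEI-XML file to import (placeholder; adjust to your own file)
file = '20_MS_221_1.xml'
doc, status, soup = parse(
    filename=file,
    start_with_tag='TEI',
    idno='20-MS-221'
)

# write the parsed document to the database in a single transaction
tx = graph.begin()
doc.save(tx)
tx.commit()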

ut = GraphUtils(graph)
paras = ut.paragraphs_for_filename('20_MS_221_1.xml')

# create unhyphenated tokens
for para in paras:
    tokens = ut.tokens_in_paragraph(para)
    ut.create_unhyphenated(tokens)

# show hyphenated text, marking each line break with ' | '
for token in ut.tokens_in_paragraph(paras[5], concatenated=0):
    if 'lb' in token.labels:
        print(' | ', end='')
    print(token.get('string',''), end='')
    print(token.get('whitespace', ''), end='')

# show the concatenated (unhyphenated) version of the text
for token in ut.tokens_in_paragraph(paras[5], concatenated=1):
    if 'lb' in token.labels:
        print(' ', end='')

    print(token.get('string',''), end='')
    print(token.get('whitespace', ''), end='')

How the parser works

A TEI document can be structured in various ways, and many elements behave very similarly. This parser therefore expects certain elements and treats each of them in a specific manner.

Elements that affect all following elements

handShift

A handShift element affects all elements that follow it, until another handShift element is encountered.

Example

From here on, everything is written in «Latein» script with a pencil (medium="Blei"):

<handShift new="#hWH" medium="Blei" script="Latein"/>

Now we switch to «Kurrent» script and use black ink (STinte):

<handShift new="#hGS" medium="STinte" script="Kurrent"/>

Appearance in Neo4j

As we have seen, a handShift element contains three attributes:

  • new="#hWH"
  • medium="Blei"
  • script="Latein"

These attributes are passed on to every Token that follows the handShift element. Attributes set earlier are not deleted: if only the medium changes from «Blei» to «STinte», all other attributes stay the same. The handShift element itself does not appear as a node in Neo4j.
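
The carry-over of handShift attributes can be illustrated with a minimal sketch. It is not the tei2neo implementation: it uses only the standard library's xml.etree.ElementTree instead of Beautiful Soup, and the <w> token elements, the sample text and the function name propagate_hand_shift are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second handShift changes only the medium.
SAMPLE = """<p>
  <handShift new="#hWH" medium="Blei" script="Latein"/>
  <w>erstes</w>
  <handShift medium="STinte"/>
  <w>zweites</w>
</p>"""


def propagate_hand_shift(root):
    """Return one dict per <w> element, carrying the handShift state in effect."""
    state = {}   # attributes of all handShift elements seen so far
    tokens = []
    # iter() walks the tree in document order
    for elem in root.iter():
        if elem.tag == "handShift":
            # new values overwrite old ones; untouched attributes stay as they are
            state.update(elem.attrib)
        elif elem.tag == "w":
            tokens.append({"string": elem.text, **state})
    return tokens


tokens = propagate_hand_shift(ET.fromstring(SAMPLE))
for token in tokens:
    print(token)
```

In this sketch the second token keeps new="#hWH" and script="Latein" from the first handShift, while its medium is «STinte». No entry is produced for the handShift elements themselves, mirroring the fact that they do not become nodes.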

delSpan

A delSpan element works much like a handShift element, in that it affects all text that follows until the anchor referenced by its spanTo attribute is reached:

<delSpan spanTo="#A20_MS_215_12_3"/>
... (a lot of XML code here)
<anchor xml:id="A20_MS_215_12_3"/>

Appearance in Neo4j

  • Both the delSpan and the anchor appear as additional nodes.
  • All elements between the delSpan and the anchor element receive an additional delSpan label.
  • A delSpan attribute is added to each of these elements; its value is equal to the xml:id attribute of the anchor.
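
The three rules above can be sketched as follows. This is an illustration under stated assumptions, not the code tei2neo runs: it uses the standard library's xml.etree.ElementTree, represents each node as a plain dict, and the <w> elements, the sample text and the function name mark_del_span are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample: two words lie between the delSpan and its anchor.
SAMPLE = """<p>
  <w>vor</w>
  <delSpan spanTo="#A20_MS_215_12_3"/>
  <w>gestrichen</w>
  <w>auch</w>
  <anchor xml:id="A20_MS_215_12_3"/>
  <w>nach</w>
</p>"""


def mark_del_span(root):
    """Return one dict per child element, marking those inside a delSpan range."""
    nodes = []
    target = None  # xml:id of the anchor that closes the currently open delSpan
    for elem in root.iter():
        if elem is root:
            continue
        node = {"labels": {elem.tag}, "string": elem.text or ""}
        if elem.tag == "delSpan":
            # spanTo is a reference such as "#id"; drop the leading '#'
            target = elem.get("spanTo", "").lstrip("#")
        elif elem.tag == "anchor" and elem.get(XML_ID) == target:
            target = None  # the span ends here
        elif target is not None:
            node["labels"].add("delSpan")
            node["delSpan"] = target
        nodes.append(node)  # delSpan and anchor are kept as nodes as well
    return nodes


nodes = mark_del_span(ET.fromstring(SAMPLE))
for node in nodes:
    print(sorted(node["labels"]), node.get("delSpan"))
```

Only the two words between the delSpan and the anchor receive the extra label and the delSpan attribute; the words before and after the range are left unchanged.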
