TEI (Text Encoding Initiative) parser to extract information and store it in Neo4j database

Project description

TEI parser

This is a parser written in Python 3 that takes TEI-XML documents as input and writes them to a Neo4j graph database.

It makes use of the following existing libraries:

Synopsis

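from py2neo import Graph
from tei2neo import parse, GraphUtils

# connect to a running Neo4j instance (Graph is provided by py2neo)
graph = Graph(host="localhost", user="neo4j", password="password")

file = '20_MS_221_1.xml'  # path to the TEI-XML document to import
doc, status, soup = parse(
    filename=file,
    start_with_tag='TEI',
    idno='20-MS-221'
)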
tx = graph.begin()
doc.save(tx)
tx.commit()

ut = GraphUtils(graph)
paras = ut.paragraphs_for_filename('20_MS_221_1.xml')

# create unhyphenated tokens
for para in paras:
    tokens = ut.tokens_in_paragraph(para)
    ut.create_unhyphenated(tokens)
    
# show hyphenated text, marking each line break with ' | '
for token in ut.tokens_in_paragraph(paras[5], concatenated=0):
    if 'lb' in token.labels:
        print(' | ', end='')
    print(token.get('string',''), end='')
    print(token.get('whitespace', ''), end='')
    
# show concatenated (unhyphenated) version of the text
for token in ut.tokens_in_paragraph(paras[5], concatenated=1):
    if 'lb' in token.labels:
        print(' ', end='')

    print(token.get('string',''), end='')
    print(token.get('whitespace', ''), end='')

How the parser works

A TEI document can be structured in many ways, and many elements behave very similarly. This parser therefore expects certain elements and handles each of them in a specific way.

Elements that affect all following elements

handShift

A handShift element affects all elements that follow it, until another handShift element is encountered.

Example

From here on, everything is written in «Latein» script with a pencil (medium="Blei"):

<handShift new="#hWH" medium="Blei" script="Latein"/>

Now the text switches to «Kurrent» script and black ink (medium="STinte"):

<handShift new="#hGS" medium="STinte" script="Kurrent"/>

Appearance in Neo4j

As we have seen, a handShift element contains three attributes:

  • new="#hWH"
  • medium="Blei"
  • script="Latein"

These attributes are passed to every Token element that follows the handShift. Earlier values are not discarded: if only the medium changes from «Blei» to «STinte», all other attributes stay the same. The handShift element itself does not appear as a node in Neo4j.
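
The carry-over behaviour described above can be sketched in a few lines. This is only an illustration of the idea, not the code tei2neo actually uses: the function name tokens_with_hand, the namespace-free sample and the use of <w> as the token element are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free sample; <w> stands in for a token (assumption).
SAMPLE = """<p>
  <handShift new="#hWH" medium="Blei" script="Latein"/>
  <w>Erste</w>
  <handShift medium="STinte"/>
  <w>Zweite</w>
</p>"""


def tokens_with_hand(xml_text):
    """Return (text, attributes) pairs; handShift itself yields no entry."""
    current = {}   # running handShift state, carried from token to token
    result = []
    for elem in ET.fromstring(xml_text).iter():   # document order
        if elem.tag == 'handShift':
            # overwrite only the attributes that are given, keep the rest
            current.update(elem.attrib)
        elif elem.tag == 'w':
            result.append((elem.text, dict(current)))
    return result


tokens = tokens_with_hand(SAMPLE)
for text, attrs in tokens:
    print(text, attrs)
```

In this sketch the second token keeps new="#hWH" and script="Latein" from the first handShift, and only its medium changes to "STinte".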

delSpan

A delSpan element works much like a handShift element: it marks all following text as deleted until the anchor named in its spanTo attribute is reached:

<delSpan spanTo="#A20_MS_215_12_3"/>
... (a lot of XML code here)
<anchor xml:id="A20_MS_215_12_3"/>

Appearance in Neo4j

  • Both the delSpan and the anchor appear as additional nodes.
  • All elements between the delSpan and the anchor element receive an additional delSpan label.
  • A delSpan attribute is added to each of these elements; its value equals the xml:id attribute of the anchor.
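
The three rules above can be sketched as follows. Again, this is a rough illustration under assumptions, not the parser's own implementation: mark_del_spans, the dictionary layout of a node and the flat, namespace-free sample are all made up for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_ID = '{http://www.w3.org/XML/1998/namespace}id'

# Simplified sample; <w> stands in for a token (assumption).
SAMPLE = """<p>
  <w>vor</w>
  <delSpan spanTo="#A20_MS_215_12_3"/>
  <w>gestrichen</w>
  <w>auch</w>
  <anchor xml:id="A20_MS_215_12_3"/>
  <w>nach</w>
</p>"""


def mark_del_spans(xml_text):
    """Return one dict per child element, marking those inside a delSpan."""
    active = None   # xml:id of the anchor that closes the open delSpan
    nodes = []
    for elem in ET.fromstring(xml_text):   # direct children, document order
        node = {'tag': elem.tag, 'text': elem.text, 'labels': [elem.tag]}
        if elem.tag == 'delSpan':
            active = elem.get('spanTo').lstrip('#')   # "#id" -> "id"
        elif elem.tag == 'anchor' and elem.get(XML_ID) == active:
            active = None                             # span ends here
        elif active is not None:
            node['labels'].append('delSpan')          # additional label
            node['delSpan'] = active                  # value = anchor's xml:id
        nodes.append(node)   # delSpan and anchor are kept as nodes, too
    return nodes


nodes = mark_del_spans(SAMPLE)
for node in nodes:
    print(node)
```

Only the two elements between the delSpan and its anchor carry the extra label and attribute; the elements before and after the span are left untouched.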

Elements that affect all contained elements

del

add

rs
