TEI (Text Encoding Initiative) parser to extract information and store it in a Neo4j database
TEI parser
This is a parser written in Python 3 that takes TEI-XML documents as input and writes them to a Neo4j graph database.
It makes use of the following existing libraries:
- Beautiful Soup 4: an easy-to-use XML parser.
- spaCy: currently we use the German language package de_core_news_sm to parse the text.
- Py2neo v4: a library to work with the Neo4j database.
Installation
$ virtualenv venv
$ source venv/bin/activate
$ pip install -e TEIParse
$ python -m spacy download de_core_news_sm
$ pip install ../semper-backend # for the GraphUtils class
Synopsis
from py2neo import Graph
from tei2neo import parse
from semper_backend.utils import GraphUtils

graph = Graph(host="localhost", user="neo4j", password="password")

# path to the TEI-XML document to import (example file name)
file = '20_MS_221_1.xml'
doc, status, soup = parse(
    filename=file,
    start_with_tag='TEI',
    idno='20-MS-221'
)

# write the parsed document to Neo4j in a single transaction
tx = graph.begin()
doc.save(tx)
tx.commit()

ut = GraphUtils(graph)
paras = ut.paragraphs_for_filename('20_MS_221_1.xml')

# create unhyphenated tokens
for para in paras:
    tokens = ut.tokens_in_paragraph(para)
    ut.create_unhyphenated(tokens)

# show hyphenated text, marking each line break with ' | '
for token in ut.tokens_in_paragraph(paras[5], concatenated=0):
    if 'lb' in token.labels:
        print(' | ', end='')
    print(token.get('string', ''), end='')
    print(token.get('whitespace', ''), end='')

# show concatenated (unhyphenated) version of the text
for token in ut.tokens_in_paragraph(paras[5], concatenated=1):
    if 'lb' in token.labels:
        print(' ', end='')
    print(token.get('string', ''), end='')
    print(token.get('whitespace', ''), end='')
How the parser works
A TEI document can be constructed in various ways, and many elements work very similarly. This parser therefore expects certain elements and treats them in a specific manner.
Elements that affect all following elements
handShift
A handShift element affects all elements below it, until another handShift element is encountered.
Example
From now on everything is written in «Latein» and a pencil is being used (medium=Blei):
<handShift new="#hWH" medium="Blei" script="Latein"/>
Now we switch to «Kurrent» script and use black ink (STinte):
<handShift new="#hGS" medium="STinte" script="Kurrent"/>
Appearance in Neo4j
As we have seen, a handShift element contains three attributes:
- new="#hWH"
- medium="Blei"
- script="Latein"
These attributes are passed to all Token elements that follow a handShift. Previous attributes are not deleted, i.e. if only the medium changes from «Blei» to «STinte», all other attributes stay the same.
The handShift element will not appear as a node in Neo4j.
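The inheritance rule described above can be illustrated with a minimal sketch. It is not the parser's actual implementation: it uses the standard library's xml.etree.ElementTree instead of Beautiful Soup, splits text on whitespace instead of using spaCy, and the function name tokens_with_hand and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second handShift only changes the medium.
SAMPLE = (
    '<p><handShift new="#hWH" medium="Blei" script="Latein"/>erstes Wort '
    '<handShift medium="STinte"/>zweites Wort</p>'
)


def tokens_with_hand(xml_string):
    """Return one dict per word, carrying the handShift attributes in effect."""
    root = ET.fromstring(xml_string)
    state = {}   # attributes of the most recent handShift elements, merged
    tokens = []

    def emit(text):
        # Naive whitespace tokenisation (assumption; the real parser uses spaCy).
        for word in (text or "").split():
            tokens.append({"string": word, **state})

    def walk(elem):
        if elem.tag == "handShift":
            # Update only the attributes that are given; keep all others.
            state.update(elem.attrib)
        emit(elem.text)
        for child in elem:
            walk(child)
            emit(child.tail)  # text that follows the child element

    walk(root)
    return tokens


tokens = tokens_with_hand(SAMPLE)
for token in tokens:
    print(token)
```

In this sketch the words after the second handShift keep new="#hWH" and script="Latein" and only their medium switches to «STinte», while no entry is created for the handShift elements themselves.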
delSpan
A delSpan element works much like a handShift element, as it alters the appearance of all the following text until it reaches its spanTo target:
<delSpan spanTo="#A20_MS_215_12_3"/>
... (a lot of XML code here)
<anchor xml:id="A20_MS_215_12_3"/>
Appearance in Neo4j
- both the delSpan and the anchor appear as additional nodes.
- all elements between the delSpan and the anchor element receive an additional delSpan label.
- a delSpan attribute is added to every element; its value is equal to the xml:id attribute of the anchor.
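As a rough sketch of the rules above, the following walks a document in document order and marks everything between a delSpan and its anchor. It is an illustration under assumptions, not the parser's own code: it uses the standard library's xml.etree.ElementTree, and the function name mark_del_spans, the sample XML and the dictionary layout of the result are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample document.
SAMPLE = """<div>
  <w>kept</w>
  <delSpan spanTo="#A20_MS_215_12_3"/>
  <w>deleted</w>
  <p><w>also deleted</w></p>
  <anchor xml:id="A20_MS_215_12_3"/>
  <w>kept again</w>
</div>"""


def mark_del_spans(xml_string):
    """Collect the elements lying between a delSpan and its target anchor."""
    root = ET.fromstring(xml_string)
    active_target = None  # xml:id of the anchor that closes the open span
    marked = []
    for elem in root.iter():  # iter() yields elements in document order
        if elem.tag == "delSpan":
            active_target = elem.get("spanTo", "").lstrip("#")
        elif elem.tag == "anchor" and elem.get(XML_ID) == active_target:
            active_target = None  # span ends at its anchor
        elif active_target is not None:
            marked.append({
                "tag": elem.tag,
                "text": (elem.text or "").strip(),
                "labels": [elem.tag, "delSpan"],  # additional delSpan label
                "delSpan": active_target,         # equals the anchor's xml:id
            })
    return marked


marked = mark_del_spans(SAMPLE)
for entry in marked:
    print(entry)
```

The sketch only collects the affected elements; in the graph, the delSpan and the anchor would additionally be stored as nodes of their own, as described in the list above.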