TEI (Text Encoding Initiative) parser to extract information and store it in a Neo4j database
TEI parser
This is a parser written in Python 3 that takes TEI-XML documents as input and writes them to a Neo4j graph database.
It makes use of the following existing libraries:
- Beautiful Soup 4, an easy-to-use XML parser.
- spaCy. Currently we use the German language package de_core_news_sm to parse the text.
- Py2neo v4, which is a library to work with the Neo4j database.
Installation
$ virtualenv venv
$ source venv/bin/activate
$ pip install -e TEIParse
$ python -m spacy download de_core_news_sm
$ pip install ../semper-backend # for the GraphUtils class
Synopsis
from py2neo import Graph
from tei2neo import parse
from semper_backend.utils import GraphUtils

graph = Graph(host="localhost", user="neo4j", password="password")

# file is the path to the TEI-XML document to import
doc, status, soup = parse(
    filename=file,
    start_with_tag='TEI',
    idno='20-MS-221'
)

# write the parsed document to Neo4j in a single transaction
tx = graph.begin()
doc.save(tx)
tx.commit()

ut = GraphUtils(graph)
paras = ut.paragraphs_for_filename('20_MS_221_1.xml')

# create unhyphenated tokens
for para in paras:
    tokens = ut.tokens_in_paragraph(para)
    ut.create_unhyphenated(tokens)

# show hyphenated text, marking every line break with ' | '
for token in ut.tokens_in_paragraph(paras[5], concatenated=0):
    if 'lb' in token.labels:
        print(' | ', end='')
    print(token.get('string', ''), end='')
    print(token.get('whitespace', ''), end='')

# show concatenated (non-hyphenated) version of the text
for token in ut.tokens_in_paragraph(paras[5], concatenated=1):
    if 'lb' in token.labels:
        print(' ', end='')
    print(token.get('string', ''), end='')
    print(token.get('whitespace', ''), end='')
How the parser works
A TEI document can be constructed in various ways, and many elements work very similarly. This parser therefore expects certain elements and treats them in a specific manner.
Elements that affect all following elements
handShift
A handShift element affects all elements that follow it, until another handShift element is encountered.
Example
From now on everything is written in «Latein» and a pencil is being used (medium=Blei):
<handShift new="#hWH" medium="Blei" script="Latein"/>
Now we switch to «Kurrent» script and use black ink (STinte):
<handShift new="#hGS" medium="STinte" script="Kurrent"/>
Appearance in Neo4j
As we have seen, a handShift element contains three attributes:
- new="#hWH"
- medium="Blei"
- script="Latein"
These attributes are passed to all Token elements that follow after a handShift occurs. Previous attributes are not deleted, i.e. if only the medium changes from «Blei» to «STinte», all other attributes stay the same.
The handShift element will not appear as a node in Neo4j.
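The carry-forward rule described above can be illustrated with a minimal sketch. This is not the parser's actual implementation: it uses the standard library's xml.etree.ElementTree instead of Beautiful Soup so that it runs without extra packages, and the <w> elements, the SAMPLE document and the function tokens_with_hand are hypothetical stand-ins for the tokens the real parser produces.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: <w> stands in for a token; real input is tokenized by the parser.
SAMPLE = """<p>
  <handShift new="#hWH" medium="Blei" script="Latein"/>
  <w>alpha</w>
  <handShift medium="STinte"/>
  <w>beta</w>
  <handShift new="#hGS" script="Kurrent"/>
  <w>gamma</w>
</p>"""


def tokens_with_hand(xml_text):
    """Return one dict per <w> element, carrying the handShift state in effect."""
    state = {}   # attributes set by the handShift elements seen so far
    tokens = []
    # iter() walks the tree in document order, so state always reflects
    # every handShift that precedes the current element.
    for elem in ET.fromstring(xml_text).iter():
        if elem.tag == "handShift":
            # Merge instead of replace: attributes not mentioned keep their value.
            state.update(elem.attrib)
        elif elem.tag == "w":
            tokens.append({"string": elem.text, **state})
    return tokens


for token in tokens_with_hand(SAMPLE):
    print(token)
```

In this sample the second handShift changes only the medium, so the token «beta» keeps new="#hWH" and script="Latein" from the first one. No entry is emitted for the handShift elements themselves, which mirrors the fact that they do not become nodes.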
delSpan
A delSpan element works much like a handShift element, as it alters the appearance of all the following text until it reaches its spanTo target:
<delSpan spanTo="#A20_MS_215_12_3"/>
... (a lot of XML code here)
<anchor xml:id="A20_MS_215_12_3"/>
Appearance in Neo4j
- both the delSpan and the anchor appear as additional nodes.
- all elements between the delSpan and the anchor element receive an additional delSpan label.
- a delSpan attribute is added to every element; its value is equal to the xml:id attribute of the anchor.
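The span-marking behaviour can be sketched in the same way. Again this is only an illustration under stated assumptions, not the parser's code: it uses xml.etree.ElementTree from the standard library, and the <w> elements, the SAMPLE document and the function mark_del_spans are hypothetical. The sketch only tracks token-like elements; it does not create the additional delSpan and anchor nodes.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample: <w> stands in for a token.
SAMPLE = """<p>
  <w>kept</w>
  <delSpan spanTo="#A20_MS_215_12_3"/>
  <w>struck</w>
  <w>out</w>
  <anchor xml:id="A20_MS_215_12_3"/>
  <w>after</w>
</p>"""


def mark_del_spans(xml_text):
    """Return one dict per <w>, labelled if it lies between a delSpan and its anchor."""
    open_target = None   # xml:id of the anchor that will close the current delSpan
    tokens = []
    for elem in ET.fromstring(xml_text).iter():
        if elem.tag == "delSpan":
            # spanTo is a reference such as "#A20_MS_215_12_3"; drop the leading "#".
            open_target = elem.get("spanTo", "").lstrip("#")
        elif elem.tag == "anchor":
            # Only the anchor named by spanTo ends the span.
            if open_target is not None and elem.get(XML_ID) == open_target:
                open_target = None
        elif elem.tag == "w":
            token = {"string": elem.text, "labels": ["w"]}
            if open_target is not None:
                token["labels"].append("delSpan")
                token["delSpan"] = open_target
            tokens.append(token)
    return tokens


for token in mark_del_spans(SAMPLE):
    print(token)
```

Only «struck» and «out» receive the delSpan label and the delSpan attribute with the value A20_MS_215_12_3; the tokens before the delSpan and after the anchor are left unchanged.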