Transform TEI XML to a simple standoff format
Project description
Flatten Tei
Reformat TEI XML files into raw text plus standoff annotations in JSON (flatdoc).
- flatdoc is not a standardized format.
- flatdoc is a JSON file containing the whole text of a document in the text field.
- All span annotations are in the annotations field, in the form of an object.
  - e.g. {"Sentence": [{"begin": 0, "end": 13}, ...], ...}
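The layout described above can be illustrated with a small hand-built document. This is only a sketch: the sample text, the offsets, and the helper `span_texts` are hypothetical and not part of flattentei, and it assumes that `begin` and `end` are character offsets into `text` with an exclusive `end`, which the description does not state explicitly.

```python
# Hypothetical flatdoc, written by hand for illustration only.
flatdoc = {
    "text": "Hello world. We used NumPy.",
    "annotations": {
        "Sentence": [
            {"begin": 0, "end": 12},
            {"begin": 13, "end": 27},
        ],
    },
}


def span_texts(doc, layer):
    """Return the text covered by each span of one annotation layer.

    Assumption: offsets are character positions with an exclusive end.
    """
    text = doc["text"]
    spans = doc["annotations"].get(layer, [])
    return [text[span["begin"]:span["end"]] for span in spans]


sentences = span_texts(flatdoc, "Sentence")
print(sentences)
```

Because the annotations only point into the text, the raw text stays untouched and any number of annotation layers can be stored side by side.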
Access content of flatdoc files
Use Case: Get all Sentences of a document in flatdoc-format
- Assuming the document has Sentence annotations.
import json

from flattentei import get_units

fn = "path/to/flatdoc.json"  # replace with the filename of your flatdoc JSON file
with open(fn) as f:
    flatdoc = json.load(f)
sentences = get_units("Sentence", flatdoc)
Use Case: Get all Entities of a document in flatdoc-format
- Assuming the entities are stored as Entity in the annotations field (in the GSAP project: ScholarlyEntitiy).
- Enrich each entity with Sentence texts.
  - They can be found in the containers field of each entity.
import json

from flattentei import get_units

fn = "path/to/flatdoc.json"  # replace with the filename of your flatdoc JSON file
with open(fn) as f:
    flatdoc = json.load(f)
# enrich_container attaches the enclosing Sentence to each entity
entities = get_units("Entity", flatdoc, enrich_container="Sentence")
for ent in entities:
    print(f'The entity span: {ent["text"]}')
    sentence_text = ent['containers']['Sentence']['text']
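To make the idea of container enrichment concrete, the following sketch pairs each entity span with the sentence span that covers it. It is not flattentei's implementation: the function `attach_containers` and the sample document are hypothetical, and the sketch assumes character offsets with an exclusive `end` and at most one enclosing sentence per entity.

```python
def attach_containers(doc, unit_layer, container_layer):
    """Pair each unit span with the first container span that fully covers it."""
    text = doc["text"]
    containers = doc["annotations"].get(container_layer, [])
    units = []
    for span in doc["annotations"].get(unit_layer, []):
        # Copy the span and add the text it covers.
        unit = {**span, "text": text[span["begin"]:span["end"]], "containers": {}}
        for cont in containers:
            # A container qualifies if the unit lies entirely inside it.
            if cont["begin"] <= span["begin"] and span["end"] <= cont["end"]:
                unit["containers"][container_layer] = {
                    **cont,
                    "text": text[cont["begin"]:cont["end"]],
                }
                break
        units.append(unit)
    return units


# Hypothetical flatdoc, written by hand for illustration only.
flatdoc = {
    "text": "Hello world. We used NumPy.",
    "annotations": {
        "Sentence": [{"begin": 0, "end": 12}, {"begin": 13, "end": 27}],
        "Entity": [{"begin": 21, "end": 26}],
    },
}

entities = attach_containers(flatdoc, "Entity", "Sentence")
for ent in entities:
    print(ent["text"], "|", ent["containers"]["Sentence"]["text"])
```

The result mirrors the access pattern used above (`ent["text"]` and `ent["containers"]["Sentence"]["text"]`), so the sketch can serve as a mental model when reading enriched entities.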
Download files
Source Distribution
flattentei-0.1.7.tar.gz (9.5 kB)
File details
Details for the file flattentei-0.1.7.tar.gz.
File metadata
- Download URL: flattentei-0.1.7.tar.gz
- Upload date:
- Size: 9.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 98835ee9b75173075c74dc45b75b00e786d105f59dacf7d7d8f058d926ead680 |
| MD5 | 2e24a78df8cc916e7958737bcf6c6ec0 |
| BLAKE2b-256 | 0f8df8ba573291f106cd9eae4898aba7e4087f1283ff23c5a57dcf268a81c3a4 |