Transform TEI XML to a simple standoff format
Flatten Tei
Reformat TEI XML files to raw text + standoff annotations in JSON (flatdoc)
- flatdoc is not a standardized format
- flatdoc is a JSON file containing the whole text of a document in the `text` field
- All span annotations are stored in `annotations` as an object that maps each annotation type to a list of spans.
  - e.g. {"Sentence": [{"begin": 0, "end": 13}, ...], ...}
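To make the layout concrete, here is a minimal sketch of a flatdoc object built by hand and read with plain Python, without the flattentei package. The sample text, the offsets, and the helper `span_texts` are made up for illustration, and the sketch assumes that `end` is exclusive, as in Python slicing.

```python
import json

# Hypothetical flatdoc content; text and offsets are invented for this example.
flatdoc = {
    "text": "Hello, world. This is flatdoc.",
    "annotations": {
        "Sentence": [
            {"begin": 0, "end": 13},
            {"begin": 14, "end": 30},
        ]
    },
}

# Round-trip through JSON to mimic reading a flatdoc file from disk.
flatdoc = json.loads(json.dumps(flatdoc))


def span_texts(doc, unit):
    """Return the covered text for every span of one annotation type.

    Assumption: `end` is exclusive, so text[begin:end] is the covered span.
    Returns an empty list if the annotation type is absent.
    """
    text = doc["text"]
    return [text[a["begin"]:a["end"]] for a in doc["annotations"].get(unit, [])]


sentence_texts = span_texts(flatdoc, "Sentence")
```

Because the annotations only store offsets, the text itself is never duplicated or altered; every span is recovered by slicing the `text` field.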
Access content of flatdoc files
Use Case: Get all Sentences of a document in flatdoc-format
- Assuming the document contains Sentence annotations.
import json
from flattentei import get_units

fn = "<filename of flatdoc json file>"  # path to a flatdoc JSON file
with open(fn) as f:
    flatdoc = json.load(f)
# Collect all Sentence annotations of the document
sentences = get_units("Sentence", flatdoc)
Use Case: Get all Entities of a document in flatdoc-format
- Assuming the entities are stored as `Entity` in the `annotations` field
  - (In the GSAP project: `ScholarlyEntitiy`)
- Enrich each entity with the text of its enclosing `Sentence`
  - The sentence can be found in the `containers` field of each entity
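import json
from flattentei import get_units

fn = "<filename of flatdoc json file>"  # path to a flatdoc JSON file
with open(fn) as f:
    flatdoc = json.load(f)
# Collect all entities and attach the enclosing sentence to each one
entities = get_units("Entity", flatdoc, enrich_container="Sentence")
for ent in entities:
    print(f'The entity span: {ent["text"]}')
    # Text of the sentence that contains the entity
    sentence_text = ent['containers']['Sentence']['text']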
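The idea behind container enrichment can be sketched in plain Python. This is not the implementation used by flattentei; the function `attach_containers`, the sample document, and the exclusive-`end` convention are assumptions made only to illustrate how an entity span can be matched to the sentence span that encloses it.

```python
def attach_containers(doc, unit, container):
    """Hypothetical helper: pair each `unit` span with its enclosing `container` span.

    A container encloses a span when it starts at or before the span's begin
    and ends at or after the span's end. Offsets are assumed end-exclusive.
    """
    text = doc["text"]
    container_spans = doc["annotations"].get(container, [])
    enriched = []
    for span in doc["annotations"].get(unit, []):
        # Copy the span and add its covered text.
        item = dict(span, text=text[span["begin"]:span["end"]])
        for c in container_spans:
            if c["begin"] <= span["begin"] and span["end"] <= c["end"]:
                covered = text[c["begin"]:c["end"]]
                item["containers"] = {container: dict(c, text=covered)}
                break  # keep the first enclosing container only
        enriched.append(item)
    return enriched


# Invented sample document for illustration.
doc = {
    "text": "We used NumPy. It was fast.",
    "annotations": {
        "Sentence": [{"begin": 0, "end": 14}, {"begin": 15, "end": 27}],
        "Entity": [{"begin": 8, "end": 13}],
    },
}

enriched = attach_containers(doc, "Entity", "Sentence")
entity = enriched[0]
```

The linear scan over all containers is fine for short documents; for long ones, a sorted list with binary search would avoid the quadratic cost.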
Download files
Source Distribution
flattentei-0.1.3.tar.gz (7.8 kB)
File details
Details for the file flattentei-0.1.3.tar.gz.
File metadata
- Download URL: flattentei-0.1.3.tar.gz
- Upload date:
- Size: 7.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e6768ade5c36a7a959ec81d5087bcc45d89e100bd97e664e04d0b1bfcf0fb478 |
| MD5 | d68e9571b538a199a237f306a106ad43 |
| BLAKE2b-256 | 97de5bb70ce8e70a5b69546e7b95f23b0f207e278921c9c02d4a35164a101f4c |