Transform TEI XML to a simple standoff format
Flatten Tei
Reformat TEI XML files into raw text plus standoff annotations in JSON (flatdoc).
- `flatdoc` is not a standardized format
- `flatdoc` is a JSON file containing the whole text of a document in the `text` field
- All span annotations are in `annotations` in the form of an object.
  - e.g. `{"Sentence": [{"begin": 0, "end": 13}, ...], ...}`
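To make the layout concrete, here is a minimal sketch of a flatdoc-like dictionary and of how span offsets map back to the text. The sample text, the `span_text` helper, and the assumption that `end` is exclusive (Python slice semantics) are illustrative only and are not confirmed by the package.

```python
import json

# Hypothetical flatdoc content; real files come from the TEI conversion.
flatdoc = {
    "text": "Hello world. Second sentence here.",
    "annotations": {
        "Sentence": [
            {"begin": 0, "end": 12},
            {"begin": 13, "end": 34},
        ]
    },
}


def span_text(doc, span):
    """Return the text covered by a span, assuming an exclusive end offset."""
    return doc["text"][span["begin"]:span["end"]]


# Resolve every Sentence span to the text it covers.
sentence_texts = [span_text(flatdoc, s) for s in flatdoc["annotations"]["Sentence"]]
print(json.dumps(sentence_texts))
```

Because the annotations only store offsets, the text itself is never duplicated or modified, which is the main appeal of a standoff format.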
Access content of flatdoc files
Use Case: Get all Sentences of a document in flatdoc-format
- Assuming there are `Sentence` annotations.
import json
from flattentei import get_units

fn = "<filename of flatdoc json file>"  # placeholder path
with open(fn) as f:
    flatdoc = json.load(f)
# Collect all Sentence units from the document
sentences = get_units("Sentence", flatdoc)
Use Case: Get all Entities of a document in flatdoc-format
- Assuming the entities are stored as `Entity` in the `annotations` field
  - (In the GSAP project: `ScholarlyEntitiy`)
- Enrich each entity with `Sentence` texts
  - They can be found in the `containers` field of each entity
import json
from flattentei import get_units

fn = "<filename of flatdoc json file>"  # placeholder path
with open(fn) as f:
    flatdoc = json.load(f)
# Ask for each entity to carry its enclosing Sentence
entities = get_units("Entity", flatdoc, enrich_container="Sentence")
for ent in entities:
    print(f'The entity span: {ent["text"]}')
    sentence_text = ent['containers']['Sentence']['text']
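For intuition about what container enrichment involves, the sketch below attaches the enclosing sentence to each entity by comparing offsets. It is a hand-written illustration, not the `flattentei` implementation: the helper `attach_container`, the sample data, and the exclusive `end` convention are all assumptions. Only the `containers` key mirrors the access pattern used in the example.

```python
def attach_container(doc, unit_type, container_type):
    """Return units of unit_type, each with its text and enclosing container.

    Hypothetical helper; assumes spans use an exclusive end offset.
    """
    text = doc["text"]
    containers = doc["annotations"].get(container_type, [])
    enriched = []
    for span in doc["annotations"].get(unit_type, []):
        unit = dict(span, text=text[span["begin"]:span["end"]])
        # Pick the first container whose offsets fully cover the unit.
        for cont in containers:
            if cont["begin"] <= span["begin"] and span["end"] <= cont["end"]:
                unit["containers"] = {
                    container_type: dict(cont, text=text[cont["begin"]:cont["end"]])
                }
                break
        enriched.append(unit)
    return enriched


# Made-up document with two sentences and one entity.
doc = {
    "text": "We used NumPy. It was fast.",
    "annotations": {
        "Sentence": [{"begin": 0, "end": 14}, {"begin": 15, "end": 27}],
        "Entity": [{"begin": 8, "end": 13}],
    },
}
entities = attach_container(doc, "Entity", "Sentence")
```

A unit that no container fully covers simply ends up without a `containers` key in this sketch; how the package handles that case is not stated here.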
Download files
Source Distribution
flattentei-0.1.4.tar.gz (8.5 kB)
File details
Details for the file flattentei-0.1.4.tar.gz.
File metadata
- Download URL: flattentei-0.1.4.tar.gz
- Upload date:
- Size: 8.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5e2e5aebd2dccffe072b57a67781d014cf7001e51ee33b4e2fc168dff7845565 |
| MD5 | 5e15acdf8f3bd002211442b32540db85 |
| BLAKE2b-256 | a082a99c75c344bee26c215d8e121e72c9197e6b3feab417d1849122dfc6661e |