Transform TEI XML to a simple standoff format
Project description
Flatten Tei
Reformat TEI XML files to raw text + standoff annotations in JSON (flatdoc).
- `flatdoc` is not a standardized format
- `flatdoc` is a JSON file containing the whole text of a document in the `text` field
- All span annotations are in the `annotations` field, in the form of an object.
  - e.g. `{"Sentence": [{"begin": 0, "end": 13}, ...], ...}`
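To make the layout described above concrete, here is a minimal sketch of a flatdoc-like object and of how a span maps back to the text. Only the `text` and `annotations` field names and the `begin`/`end` keys come from the description; the sample content, the helper `span_text`, and the assumption that offsets are end-exclusive character positions are illustrative and not confirmed by the package.

```python
# Hypothetical flatdoc-like dict; the sample text and offsets are made up.
flatdoc = {
    "text": "Hello world. Second sentence here.",
    "annotations": {
        "Sentence": [
            {"begin": 0, "end": 12},
            {"begin": 13, "end": 34},
        ]
    },
}


def span_text(doc, span):
    """Return the substring covered by a span.

    Assumption: 'begin' and 'end' are character offsets, with 'end' exclusive.
    """
    return doc["text"][span["begin"]:span["end"]]


sentence_texts = [span_text(flatdoc, s) for s in flatdoc["annotations"]["Sentence"]]
print(sentence_texts)
```

Because the annotations only store offsets, the raw text stays untouched and several annotation layers can point into the same string.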
Access content of flatdoc files
Use Case: Get all Sentences of a document in flatdoc-format
- Assuming there are `Sentence` annotations.

import json
from flattentei import get_units

# Replace the placeholder with the filename of a flatdoc JSON file
fn = "<filename of flatdoc json file>"
with open(fn) as f:
    flatdoc = json.load(f)
sentences = get_units("Sentence", flatdoc)
Use Case: Get all Entities of a document in flatdoc-format
- Assuming the entities are stored as `Entity` in the `annotations` field
  - (In the GSAP project: `ScholarlyEntitiy`)
- Enrich each entity with `Sentence` texts
  - They can be found in the `containers` field of each entity

import json
from flattentei import get_units

# Replace the placeholder with the filename of a flatdoc JSON file
fn = "<filename of flatdoc json file>"
with open(fn) as f:
    flatdoc = json.load(f)
entities = get_units("Entity", flatdoc, enrich_container="Sentence")
for ent in entities:
    print(f'The entity span: {ent["text"]}')
    # Text of the sentence that contains this entity
    sentence_text = ent['containers']['Sentence']['text']
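The idea behind container enrichment can be sketched without the library: for each entity span, look for the sentence span that fully encloses it and attach that sentence's text. The helper `attach_containers` below is hypothetical and is not flattentei's implementation; it only mimics the result shape used in the example above (`ent["text"]` and `ent["containers"]["Sentence"]["text"]`), and it assumes end-exclusive character offsets.

```python
def attach_containers(units, containers, text, label):
    """Hypothetical helper: add the enclosing `label` span to each unit.

    Assumes 'begin'/'end' are character offsets with 'end' exclusive.
    """
    enriched = []
    for unit in units:
        # Copy the unit and add the text it covers
        item = dict(unit, text=text[unit["begin"]:unit["end"]])
        for c in containers:
            # A container must fully enclose the unit
            if c["begin"] <= unit["begin"] and unit["end"] <= c["end"]:
                item["containers"] = {label: dict(c, text=text[c["begin"]:c["end"]])}
                break
        enriched.append(item)
    return enriched


# Made-up sample data
text = "We used SciPy. Plots were made with Matplotlib."
sentences = [{"begin": 0, "end": 14}, {"begin": 15, "end": 47}]
entities = [{"begin": 8, "end": 13}, {"begin": 36, "end": 46}]

result = attach_containers(entities, sentences, text, "Sentence")
for ent in result:
    print(ent["text"], "|", ent["containers"]["Sentence"]["text"])
```

The nested loop is quadratic in the worst case. For long documents, sorting both span lists by `begin` and sweeping through them once would be a natural refinement.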
Download files
Source Distribution
flattentei-0.1.5.tar.gz (8.9 kB)
File details
Details for the file flattentei-0.1.5.tar.gz.
File metadata
- Download URL: flattentei-0.1.5.tar.gz
- Size: 8.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d0ca390bc4c6b9b60c9517938f595b5c9eaeb58b9f348ef3bfc4aebd687aeed5 |
| MD5 | c380e8682a9428e2bd9b1a324ec5976f |
| BLAKE2b-256 | f6e4656739896159c5c5a1980f0daf4ac5466de2fc10623919529a865695d324 |
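
A downloaded archive can be checked against the SHA256 digest listed above using only Python's standard library. This is a minimal sketch: the helper names `sha256_of` and `matches_published_hash` are illustrative, and the commented-out call assumes the archive sits in the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Return True if the file's SHA256 digest equals the published one."""
    return sha256_of(path) == expected_hex.strip().lower()


# Example call (assumes the archive was downloaded to the current directory):
# matches_published_hash(
#     "flattentei-0.1.5.tar.gz",
#     "d0ca390bc4c6b9b60c9517938f595b5c9eaeb58b9f348ef3bfc4aebd687aeed5",
# )
```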