Transform TEI XML to a simple standoff format
flattentei
Convert TEI XML documents to plain text with standoff annotations — a simple, pipeline-friendly format for NLP workflows.
What it does
flattentei reads TEI XML files and produces:
- A plain text string of the full document
- Span annotations (begin/end offsets into the text) for structural elements like paragraphs, sentences, section headings, references, and figures
- Structured metadata (authors, DOI, journal, affiliations, ORCID, …)
- A list of figures with captions
- Typed Doc objects that support sentence splitting, span lookup, and relation attachment for downstream NLP pipelines
The standoff JSON format (flatdoc) keeps text and annotations strictly separated, which makes it easy to feed into annotation tools, relation extraction pipelines, or fine-tuning workflows.
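To make the separation concrete, here is a minimal sketch of a standoff record built by hand. The "text" and "annotations" keys and the begin/end offsets follow the description above; the sample text, the grouping of annotations by span type, and the helper name `span_text` are assumptions for illustration, not actual flattentei output.

```python
# Hand-built toy record (assumed layout); real flatdoc files may carry more fields.
flatdoc = {
    "text": "Detuned Resonances\nWe study wave triads.",
    "annotations": {
        "Head": [{"begin": 0, "end": 18}],
        "Paragraph": [{"begin": 19, "end": 40}],
    },
}


def span_text(doc: dict, span: dict) -> str:
    """Recover the annotated text by slicing with the standoff offsets."""
    return doc["text"][span["begin"]:span["end"]]


# The text is never modified; every annotation only points into it.
for span_type, spans in flatdoc["annotations"].items():
    for span in spans:
        print(span_type, repr(span_text(flatdoc, span)))
```

Because annotations only reference offsets, new annotation layers can be added without touching the text or the existing layers.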
Supported TEI dialects
| Dialect key | Description |
|---|---|
| "tei_wdm" | WDM / ULB Darmstadt TEI: journal articles converted from JATS |
The original GROBID-based parser (transform_xml) is still available for backwards compatibility.
Installation
pip install flattentei
Requires Python ≥ 3.10.
Quick start
Parse a WDM TEI file
import flattentei
doc = flattentei.parse_xml("article.xml", dialect="tei_wdm")
# unpack the three main outputs
text, annotations, metadata = doc
print(doc.doc_id) # e.g. "jz000102-0007"
print(doc.metadata["title"]) # "Detuned Resonances"
print(doc.metadata["authors"]) # [{"surname": "Colyer", "forename": "Greg", ...}, ...]
print(doc.metadata["doi"]) # "10.3390/fluids7090297"
parse_xml also accepts raw bytes:
from pathlib import Path
doc = flattentei.parse_xml(Path("article.xml").read_bytes(), dialect="tei_wdm")
Work with sentences and spans
doc.sentences returns a list of Sentence objects. If the XML contains sentence markup, that markup is used directly; otherwise NLTK sent_tokenize is applied paragraph by paragraph.
for sent in doc.sentences:
    print(sent.sentence_idx, sent.text)
    for span in sent.spans:
        # span.span_type: e.g. "ReferenceToBib", "Paragraph", …
        # span.begin / span.end: offsets within the sentence
        # span.begin_in_doc / span.end_in_doc: offsets in the full document
        print(f" [{span.span_type}] {span.text!r}")
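The paragraph-by-paragraph fallback can be pictured with the sketch below. It is not the library's implementation: a simple regular expression stands in for NLTK's sent_tokenize so the example runs without downloading a tokenizer model, and the function name `split_paragraph` and the returned dict keys are assumptions.

```python
import re

# Stand-in splitter (assumption): break after ., ! or ? followed by whitespace.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_paragraph(text: str, para_begin: int, para_end: int) -> list[dict]:
    """Split one paragraph into sentences, keeping document-level offsets."""
    paragraph = text[para_begin:para_end]
    bounds = []
    start = 0
    for match in _BOUNDARY.finditer(paragraph):
        bounds.append((start, match.start()))
        start = match.end()
    if start < len(paragraph):
        bounds.append((start, len(paragraph)))
    # Shift paragraph-relative offsets by the paragraph's start in the document.
    return [
        {"begin": para_begin + b, "end": para_begin + e, "text": paragraph[b:e]}
        for b, e in bounds
    ]


text = "Title\nFirst sentence here. Second one follows! A third?"
for sent in split_paragraph(text, 6, len(text)):
    print(sent["begin"], sent["end"], sent["text"])
```

Splitting inside paragraph boundaries means a sentence never spans two paragraphs, and the offset shift keeps every sentence addressable in the full document text.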
Export to flatdoc JSON
The to_json() method returns a dict compatible with the original flatdoc format ({"text": …, "annotations": …}):
import json
flat = doc.to_json()
# use a context manager so the file is flushed and closed
with open("article.json", "w", encoding="utf-8") as f:
    json.dump(flat, f)
Access span annotations directly
# all paragraph offsets
for para in doc.spans["Paragraph"]:
    print(doc.text[para["begin"]:para["end"]])
# all in-text citation spans with their targets
for ref in doc.spans.get("ReferenceToBib", []):
    print(ref["target"], doc.text[ref["begin"]:ref["end"]])
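As a small illustration of what the span dictionaries allow, the sketch below groups citation mentions by their target. It works on a hand-built text and spans dict shaped like doc.text and doc.spans above; the sample sentence, the target ids "b0" and "b1", and the helper name `mentions_by_target` are hypothetical.

```python
from collections import defaultdict

# Toy stand-ins for doc.text and doc.spans (assumed shapes, hypothetical ids).
text = "As shown in [1] and later in [2], and again in [1]."
spans = {
    "ReferenceToBib": [
        {"begin": 12, "end": 15, "target": "b0"},
        {"begin": 29, "end": 32, "target": "b1"},
        {"begin": 47, "end": 50, "target": "b0"},
    ]
}


def mentions_by_target(text: str, spans: dict) -> dict:
    """Collect the surface text of every citation, keyed by bibliography target."""
    grouped = defaultdict(list)
    for ref in spans.get("ReferenceToBib", []):
        grouped[ref["target"]].append(text[ref["begin"]:ref["end"]])
    return dict(grouped)


print(mentions_by_target(text, spans))
```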
Attach relations (NLP pipeline output)
A Relation connects two Span objects with an optional label and confidence score. It is designed to hold the output of entity and relation extraction models.
from flattentei import Relation
sents = doc.sentences
subj = sents[2].spans[0]
obj = sents[2].spans[1]
# attach to a sentence
sents[2].relations.append(Relation(subject=subj, object=obj, label="cites", score=0.91))
# or to the whole document
doc.relations.append(Relation(subject=subj, object=obj, label="authored_by"))
Load existing flatdoc JSON files
import json
from flattentei import get_units
with open("article.json") as f:
    flatdoc = json.load(f)
# extract sentences with their text
sentences = get_units("Sentence", flatdoc)
# extract entities enriched with the surrounding sentence text
entities = get_units("Entity", flatdoc, enrich_container=["Sentence"])
for ent in entities:
    print(ent["text"], ent["container"]["Sentence"]["text"])
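One way to picture container enrichment is an offset-containment lookup: each unit is paired with the first container whose range fully covers it. The sketch below is an assumption about the idea, not the code behind get_units; the function name `enrich_with_container` and the toy record are hypothetical, and only the output shape mirrors the usage shown above.

```python
def enrich_with_container(flatdoc: dict, unit_type: str, container_type: str) -> list[dict]:
    """Attach text to each unit and to the container span that encloses it."""
    text = flatdoc["text"]
    annotations = flatdoc["annotations"]
    units = []
    for unit in annotations.get(unit_type, []):
        enriched = dict(unit, text=text[unit["begin"]:unit["end"]])
        for cont in annotations.get(container_type, []):
            # A container encloses the unit if it covers the unit's whole range.
            if cont["begin"] <= unit["begin"] and unit["end"] <= cont["end"]:
                enriched["container"] = {
                    container_type: dict(cont, text=text[cont["begin"]:cont["end"]])
                }
                break
        units.append(enriched)
    return units


# Hand-built toy record (assumed layout).
toy = {
    "text": "Water boils at 100 C. Ice melts at 0 C.",
    "annotations": {
        "Sentence": [{"begin": 0, "end": 21}, {"begin": 22, "end": 39}],
        "Entity": [{"begin": 0, "end": 5}, {"begin": 22, "end": 25}],
    },
}

for ent in enrich_with_container(toy, "Entity", "Sentence"):
    print(ent["text"], "|", ent["container"]["Sentence"]["text"])
```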
Batch convert a folder of TEI XML files
flatten-tei-folder --source ./xml_files --target ./output
Or from Python:
from flattentei.tei_to_text_and_standoff import transform_xml_folder
from pathlib import Path
transform_xml_folder(Path("xml_files"), Path("output"))
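For pipelines that need control over the loop, a folder conversion can also be written by hand. The sketch below is a generic pattern rather than what transform_xml_folder does internally: the function name `convert_folder`, the one-JSON-file-per-input naming, and the non-recursive "*.xml" glob are assumptions, and `fake_convert` is a placeholder for a real converter.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def convert_folder(source: Path, target: Path,
                   convert: Callable[[bytes], dict]) -> list[Path]:
    """Convert every *.xml file in source and write one JSON file per input."""
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for xml_path in sorted(source.glob("*.xml")):
        out_path = target / (xml_path.stem + ".json")
        flat = convert(xml_path.read_bytes())
        out_path.write_text(json.dumps(flat, ensure_ascii=False), encoding="utf-8")
        written.append(out_path)
    return written


def fake_convert(data: bytes) -> dict:
    """Placeholder converter (hypothetical): wraps the raw bytes as text."""
    return {"text": data.decode("utf-8"), "annotations": {}}


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "xml_files"
    src.mkdir()
    (src / "a.xml").write_text("<TEI/>", encoding="utf-8")
    written = convert_folder(src, Path(tmp) / "output", fake_convert)
    print([p.name for p in written])
```

With flattentei installed, the placeholder could presumably be swapped for a function that calls parse_xml on the bytes and returns to_json(), as shown in the quick start.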
Data model
Doc
├── doc_id: str
├── text: str
├── spans: dict[str, list[dict]] # {"Paragraph": [{begin, end, idx, …}, …], …}
├── metadata: dict # title, authors, doi, journal, …
├── figures: list[dict] # id, head, label, url
├── relations: list[Relation]
└── sentences → list[Sentence] # property, computed on access
Sentence
├── doc_id, sentence_id, sentence_idx
├── text, begin_idx
├── spans: list[Span]
└── relations: list[Relation]
Span
├── doc_id, text, span_type
├── begin, end # relative to parent container
└── begin_in_doc, end_in_doc
Relation
├── subject: Span
├── object: Span
├── label: str | None
└── score: float | None
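The Span and Relation parts of the tree can be mirrored with plain dataclasses to experiment with the data model without installing the package. These are stand-ins with the field names listed above, not the library's own classes; the sample values and the helper `confident` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    doc_id: str
    text: str
    span_type: str
    begin: int          # relative to the parent container
    end: int
    begin_in_doc: int   # relative to the full document text
    end_in_doc: int


@dataclass
class Relation:
    subject: Span
    object: Span
    label: Optional[str] = None
    score: Optional[float] = None


def confident(relations: list, threshold: float) -> list:
    """Keep only relations that carry a score at or above the threshold."""
    return [r for r in relations if r.score is not None and r.score >= threshold]


a = Span("d1", "[1]", "ReferenceToBib", 12, 15, 112, 115)
b = Span("d1", "[2]", "ReferenceToBib", 29, 32, 129, 132)
relations = [
    Relation(a, b, label="cites", score=0.91),
    Relation(b, a, label="cites", score=0.40),
    Relation(a, b, label="authored_by"),  # no score attached
]
print([r.label for r in confident(relations, 0.5)])
```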
Span types produced by the WDM parser
| Type | Description |
|---|---|
| Abstract | Abstract section |
| Div | Section div (with optional id) |
| Head | Section heading (with optional n, id) |
| Paragraph | Paragraph |
| ReferenceToBib | In-text citation (with target = bib entry id) |
| ReferenceToFigure | In-text figure reference |
| ReferenceToSection | In-text section cross-reference |
| ReferenceString | Full formatted reference entry |
| SectionHeader | Title + abstract region |
| SectionMain | Body text region |
| SectionFootnote | Back-matter notes region |
| SectionReference | Reference list region |
File details
Details for the file flattentei-0.1.9.tar.gz.
File metadata
- Download URL: flattentei-0.1.9.tar.gz
- Upload date:
- Size: 100.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9dd4ac62b0414e1a4bb36c5619b20e4c0fab9b6e8d0d4e2f7d3bc3d299877732 |
| MD5 | 1bc16e0da3e32e8be3511e0ecca699cf |
| BLAKE2b-256 | 2ba574ff0910ddb8beb560f27d2dce56e04cb5fcffedadfbbf482e2c2aa5f7a8 |
File details
Details for the file flattentei-0.1.9-py3-none-any.whl.
File metadata
- Download URL: flattentei-0.1.9-py3-none-any.whl
- Upload date:
- Size: 17.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 879caea9831be06b4658fb8b7521c30e84cfe4fd21b341c4bedf8d8d696ceca1 |
| MD5 | d3e81302e256c234b978cf68a2b84f37 |
| BLAKE2b-256 | 4d52e058534ae74faa4debfa7974d2cc8b3531419e725f6a8f2a670c80fcd1ca |