A Python utility for extracting RDFa data from TEI-XML documents.

Important: tei-rdfa is currently in beta; feedback from early adopters is welcome.

tei-rdfa

Overview

tei_rdfa() is a dedicated function that extracts RDFa (Resource Description Framework in Attributes) data embedded in TEI (Text Encoding Initiative) XML documents and converts it into a standard RDF graph.

The function resolves namespace prefixes from TEI's native <prefixDef> elements (//tei:encodingDesc/tei:listPrefixDef/tei:prefixDef) rather than from the HTML5-style prefix attribute or XHTML/XML-style xmlns:prefix attributes.[^1]
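To illustrate the idea, here is a minimal sketch of how prefix declarations could be read from <prefixDef> elements. It is not the package's own code: the helper name read_prefix_defs and the sample document are hypothetical, it uses the standard library's ElementTree instead of lxml to stay self-contained, and it assumes every replacementPattern ends in $1.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample header; real documents will declare their own prefixes.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <encodingDesc>
      <listPrefixDef>
        <prefixDef ident="foaf" matchPattern="(.+)"
                   replacementPattern="http://xmlns.com/foaf/0.1/$1"/>
        <prefixDef ident="schema" matchPattern="(.+)"
                   replacementPattern="https://schema.org/$1"/>
      </listPrefixDef>
    </encodingDesc>
  </teiHeader>
</TEI>"""


def read_prefix_defs(xml_text):
    """Map each prefixDef/@ident to the base IRI in its replacementPattern."""
    root = ET.fromstring(xml_text)
    prefixes = {}
    path = ".//tei:encodingDesc/tei:listPrefixDef/tei:prefixDef"
    for el in root.findall(path, TEI_NS):
        ident = el.get("ident")
        pattern = el.get("replacementPattern", "")
        # Simplifying assumption: the pattern is "<base IRI>$1".
        if ident and pattern.endswith("$1"):
            prefixes[ident] = pattern[:-2]
    return prefixes


prefix_map = read_prefix_defs(SAMPLE)
print(prefix_map)
```

The resulting dictionary plays the role that a prefix or xmlns:prefix declaration would play in HTML or XHTML RDFa.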

Features

  • Loads TEI-XML files from local paths or URLs
  • Processes the RDFa attributes about, typeof, property, rel, rev, resource, and content
  • Extracts and resolves namespace prefixes from TEI-specific <prefixDef> elements
  • Generates RDF triples from embedded RDFa information
  • Provides targeted extraction via XPath expressions
  • Returns an RDFLib Graph object for further processing or serialization
  • Implements robust error handling and informative error messages
  • Offers verbose mode with detailed logging
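As a rough picture of what "generating triples from RDFa attributes" means, the sketch below expands a prefixed property name and emits (subject, predicate, object) tuples. It is a deliberately reduced illustration, not the package's algorithm and not a full RDFa processor: the names expand_curie and simple_triples, the PREFIXES table, and the sample XML are all hypothetical, and only elements carrying both about and property are handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical prefix table, as it might be derived from <prefixDef> elements.
PREFIXES = {"foaf": "http://xmlns.com/foaf/0.1/"}

# Hypothetical TEI fragment with RDFa attributes.
SAMPLE = """<listPerson xmlns="http://www.tei-c.org/ns/1.0">
  <person>
    <persName about="#ada" property="foaf:name">Ada Lovelace</persName>
    <birth about="#ada" property="foaf:birthday" content="12-10">10 December</birth>
  </person>
</listPerson>"""


def expand_curie(curie, prefixes):
    """Turn 'prefix:local' into a full IRI when the prefix is known."""
    prefix, sep, local = curie.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return curie


def simple_triples(xml_text, prefixes):
    """Collect one tuple per element that has both @about and @property."""
    triples = []
    for el in ET.fromstring(xml_text).iter():
        subject, prop = el.get("about"), el.get("property")
        if subject is None or prop is None:
            continue
        # @content, when present, overrides the element's text as the object.
        obj = el.get("content")
        if obj is None:
            obj = "".join(el.itertext()).strip()
        triples.append((subject, expand_curie(prop, prefixes), obj))
    return triples


triples = simple_triples(SAMPLE, PREFIXES)
for triple in triples:
    print(triple)
```

A real processor also has to handle subject inheritance from ancestor elements, typeof, rel, rev, and resource, which this sketch leaves out.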

Parameters

  • xmlfile (str): File path or URL to a TEI-XML file (must have .xml or .tei extension)
  • xpath_expr (str, optional): XPath expression to target specific elements for RDFa extraction; will otherwise target the XML root element
  • verbose (bool, default=True): Controls logging output
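The effect of xpath_expr can be pictured as narrowing the set of elements that are scanned for RDFa attributes. The sketch below shows that idea only; rdfa_attributes and the sample data are hypothetical, and it relies on the limited XPath subset of the standard library's ElementTree rather than on lxml.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
RDFA_ATTRS = ("about", "typeof", "property", "rel", "rev", "resource", "content")

# Hypothetical TEI fragment with two annotated persons.
SAMPLE = """<listPerson xmlns="http://www.tei-c.org/ns/1.0">
  <person about="#p1" typeof="foaf:Person"><persName property="foaf:name">Ada</persName></person>
  <person about="#p2" typeof="foaf:Person"><persName property="foaf:name">Charles</persName></person>
</listPerson>"""


def rdfa_attributes(xml_text, path=None):
    """List the RDFa attributes found under the root or under the path matches."""
    root = ET.fromstring(xml_text)
    # Without a path, the whole document (the root element) is the target.
    targets = [root] if path is None else root.findall(path, TEI_NS)
    found = []
    for target in targets:
        for el in target.iter():
            attrs = {k: v for k, v in el.attrib.items() if k in RDFA_ATTRS}
            if attrs:
                found.append(attrs)
    return found


everything = rdfa_attributes(SAMPLE)
second_only = rdfa_attributes(SAMPLE, ".//tei:person[2]")
print(len(everything), second_only)
```

Note that the tei: prefix in the expression has to be bound to the TEI namespace (http://www.tei-c.org/ns/1.0) for the path to match anything.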

Dependencies

  • RDFLib: Core RDF functionality
  • lxml: XML processing and XPath support

Implementation Details

The package includes several helper functions, each handling one aspect of RDFa extraction. Inputs are validated before processing, and the following common problems are caught and reported:

  • Invalid file extensions
  • Invalid XML syntax
  • Invalid URLs or file paths
  • Invalid XPath syntax
  • Erroneous XPath queries

Error messages provide contextual information to facilitate debugging and resolution.
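The checks listed above could look roughly like the following. This is a sketch under stated assumptions, not the package's helpers: check_extension and check_well_formed are hypothetical names, ValueError is an arbitrary choice of exception type, and the standard library's ElementTree stands in for lxml.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse
import xml.etree.ElementTree as ET

ALLOWED_EXTENSIONS = {".xml", ".tei"}


def check_extension(xmlfile):
    """Return the file extension, or raise if it is not .xml or .tei."""
    parsed = urlparse(xmlfile)
    # For URLs, look only at the path part so query strings are ignored.
    path = parsed.path if parsed.scheme in ("http", "https") else xmlfile
    suffix = PurePosixPath(path).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        raise ValueError(f"{xmlfile!r}: expected a .xml or .tei extension")
    return suffix


def check_well_formed(xml_text):
    """Parse the text, re-raising syntax errors with some added context."""
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError as exc:
        raise ValueError(f"invalid XML syntax: {exc}") from exc


print(check_extension("path/to/document.xml"))
print(check_extension("https://example.org/document.tei"))

try:
    check_well_formed("<TEI><text></TEI>")
except ValueError as err:
    print(err)
```

Wrapping the low-level parser error in a message that names the failing step is one simple way to provide the contextual information mentioned above.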

Installation

pip install tei-rdfa

Example Usage

from tei_rdfa import tei_rdfa

# Basic usage
graph = tei_rdfa('path/to/document.xml')

# With XPath to target specific elements
graph = tei_rdfa(
    xmlfile='https://example.org/document.tei',
    xpath_expr='//tei:person[2]',
    verbose=True
)

# Process resulting graph
print(graph.serialize(format='turtle'))

Directory Structure

tei_rdfa/
├── LICENSE
├── README.md
├── pyproject.toml
└── tei_rdfa/
    ├── __init__.py
    ├── requirements.txt
    └── ipynb/
        └── tei_rdfa.ipynb

The repository is organized as follows:

  • tei_rdfa/ contains project metadata and configuration
  • tei_rdfa/tei_rdfa/ contains the package implementation
  • tei_rdfa/tei_rdfa/ipynb/ contains a Jupyter notebook demonstrating usage examples and error scenarios

[^1]: See https://github.com/TEIC/TEI/issues/1860.
