FoLiA-tools contains various Python-based command-line tools for working with FoLiA XML (Format for Linguistic Annotation).
FoLiA is an XML-based format for linguistic annotation, suitable for representing written language resources such as corpora. Its goal is to unify a variety of linguistic annotations in a single rich format, without committing to any particular standard annotation set. Instead, it seeks to accommodate any desired system or tagset, and so offer maximum flexibility. This makes FoLiA language-independent. Due to its generalised setup, it is easy to extend the FoLiA format to suit your custom needs for linguistic annotation.
XML is an inherently hierarchic format. FoLiA does justice to this by utilising a hierarchic, inline setup. We inherit from the D-Coi format, which claims to be loosely based on a minimal subset of TEI. Because of the introduction of a broader paradigm inspired by KAF (KYOTO Annotation Format or Knowledge Annotation Format), FoLiA is not backwards-compatible with D-Coi, i.e. validators for D-Coi will not accept FoLiA XML. It is, however, easy to convert FoLiA to less complex or less verbose formats such as the D-Coi format or plain text. Converters will be provided. Conversion may entail some loss of information if the simpler format has no provisions for particular types of information specified in the FoLiA format.
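To illustrate the hierarchic, inline setup and the lossy nature of conversion to plain text, here is a minimal sketch using only the Python standard library. The sample document is a simplified, hand-written fragment that may not validate against the FoLiA schema, and `to_plain_text` is a hypothetical helper written for this example, not one of the converters shipped with FoLiA-tools.

```python
import xml.etree.ElementTree as ET

# XML namespace of FoLiA documents.
FOLIA_NS = "http://ilk.uvt.nl/folia"

# Simplified illustration only: a sentence (s) contains words (w), and each
# word carries its text (t) and annotations such as part-of-speech (pos) inline.
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="example">
  <text xml:id="example.text">
    <s xml:id="example.s.1">
      <w xml:id="example.s.1.w.1"><t>Hello</t><pos class="INTJ"/></w>
      <w xml:id="example.s.1.w.2"><t>world</t><pos class="N"/></w>
    </s>
  </text>
</FoLiA>"""


def to_plain_text(xml_string):
    """Return one line per sentence, with tokens joined by single spaces.

    Hypothetical helper: all annotations (here, the pos elements) are
    dropped, which is the loss of information mentioned above.
    """
    root = ET.fromstring(xml_string)
    lines = []
    for sentence in root.iter(f"{{{FOLIA_NS}}}s"):
        tokens = [
            word.findtext(f"{{{FOLIA_NS}}}t", default="")
            for word in sentence.iter(f"{{{FOLIA_NS}}}w")
        ]
        lines.append(" ".join(tokens))
    return "\n".join(lines)


print(to_plain_text(SAMPLE))
```

The part-of-speech classes present in the XML do not survive the conversion, because plain text has no place to store them.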
Notable features are:
- XML-based, UTF-8 encoded
- Language and tagset independent
- Can encode both tokenised and untokenised text; the untokenised form remains partially reconstructable even after tokenisation.
- Generalised paradigm, extensible and flexible
- Provenance support for all linguistic annotations: annotator, type (automatic or manual), time.
- Integration into NLP software developed at the ILK Research Group is under way: Ucto, a generic tokenizer, and Frog, a Dutch morpho-syntactic processor.
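The provenance support listed among the features above can be sketched as follows. This assumes that provenance is recorded inline as `annotator`, `annotatortype` and `datetime` attributes on each annotation element; the fragment, the annotator names and the `provenance` helper are all hypothetical and serve only as illustration.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written fragment (assumption: provenance is stored as
# attributes on the annotation elements). Annotator names are made up.
SAMPLE = """<w xmlns="http://ilk.uvt.nl/folia" xml:id="example.w.1">
  <t>boot</t>
  <pos class="N" annotator="hypothetical-tagger" annotatortype="auto"
       datetime="2012-01-01T12:00:00"/>
  <lemma class="boot" annotator="jane" annotatortype="manual"/>
</w>"""


def provenance(word_xml):
    """Collect provenance details for every annotation on a single word."""
    word = ET.fromstring(word_xml)
    records = []
    for annotation in word:
        # Strip the "{namespace}" prefix that ElementTree puts on tag names.
        tag = annotation.tag.split("}", 1)[-1]
        if tag == "t":
            continue  # the text content itself is not an annotation here
        records.append({
            "annotation": tag,
            "class": annotation.get("class"),
            "annotator": annotation.get("annotator"),
            "type": annotation.get("annotatortype"),
            "time": annotation.get("datetime"),  # None when absent
        })
    return records


for record in provenance(SAMPLE):
    print(record)
```

Each record keeps the annotator, whether the annotation was automatic or manual, and the timestamp when one is given.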
FoLiA was written by Maarten van Gompel.