Piereling is a webservice and web-application to convert between a variety of document formats, mostly from and to FoLiA XML. It is intended for NLP pipelines.

Project Status: Active. The project has reached a stable, usable state and is being actively developed.

Piereling is a webservice and web-application to convert between a variety of document formats, and to and from the Format for Linguistic Annotation (FoLiA). It is intended to be used in Natural Language Processing pipelines. Piereling itself does not implement the converters; it relies on numerous specialised conversion tools, in combination with notable third-party tools such as pandoc, to accomplish its goals.

Piereling is the word for earthworm in Limburgish dialect. Data conversion forms the groundwork for linguistic annotation, and these little worms, eating the input file on one end and secreting a conversion on the other, perform that job.

We use FoLiA as our pivot format, so you will mostly encounter conversions from or to FoLiA. FoLiA is a format for linguistic annotation that also incorporates elaborate document structure and mark-up facilities. Another important intermediate format, used in many of our conversions through pandoc, is ReStructuredText, a lightweight markup format. Although pandoc supports a huge number of conversions between all its supported document formats, it is beyond the scope of this project to offer those in the webservice.

Available Conversions

Conversions to FoLiA

From Document and Markup Formats

  • from plain text; uses txt2folia from FoLiA-Tools.

    • In addition to an attempted extraction of text structure (paragraphs) by detecting blank lines, it also supports one-sentence-per-line and one-paragraph-per-line styles.

    • If you can deliver your input as ReStructuredText or Markdown, this is strongly preferred if you want to preserve structure and markup, as these formats resolve a lot of the ambiguity inherent in unspecified plain text.

    • Information loss: None

  • from ReStructuredText; using rst2folia from FoLiA-Tools.

    • Information loss: Minimal to None (almost all rst structures are supported)

  • from Markdown; via ReStructuredText using pandoc and then rst2folia from FoLiA-Tools.

    • Information loss: Minimal to None (most Markdown structures are supported; mathematical formulas are an exception)

  • from HTML; via ReStructuredText using pandoc and then rst2folia from FoLiA-Tools.

    • Information loss: Some; complex layout, complex tables, and imagery will generally get lost. Should only be used for semantically clean and simple HTML rather than complex presentational HTML from the web.

  • from Word (Office Open XML, docx); via ReStructuredText using pandoc and then rst2folia from FoLiA-Tools.

    • Information loss: Some; complex layout, complex tables, and imagery will generally get lost.

    • Note that the older binary DOC format, used by Word up until 2007, is not supported; only the modern DOCX variant is.

  • from OpenDocument Text (LibreOffice, odt); via ReStructuredText using pandoc and then rst2folia from FoLiA-Tools.

    • Information loss: Some; complex layout, complex tables, and imagery will generally get lost.

  • from EPUB; via ReStructuredText using pandoc and then rst2folia from FoLiA-Tools.

    • Information loss: Some; complex layout, complex tables, and imagery will generally get lost.

  • from LaTeX; via ReStructuredText using pandoc and then rst2folia from FoLiA-Tools.

    • Information loss: Some to considerable; complex layout, complex tables, custom packages, math, and imagery will generally get lost.

  • from MediaWiki (as used by Wikipedia); via ReStructuredText using pandoc and then rst2folia from FoLiA-Tools.

    • Information loss: Some; complex layout, complex tables, and Wikipedia-specific elements will generally get lost.

  • from DocBook; via ReStructuredText using pandoc and then rst2folia from FoLiA-Tools.

    • Information loss: Unknown

  • from TEI P5 XML (Text Encoding Initiative); uses tei2folia from FoLiA-Tools.

    • TEI is a very extensive and flexible format that comes in many different forms.

    • Information loss: Our converter will only work for a certain subset of TEI, mostly equivalent to TEI Lite, and may fail on others. Though we support a lot of TEI elements, there is also still a lot that is not covered by the converter. There will be comments in the output for anything that could not be converted properly.

  • from PDF; uses pdftotext from Poppler and then txt2folia from FoLiA-Tools.

    • Only works for PDFs with embedded text, not for imagery which would require OCR!

    • Information loss: Considerable! PDF conversion is notoriously difficult. The layout of the document will most probably get lost in the conversion (especially with multi-column layouts), and the markup will get lost too.

    • Structural conversion is very inaccurate (i.e. paragraphs will not be nicely mapped) and produces ugly FoLiA.

    • Always avoid this conversion if you can!

  • from hOCR; uses FoLiA-hocr from foliautils.

    • hOCR is a standard format outputted by OCR systems such as Tesseract.

    • Information loss: Unknown

  • from ALTO; uses FoLiA-alto from foliautils.

    • ALTO is a standard format for the description of text OCR and layout information of pages for digitized material.

    • Information loss: Unknown
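Most of the document formats above share the same two-step route: pandoc converts the input to ReStructuredText, and rst2folia converts that to FoLiA. The sketch below only plans that chain as a list of commands; it is an illustration, not Piereling's own code. The pandoc options (-f, -t, -o) and reader names are standard pandoc usage. The assumptions are the extension-to-format mapping, the output file names, and that rst2folia accepts a source and a destination path in the docutils front-end style; check rst2folia --help before relying on that.

```python
from pathlib import Path
from typing import List

# Pandoc reader names for the formats routed through ReStructuredText.
# The file extensions chosen as keys are an assumption of this sketch.
PANDOC_READERS = {
    ".md": "markdown",
    ".html": "html",
    ".docx": "docx",
    ".odt": "odt",
    ".epub": "epub",
    ".tex": "latex",
}


def plan_conversion(source: str) -> List[List[str]]:
    """Return the commands (argv lists) that would convert `source` to FoLiA.

    Nothing is executed here; pass each command to subprocess.run yourself.
    """
    path = Path(source)
    suffix = path.suffix.lower()
    rst = str(path.with_suffix(".rst"))
    folia = str(path.with_suffix(".folia.xml"))

    # ASSUMPTION: rst2folia takes SOURCE and DESTINATION as positional
    # arguments, like other docutils front-ends.
    if suffix == ".rst":
        return [["rst2folia", source, folia]]
    if suffix in PANDOC_READERS:
        return [
            ["pandoc", "-f", PANDOC_READERS[suffix], "-t", "rst", "-o", rst, source],
            ["rst2folia", rst, folia],
        ]
    raise ValueError(f"no pandoc route known for {suffix!r} in this sketch")
```

To run the plan, loop over the result and call subprocess.run(command, check=True) on each entry. Formats with a dedicated converter (plain text, TEI, PDF, hOCR, ALTO) are deliberately left out of this sketch.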

From other Linguistic Annotation Formats

  • from NAF (NLP Annotation Format) to FoLiA; uses naf2folia from NAFFoLiAPy.

    • This converter is still in an early and experimental stage.

    • Information loss: Not all annotation layers are implemented yet. Those that are should suffer minimal to no information loss. See the website for details.

  • from CONLL-U; uses conllu2folia from FoLiA-Tools.

    • Information loss: None

  • from Alpino XML; uses alpino2folia from FoLiA-Tools.

    • Information loss: Minimal to None

Conversions from FoLiA

  • to plain text; uses folia2txt from FoLiA-Tools.

    • Information loss: Considerable, as only the text will be outputted and any annotations, most structure, and all markup will be lost. The text itself, however, will be very accurately converted, in either tokenised (if available) or untokenised form.

  • to HTML; this conversion is offered through the default viewer in the web-interface.

    • Information loss: Minimal, but information is represented purely for presentational purposes rather than focussing on semantics.

  • to ReStructuredText; uses folia2rst from FoLiA-Tools.

    • Information loss: Structure and mark-up will be preserved, but annotations will be lost!

Validation & Upgrade

  • FoLiA validation; using foliavalidator from FoLiA-Tools.

  • FoLiA upgrade; upgrades an older FoLiA version to a newer one (mostly intended for FoLiA v1 to FoLiA v2); uses foliaupgrade from FoLiA-Tools.
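If you run the validator locally rather than through the webservice, a thin wrapper like the one below can gate a pipeline step on the result. This is a minimal sketch under two assumptions: that foliavalidator is on your PATH and accepts the document path as its argument, and that it exits with a non-zero status when validation fails. The function name and the injectable run parameter are hypothetical and exist only for this illustration.

```python
import subprocess
from typing import Callable


def is_valid_folia(path: str, run: Callable = subprocess.run) -> bool:
    """Return True if foliavalidator accepts the document at `path`.

    ASSUMPTION: a zero exit status means the document validated.
    `run` is injectable so the function can be exercised without the tool.
    """
    result = run(["foliavalidator", path], capture_output=True, text=True)
    return result.returncode == 0
```

A caller might skip or quarantine any document for which is_valid_folia returns False before handing it to the next tool.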

Installation

Install using pip (preferably in a Python virtual environment):

pip install piereling

Piereling is supplied as part of our LaMachine distribution, which includes all dependencies out of the box. If you don’t use LaMachine, you will need to install certain dependencies yourself for all converters to work. These include:

For production use, we recommend using uwsgi in combination with a webserver such as Apache (with mod_proxy_uwsgi) or nginx. A uwsgi configuration has been generated (piereling.example.ini); it is specific to the host you deploy the webservice on. This in turn loads the wsgi script (piereling.wsgi), which loads your webservice.

Sample configurations for nginx and Apache have been generated as a starting point. Add these to your server and then use the ./startserver_production.sh script to launch CLAM using uwsgi. If you use LaMachine, all this has already been set up for you.

Usage

Run clamservice piereling.piereling to start the development server and then navigate your browser to the address printed.

Web

Piereling is a RESTful webservice and also provides a web-interface for human end users (powered by CLAM). If you instead seek to do conversions locally on the command line then you have no need for Piereling and should simply invoke the aforementioned conversion tools directly.

A public instance of this webservice is available at https://webservices-lst.science.ru.nl/piereling. Register for a free account at https://webservices-lst.science.ru.nl first.
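Because the webservice is powered by CLAM, a client session typically walks through a project lifecycle. The sketch below lists the HTTP requests such a session would make against the public instance. The endpoint layout is an assumption based on CLAM's generic REST interface and is not taken from this README; authentication and the upload body (including the input template identifier, which is specific to each conversion) are not shown. The function name and step labels are hypothetical. Consult the service's own interface for the authoritative details.

```python
from typing import List, Tuple
from urllib.parse import quote

BASE_URL = "https://webservices-lst.science.ru.nl/piereling"


def lifecycle(
    project: str,
    input_name: str,
    output_name: str,
    base_url: str = BASE_URL,
) -> List[Tuple[str, str, str]]:
    """List (step, HTTP method, URL) for one conversion project.

    ASSUMPTION: URLs follow CLAM's generic project/input/output layout.
    No request is sent; this only builds the plan.
    """
    root = f"{base_url.rstrip('/')}/{quote(project)}/"
    return [
        ("create project", "PUT", root),
        ("upload input", "POST", f"{root}input/{quote(input_name)}"),
        ("start conversion", "POST", root),
        ("poll status", "GET", root),
        ("download output", "GET", f"{root}output/{quote(output_name)}"),
        ("delete project", "DELETE", root),
    ]
```

Printing the result of lifecycle("demo", "article.md", "article.folia.xml") gives a checklist you can replay with any HTTP client once you have an account.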
