
Converts FoLiA and TEI files to Alpino XML files


CHAT, FoLiA, PaQu metadata, plaintext and TEI to Alpino XML or PaQu metadata format

Converts CHAT, FoLiA, PaQu metadata, plaintext and TEI XML files to Alpino XML files. Each sentence in the input file is parsed separately.

Usage

Command Line

pip install corpus2alpino
corpus2alpino -s localhost:7001 folia.xml -o alpino.xml

Or from project root:

python -m corpus2alpino -s localhost:7001 folia.xml -o alpino.xml
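
To convert several files in one go, the command above can be driven from a short script. This is a minimal sketch and not part of corpus2alpino: `build_command` and `convert_all` are hypothetical helpers, the output naming scheme is an arbitrary choice, and only the `-s` and `-o` options shown above are used. It assumes an Alpino server is listening on localhost:7001.

```python
import subprocess
import sys
from pathlib import Path


def build_command(source, server="localhost:7001"):
    """Build the corpus2alpino call for one input file (hypothetical helper)."""
    source = Path(source)
    # Write the result next to the input, e.g. folia.xml -> folia.alpino.xml
    output = source.with_suffix(".alpino.xml")
    return [sys.executable, "-m", "corpus2alpino",
            "-s", server, str(source), "-o", str(output)]


def convert_all(sources):
    """Run the converter once per file; raises CalledProcessError on failure."""
    for source in sources:
        subprocess.run(build_command(source), check=True)


# Usage (requires a running Alpino server):
# convert_all(["first.xml", "second.xml"])
```

Running one process per file keeps a failure in one input from silently affecting the others, at the cost of some start-up overhead.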

Library

from corpus2alpino.converter import Converter
from corpus2alpino.annotators.alpino import AlpinoAnnotator
from corpus2alpino.collectors.filesystem import FilesystemCollector
from corpus2alpino.targets.memory import MemoryTarget
from corpus2alpino.writers.lassy import LassyWriter

alpino = AlpinoAnnotator("localhost", 7001)
converter = Converter(
    FilesystemCollector(["folia.xml"]),
    # Annotators are not needed when using the PaQuWriter
    annotators=[alpino],
    # The target can also be ConsoleTarget or FilesystemTarget
    target=MemoryTarget(),
    # Pass True to merge the treebanks; PaQuWriter can be used instead
    writer=LassyWriter(True),
)

# get the Alpino XML output, combined into one treebank XML file
parses = converter.convert()
print(''.join(parses)) # <treebank><alpino_ds ... /></treebank>
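
Because the merged output is a single treebank document, the individual parses can be recovered with the standard library. The sketch below assumes the structure shown in the comment above (a `treebank` root with `alpino_ds` children). The `sample` string stands in for `''.join(parses)`; its contents are simplified and made up for illustration, and `split_treebank` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Stand-in for ''.join(parses); real output contains full Alpino parse trees.
sample = (
    '<treebank>'
    '<alpino_ds version="1.3"><sentence>Dit is een zin.</sentence></alpino_ds>'
    '<alpino_ds version="1.3"><sentence>Nog een zin.</sentence></alpino_ds>'
    '</treebank>'
)


def split_treebank(treebank_xml):
    """Return each alpino_ds element of a merged treebank as its own XML string."""
    root = ET.fromstring(treebank_xml)
    return [ET.tostring(ds, encoding="unicode") for ds in root.iter("alpino_ds")]


trees = split_treebank(sample)
print(len(trees))  # one entry per parsed sentence
```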

Enrichment

Custom properties can be added to (existing) Lassy/Alpino files. This is done with a CSV file that lists the node attributes and values to look for and the custom properties to assign to matching nodes.

For example:

python -m corpus2alpino tests/example_lassy.xml -e tests/enrichment.csv -of lassy

See corpus2alpino.annotators.enrich_lassy for more information.
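
The matching idea behind enrichment can be illustrated with the standard library alone. This is a conceptual sketch, not the implementation in corpus2alpino.annotators.enrich_lassy: the rule format, the `custom_class` property and the `enrich` function are all hypothetical, the rules are written inline rather than read from a CSV file (whose column layout is not reproduced here), and the real tool may store custom properties differently from the plain attributes used below.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: attribute values to look for -> custom properties to assign.
# The real rules come from the CSV file; its column layout is not shown here.
RULES = [
    ({"pt": "ww", "lemma": "zijn"}, {"custom_class": "copula"}),
]


def enrich(root, rules):
    """Add custom properties to every node whose attributes match a rule."""
    for node in root.iter("node"):
        for conditions, properties in rules:
            if all(node.get(key) == value for key, value in conditions.items()):
                node.attrib.update(properties)
    return root


# Simplified stand-in for a Lassy/Alpino parse.
doc = ET.fromstring(
    '<alpino_ds><node cat="top">'
    '<node pt="ww" lemma="zijn" word="is"/>'
    '<node pt="n" lemma="zin" word="zin"/>'
    '</node></alpino_ds>'
)
enrich(doc, RULES)
matched = [n.get("word") for n in doc.iter("node") if n.get("custom_class")]
print(matched)
```

A rule only fires when every listed attribute matches, so adding more conditions to a rule narrows the set of nodes it touches.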

Development

Unit Test

python -m unittest

Upload to PyPI

See: https://packaging.python.org/tutorials/packaging-projects/#generating-distribution-archives

Make sure setuptools, wheel and twine are installed. Then, from the virtualenv:

python setup.py sdist bdist_wheel
twine upload dist/*

Requirements

Installation Instructions for Ubuntu

sudo apt install libfolia-dev libxml2-dev
pip install -r requirements.txt


