Converts FoLiA and TEI files to Alpino XML files
CHAT, FoLiA, PaQu metadata, plaintext and TEI to Alpino XML or PaQu metadata format
Converts CHAT, FoLiA, PaQu metadata, plaintext and TEI XML files to Alpino XML files. Each sentence in the input file is parsed separately.
Usage
Command Line
pip install corpus2alpino
corpus2alpino -s localhost:7001 folia.xml -o alpino.xml
Or from project root:
python -m corpus2alpino -s localhost:7001 folia.xml -o alpino.xml
Library
from corpus2alpino.converter import Converter
from corpus2alpino.annotators.alpino import AlpinoAnnotator
from corpus2alpino.collectors.filesystem import FilesystemCollector
from corpus2alpino.targets.memory import MemoryTarget
from corpus2alpino.writers.lassy import LassyWriter
alpino = AlpinoAnnotator("localhost", 7001)
converter = Converter(FilesystemCollector(["folia.xml"]),
                      # Not needed when using the PaQuWriter
                      annotators=[alpino],
                      # This can also be ConsoleTarget, FilesystemTarget
                      target=MemoryTarget(),
                      # Set to merge treebanks, also possible to use PaQuWriter
                      writer=LassyWriter(True))
# get the Alpino XML output, combined into one treebank XML file
parses = converter.convert()
print(''.join(parses)) # <treebank><alpino_ds ... /></treebank>
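To inspect the merged output further, the treebank string can be read with Python's standard XML parser. The following is a minimal sketch, not part of corpus2alpino: the sample string is a hypothetical stand-in for ''.join(parses), the helper name list_sentences is made up for illustration, and it assumes each alpino_ds element carries a sentence child, as Alpino XML normally does.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for ''.join(parses); real output contains full parse trees.
sample_treebank = (
    '<treebank>'
    '<alpino_ds version="1.3"><sentence>Dit is een zin .</sentence></alpino_ds>'
    '<alpino_ds version="1.3"><sentence>Dit is nog een zin .</sentence></alpino_ds>'
    '</treebank>'
)


def list_sentences(treebank_xml):
    """Return the text of the <sentence> child of every <alpino_ds> element."""
    root = ET.fromstring(treebank_xml)
    # Assumption: one <sentence> element per <alpino_ds>; missing ones become "".
    return [ds.findtext("sentence", default="") for ds in root.iter("alpino_ds")]


sentences = list_sentences(sample_treebank)
print(len(sentences), "parsed sentences")
```

Because each input sentence is parsed separately, the number of alpino_ds elements should match the number of sentences that were converted.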
Enrichment
It is possible to add custom properties to (existing) Lassy/Alpino files. This is done using a CSV file that lists the node attributes and values to look for and the custom properties to assign to matching nodes.
For example:
python -m corpus2alpino tests/example_lassy.xml -e tests/enrichment.csv -of lassy
See corpus2alpino.annotators.enrich_lassy for more information.
Development
Unit Test
python -m unittest
Upload to PyPI
See: https://packaging.python.org/tutorials/packaging-projects/#generating-distribution-archives
Make sure setuptools and wheel are installed. Then from the virtualenv:
python setup.py sdist bdist_wheel
twine upload dist/*
Requirements
- Alpino parser running as a server:
Alpino batch_command=alpino_server -notk server_port=7001
- Python 3.5 or higher (developed using 3.6.3).
- libfolia-dev
- libicu-dev
- libxml2-dev
- libticcutils2-dev
- libucto-dev
- ucto (note: a newer version than the one provided in Ubuntu might be needed)
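Since conversion depends on an Alpino server listening on the configured port, it can help to check that the port accepts connections before starting a long run. The sketch below uses only the Python standard library; the function name alpino_server_reachable is hypothetical and not part of corpus2alpino, and the check only confirms that something accepts TCP connections on that port, not that it is actually Alpino.

```python
import socket


def alpino_server_reachable(host="localhost", port=7001, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError (refused, timed out, unresolved host).
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Host and port match the server command listed in the requirements.
    if alpino_server_reachable("localhost", 7001):
        print("A server is accepting connections on localhost:7001")
    else:
        print("Nothing is listening on localhost:7001; start the Alpino server first")
```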
Installation Instructions for Ubuntu
sudo apt install libfolia-dev libicu-dev libxml2-dev libticcutils2-dev ucto libucto-dev
pip install -r requirements.txt