
Reads .xml-files and parses these with TreeTagger

Project description

treetagger-xml

This is a simple script (process.py) that reads in an .xml file, uses TreeTagger to tag and lemmatize each sentence, and then writes out the input file with the tags and lemmata appended to the word elements.
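The flow described above can be sketched roughly as follows. This is an illustration only, not the code in process.py: the element names `s` and `w`, the attribute names `pos` and `lemma`, and the `stub_tagger` function are all assumptions, and the standard library's xml.etree.ElementTree stands in for lxml so the sketch runs without extra packages.

```python
import xml.etree.ElementTree as ET


def stub_tagger(words):
    """Hypothetical stand-in for TreeTagger: one (pos, lemma) pair per word."""
    return [("UNK", word.lower()) for word in words]


def annotate(root, tag_sentence=stub_tagger):
    """Add pos/lemma attributes to every <w> element inside every <s> element.

    The element and attribute names are assumptions made for this sketch.
    """
    for sentence in root.iter("s"):
        words = list(sentence.iter("w"))
        tokens = [w.text or "" for w in words]
        # Tag the whole sentence at once, then write results back per word.
        for w, (pos, lemma) in zip(words, tag_sentence(tokens)):
            w.set("pos", pos)
            w.set("lemma", lemma)
    return root


root = ET.fromstring("<text><s><w>Dogs</w><w>bark</w></s></text>")
annotate(root)
print(ET.tostring(root, encoding="unicode"))
```

In a real run, `tag_sentence` would presumably call TreeTagger (for instance through treetaggerwrapper) instead of the stub.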

Requirements

TreeTagger

See the TreeTagger website for installation instructions. Note that you'll have to download a parameter file for each language you want to tag/lemmatize. This script has been tested on version 3.2.1 of TreeTagger.

Python

This script runs in Python 3 and requires two external packages: lxml and treetaggerwrapper. The latter requires six to be installed as well. You can install these packages either locally (in a virtualenv) or globally by running:

pip install -r requirements.txt

Running the script

Before running the script, it's best to set an environment variable with the location of TreeTagger. treetaggerwrapper tries to detect the installation automatically, but this is not foolproof. You can set the environment variable (under Linux) with:

export TAGDIR=/opt/treetagger/

Alternatively, you can modify process.py and hard-code your installation path in the TreeTagger instantiation.
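As a rough sketch of both options, the helper below prefers an explicitly passed path and otherwise falls back to the TAGDIR environment variable. The function names `tagger_kwargs` and `make_tagger` are hypothetical and not part of process.py; `TAGLANG` and `TAGDIR` are the keyword arguments that treetaggerwrapper.TreeTagger accepts.

```python
import os


def tagger_kwargs(language, tagdir=None):
    """Build keyword arguments for treetaggerwrapper.TreeTagger.

    An explicit tagdir wins; otherwise TAGDIR from the environment is used.
    If neither is set, TAGDIR is omitted so the wrapper can try to
    auto-detect the installation.
    """
    kwargs = {"TAGLANG": language}
    tagdir = tagdir or os.environ.get("TAGDIR")
    if tagdir:
        kwargs["TAGDIR"] = tagdir
    return kwargs


def make_tagger(language, tagdir=None):
    """Instantiate the tagger (requires treetaggerwrapper and TreeTagger)."""
    import treetaggerwrapper  # imported lazily so the helper above works alone

    return treetaggerwrapper.TreeTagger(**tagger_kwargs(language, tagdir))


print(tagger_kwargs("en", "/opt/treetagger/"))
```

Calling `make_tagger("en")` only succeeds on a machine where TreeTagger and the English parameter file are installed.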

Then, you can run the process.py script. It requires three parameters: your input format (xml or txt), your language of choice for parsing and lemmatizing, and your input file(s). In the examples/ directory you can find some example .xml-files. Run

python process.py xml en examples/en.xml

to process the English example. The resulting file will be named examples/en-out.xml.
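For illustration, the three parameters and the output naming described above could be handled along these lines. This is a sketch under assumptions, not the actual contents of process.py: the argument names `file_format`, `language` and `files` and the helper `output_path` are made up here.

```python
import argparse
from pathlib import PurePosixPath


def build_parser():
    """Command line with three positional parameters, as described above."""
    parser = argparse.ArgumentParser(
        description="Tag and lemmatize files with TreeTagger."
    )
    parser.add_argument("file_format", choices=["xml", "txt"], help="input format")
    parser.add_argument("language", help="language code, e.g. en")
    parser.add_argument("files", nargs="+", help="one or more input files")
    return parser


def output_path(input_path):
    """Insert '-out' before the extension: examples/en.xml -> examples/en-out.xml.

    PurePosixPath assumes forward-slash paths, as in the example command.
    """
    path = PurePosixPath(input_path)
    return str(path.with_name(path.stem + "-out" + path.suffix))


args = build_parser().parse_args(["xml", "en", "examples/en.xml"])
outputs = [output_path(name) for name in args.files]
print(outputs)
```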

Download files


Source Distribution

treetagger-xml-0.1.5.tar.gz (4.5 kB)

Uploaded Source

Built Distribution

treetagger_xml-0.1.5-py3-none-any.whl (6.6 kB)

Uploaded Python 3
