
Reads .xml/.txt-files and parses these with TreeTagger

Project description

treetagger-xml

This is a simple script (process.py) that reads a .xml-file in the OPUS format, uses TreeTagger to tag and lemmatize each sentence, and appends this information to the word elements in the original .xml-file. The script can also tag a .txt-file and then convert the tab-separated output from TreeTagger to the OPUS format.
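To illustrate the idea of appending tagging information to word elements, here is a minimal sketch. It is not the code in process.py: it uses the standard-library xml.etree.ElementTree instead of lxml, a stand-in function instead of TreeTagger, and it assumes sentences are `<s>` elements containing `<w>` elements. The attribute names `pos` and `lemma` are hypothetical.

```python
import xml.etree.ElementTree as ET


def fake_tagger(words):
    """Stand-in for TreeTagger (assumption): one (pos, lemma) pair per word."""
    return [("UNK", word.lower()) for word in words]


def annotate(root, tag_sentence=fake_tagger):
    """Tag every <s> sentence and store the result on its <w> elements."""
    for sentence in root.iter("s"):
        words = sentence.findall(".//w")
        tokens = [w.text or "" for w in words]
        for w, (pos, lemma) in zip(words, tag_sentence(tokens)):
            w.set("pos", pos)      # hypothetical attribute name
            w.set("lemma", lemma)  # hypothetical attribute name
    return root


doc = ET.fromstring(
    '<text><s id="s1"><w id="w1.1">Dogs</w><w id="w1.2">bark</w></s></text>'
)
annotate(doc)
print(ET.tostring(doc, encoding="unicode"))
```

Passing the tagger in as a function keeps the XML handling separate from the tagging step, so a real TreeTagger call could be substituted for `fake_tagger`.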

Requirements

TreeTagger

See the TreeTagger website for installation instructions. Note that you'll have to download a parameter file for each language you want to tag/lemmatize. This script has been tested on version 3.2.1 of TreeTagger.

Python

This script runs in Python 3 and requires two external packages: lxml and treetaggerwrapper. The latter requires six to be installed as well. You can install these packages either locally (in a virtualenv) or globally by running:

pip install -r requirements.txt

Running the script

Before running the script, it's best to set an environment variable with the location of TreeTagger. The treetaggerwrapper tries to detect the installation automatically, but this is not fool-proof. You can set the environment variable (under Linux) with:

export TAGDIR=/opt/treetagger/

Alternatively, you can modify process.py and hard-code your installation path in the TreeTagger instantiation.
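As a rough sketch of how the two options can be combined, the snippet below prefers an explicitly given path, then falls back to the TAGDIR environment variable, and otherwise leaves detection to treetaggerwrapper. The helper names `resolve_tagdir` and `make_tagger` are hypothetical and do not come from process.py; the `TAGLANG` and `TAGDIR` keyword arguments are those of `treetaggerwrapper.TreeTagger`.

```python
import os


def resolve_tagdir(explicit=None, env=None):
    """Return the TreeTagger directory: explicit path first, then $TAGDIR, else None."""
    env = os.environ if env is None else env
    return explicit or env.get("TAGDIR") or None


def make_tagger(language, tagdir=None):
    """Create a tagger, passing TAGDIR only when a directory is known."""
    # Imported lazily so resolve_tagdir works without the package installed.
    import treetaggerwrapper

    kwargs = {"TAGLANG": language}
    found = resolve_tagdir(tagdir)
    if found:
        kwargs["TAGDIR"] = found
    # Without TAGDIR, treetaggerwrapper attempts its own detection.
    return treetaggerwrapper.TreeTagger(**kwargs)
```

For example, `make_tagger("en", "/opt/treetagger/")` would correspond to the hard-coded variant, while `make_tagger("en")` relies on the environment variable or automatic detection.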

Then, you can run the process.py script. It requires three parameters: your input format (xml or txt), your language of choice for parsing and lemmatizing, and your input file(s). In the examples/ directory you can find some example .xml-files. Run

python process.py xml en examples/en.xml

to process the English example. The resulting file will be named examples/en-out.xml.
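The command-line interface described above could be sketched with argparse as follows. This is an illustration under assumptions, not the parser in process.py: the argument names `file_type`, `language` and `files` and the helper `output_name` are hypothetical. Only the three-parameter order and the `-out` output naming are taken from the description.

```python
import argparse
import os


def build_parser():
    """Three positional parameters: input format, language, input file(s)."""
    parser = argparse.ArgumentParser(
        description="Tag OPUS .xml or plain .txt files with TreeTagger"
    )
    parser.add_argument("file_type", choices=["xml", "txt"])
    parser.add_argument("language")
    parser.add_argument("files", nargs="+")
    return parser


def output_name(path):
    """Derive the output path, e.g. examples/en.xml -> examples/en-out.xml."""
    base, ext = os.path.splitext(path)
    return base + "-out" + ext


# Parse the example invocation from an explicit list instead of sys.argv.
args = build_parser().parse_args(["xml", "en", "examples/en.xml"])
print(args.file_type, args.language, [output_name(f) for f in args.files])
```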

Processing plain text

Processing plain text requires you to set the first argument to txt rather than xml. For example:

python process.py txt en examples/en.txt

This script will output a tab-separated file (examples/en.tab) as well as an .xml-file in the OPUS format (examples/en.xml).
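The conversion from TreeTagger's tab-separated output (one token per line: word, tag, lemma) to XML can be sketched as below. This is a simplified illustration, not the conversion in process.py. It assumes that a sentence ends at a token tagged `SENT` (as in the English parameter file; other languages may use a different tag), and the element ids and the `pos` and `lemma` attribute names are hypothetical.

```python
import xml.etree.ElementTree as ET


def tab_to_xml(lines, sentence_end_tag="SENT"):
    """Group word/tag/lemma lines into <s> elements containing <w> elements."""
    root = ET.Element("text")
    sentence = None
    s_count = 0
    w_count = 0
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        word, pos, lemma = parts
        if sentence is None:
            # Start a new sentence and restart the word counter.
            s_count += 1
            w_count = 0
            sentence = ET.SubElement(root, "s", id="s%d" % s_count)
        w_count += 1
        w = ET.SubElement(
            sentence, "w", id="w%d.%d" % (s_count, w_count), pos=pos, lemma=lemma
        )
        w.text = word
        if pos == sentence_end_tag:
            sentence = None  # the next token opens a new sentence
    return root


sample = ["Dogs\tNNS\tdog", "bark\tVVP\tbark", ".\tSENT\t.", "Cats\tNNS\tcat"]
print(ET.tostring(tab_to_xml(sample), encoding="unicode"))
```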
