Reads .xml/.txt-files and parses these with TreeTagger
Project description
treetagger-xml
This is a simple script (process.py) that reads in a .xml-file in the OPUS format, uses TreeTagger to tag and lemmatize each sentence, and appends this information to the word elements in the original .xml-file.
The script also facilitates tagging a .txt-file and then converting the tab-separated output from TreeTagger to the OPUS format.
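As an illustration of the first step, here is a minimal sketch of how tagger output could be attached to the word elements of an OPUS-style file. It is not the code in process.py: the function names, the sample document, and the attribute names pos and lemma are assumptions made for the example, and it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages. A stand-in tagger replaces TreeTagger.

```python
import xml.etree.ElementTree as ET


def annotate_sentences(root, tag_sentence):
    """Add POS and lemma attributes to every <w> element under each <s>.

    tag_sentence is any callable that takes a list of tokens and returns
    one (pos, lemma) pair per token; in practice it would wrap TreeTagger.
    """
    for sentence in root.iter("s"):
        words = sentence.findall(".//w")
        tokens = [w.text or "" for w in words]
        for word, (pos, lemma) in zip(words, tag_sentence(tokens)):
            # "pos" and "lemma" are hypothetical attribute names.
            word.set("pos", pos)
            word.set("lemma", lemma)
    return root


def fake_tagger(tokens):
    # Stand-in for TreeTagger so the sketch runs without it installed.
    return [("X", token.lower()) for token in tokens]


# Hypothetical, heavily simplified OPUS-style input.
SAMPLE = (
    '<document><s id="s1">'
    '<w id="w1.1">Hello</w><w id="w1.2">World</w>'
    "</s></document>"
)

root = annotate_sentences(ET.fromstring(SAMPLE), fake_tagger)
print(ET.tostring(root, encoding="unicode"))
```

Passing the tagger in as a callable keeps the XML handling independent of TreeTagger, which makes this part easy to try out on its own.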
Requirements
TreeTagger
See the TreeTagger website for installation instructions. Note that you'll have to download a parameter file for each language you want to tag/lemmatize. This script has been tested on version 3.2.1 of TreeTagger.
Python
This script runs in Python 3 and requires two external packages: lxml and treetaggerwrapper. The latter requires six to be installed as well. You can install these packages either locally (in a virtualenv) or globally by running:
pip install -r requirements.txt
Running the script
Before running the script, it's best to set an environment variable with the location of TreeTagger. The treetaggerwrapper tries to detect the installation automatically, but this is not fool-proof. You can set the environment variable (under Linux) with:
export TAGDIR=/opt/treetagger/
Alternatively, you can modify process.py and hard-code your installation path in the TreeTagger instantiation.
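The two options above can be combined in a small helper. The sketch below is an assumption about how one might do this, not what process.py does: resolve_tagdir and make_tagger are hypothetical names. It reads TAGDIR from the environment, falls back to a hard-coded default if one is given, and otherwise leaves detection to treetaggerwrapper. TAGLANG and TAGDIR are keyword arguments of treetaggerwrapper.TreeTagger.

```python
import os


def resolve_tagdir(default=None):
    """Return the TreeTagger directory from TAGDIR, or the given default."""
    tagdir = os.environ.get("TAGDIR") or default
    if tagdir is None:
        return None
    return os.path.normpath(os.path.expanduser(tagdir))


def make_tagger(language, default_tagdir=None):
    """Build a tagger, passing TAGDIR explicitly only when one is known."""
    # Imported lazily: this needs the external treetaggerwrapper package
    # and a working TreeTagger installation.
    import treetaggerwrapper

    kwargs = {"TAGLANG": language}
    tagdir = resolve_tagdir(default_tagdir)
    if tagdir is not None:
        kwargs["TAGDIR"] = tagdir
    return treetaggerwrapper.TreeTagger(**kwargs)
```

With TreeTagger installed, make_tagger("en", "/opt/treetagger/") would prefer the environment variable and only use the hard-coded path when TAGDIR is unset.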
Then, you can run the process.py script. It requires three parameters: your input format (xml or txt), your language of choice for tagging and lemmatizing, and your input file(s). In the examples/ directory you can find some example .xml-files. Run
python process.py xml en examples/en.xml
to process the English example. The resulting file will be named examples/en-out.xml.
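For readers who want to see the command-line shape spelled out, the following sketch parses the same three parameters with argparse and derives the -out file name. It is an illustration only: build_parser, output_path, and the argument names are hypothetical and may differ from what process.py actually uses.

```python
import argparse
from pathlib import Path


def build_parser():
    """Three positional parameters: format, language, and input file(s)."""
    parser = argparse.ArgumentParser(description="Tag files with TreeTagger")
    parser.add_argument("file_type", choices=["xml", "txt"], help="input format")
    parser.add_argument("language", help="language code, for example en")
    parser.add_argument("files", nargs="+", help="one or more input files")
    return parser


def output_path(input_file):
    """Insert "-out" before the extension, keeping the directory."""
    path = Path(input_file)
    return path.with_name(path.stem + "-out" + path.suffix)


# Parse the example command line from above instead of sys.argv.
args = build_parser().parse_args(["xml", "en", "examples/en.xml"])
for name in args.files:
    print(name, "->", output_path(name))
```

Using nargs="+" is one way to accept several input files in a single call.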
Processing plain text
Processing plain text requires you to set the first argument to txt rather than xml. For example:
python process.py txt en examples/en.txt
This script will output a tab-separated file (examples/en.tab) as well as an .xml-file in the OPUS format (examples/en.xml).
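The conversion from tab-separated output to XML can be sketched as follows. This is a simplified illustration under several assumptions: each line holds token, tag, and lemma separated by tabs; a sentence ends at the tag SENT (used by the English parameter file, other languages may differ); and the element and attribute names (document, s, w, pos, lemma) are chosen for the example and need not match the script's real output. tab_to_xml is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def tab_to_xml(tab_text, sentence_end_tag="SENT"):
    """Group token/tag/lemma lines into <s> elements with <w> children."""
    root = ET.Element("document")
    sentence = None
    s_count = 0
    for line in tab_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        token, pos, lemma = parts
        if sentence is None:
            # Start a new sentence on the first token after a boundary.
            s_count += 1
            sentence = ET.SubElement(root, "s", id="s{}".format(s_count))
        w_count = len(sentence) + 1
        word = ET.SubElement(
            sentence,
            "w",
            id="w{}.{}".format(s_count, w_count),
            pos=pos,
            lemma=lemma,
        )
        word.text = token
        if pos == sentence_end_tag:
            sentence = None  # the next token opens a new sentence
    return root


# Hypothetical tagger output for two short sentences.
TAB = "This\tDT\tthis\nworks\tVBZ\twork\n.\tSENT\t.\nGood\tJJ\tgood\n"
doc = tab_to_xml(TAB)
print(ET.tostring(doc, encoding="unicode"))
```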
Source Distribution
Built Distribution
File details
Details for the file treetagger-xml-0.1.9.tar.gz.
File metadata
- Download URL: treetagger-xml-0.1.9.tar.gz
- Upload date:
- Size: 7.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.9.1 setuptools/20.7.0 requests-toolbelt/0.8.0 tqdm/4.25.0 CPython/3.5.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | bb5b2087019cc39a7b8d01fda42f09913088890f8ae637d5bd917d93ec0e4d32
MD5 | e0730a533430ba9f5d7c0d5429a03199
BLAKE2b-256 | ba15eeec41718b9717e0dc8d64c597a1101d1560b38aeb85da06b70c59b87ae9
File details
Details for the file treetagger_xml-0.1.9-py3-none-any.whl.
File metadata
- Download URL: treetagger_xml-0.1.9-py3-none-any.whl
- Upload date:
- Size: 8.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.9.1 setuptools/20.7.0 requests-toolbelt/0.8.0 tqdm/4.25.0 CPython/3.5.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | 5af1d448f9f09d128dbffee041134ca9f9c7bd7f4b76ef4bdc84f1d577832ff1
MD5 | 5360ce5aa03934fd0da3565ae06439eb
BLAKE2b-256 | 7baf97bfece467caa26c771f86f213abd9472f9b51c299e4585ef2ebd5f57553