Pipeline that tags pyriksprot Parla-CLARIN XML files
Riksdagens Protokoll Part-Of-Speech Tagging
This package implements part-of-speech tagging of Riksdagens Protokoll
Parla-CLARIN XML files.
Update riksprot tagger system
If the pyriksprot_tagger repository folder already exists:
% cd "pyriksprot-tagger-folder"
% git pull
If the repository folder doesn't exist:
% cd "some-folder"
% git clone git@github.com:welfare-state-analytics/pyriksprot_tagger.git
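The two cases above can be folded into a single step. The following is a minimal sketch, assuming a POSIX-like shell with `git` on the PATH; the function name `clone_or_pull` and its arguments are hypothetical and not part of pyriksprot_tagger.

```shell
# Hypothetical helper (not part of pyriksprot_tagger): pull if the folder
# already holds a git work tree, otherwise clone the repository into it.
clone_or_pull() {
    local url="$1" folder="$2"
    if [ -d "${folder}/.git" ]; then
        git -C "$folder" pull
    else
        git clone "$url" "$folder"
    fi
}

# Usage, with the repository URL from the instructions above:
# clone_or_pull git@github.com:welfare-state-analytics/pyriksprot_tagger.git pyriksprot_tagger
```

Testing for the `.git` folder rather than for the folder alone avoids running `git pull` inside an empty or unrelated directory.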
Update configuration
Update the configuration settings in "pyriksprot-tagger-folder"/.env:
Environment variable | Description |
---|---|
RIKSPROT_DATA_FOLDER | Parent folder of the Riksdagens corpus data folder |
RIKSPROT_REPOSITORY_URL | URL of the corpus repository (https://github.com/welfare-state-analytics/riksdagen-corpus.git) |
RIKSPROT_REPOSITORY_TAG | Target corpus version. Must be a valid GitHub tag |
SPARV_DATADIR | Sparv data folder |
STANZA_DATADIR | Stanza data folder |
OMP_NUM_THREADS | Number of OpenMP threads to use |
RIKSPROT_DATA_FOLDER="/data/riksdagen_corpus_data"
RIKSPROT_REPOSITORY_URL="https://github.com/welfare-state-analytics/riksdagen-corpus.git"
RIKSPROT_REPOSITORY_TAG="v0.4.5"
SPARV_DATADIR="/data/sparv"
STANZA_DATADIR="/data/sparv/models/stanza"
OMP_NUM_THREADS=10
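Before starting a long tagging run it can help to confirm that every variable from the table is present in the `.env` file. The sketch below is one way to do that, assuming a POSIX-like shell with `grep`; the function name `check_env` is hypothetical and the check only looks for a non-empty assignment, not for valid paths or tags.

```shell
# Hypothetical helper (not part of pyriksprot_tagger): report which of the
# documented variables are missing from a .env file.
check_env() {
    local env_file="$1" missing=0 name
    for name in RIKSPROT_DATA_FOLDER RIKSPROT_REPOSITORY_URL RIKSPROT_REPOSITORY_TAG \
                SPARV_DATADIR STANZA_DATADIR OMP_NUM_THREADS; do
        # A variable counts as set if some line starts with NAME= followed
        # by at least one character.
        if ! grep -Eq "^${name}=.+" "$env_file"; then
            echo "missing: ${name}" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
# check_env "pyriksprot-tagger-folder"/.env && echo "configuration looks complete"
```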
Create or update Riksdagens Corpus data repository
% cd "pyriksprot-tagger-folder"
# If you want to create a new clone of the repository:
% make full-clone-repository
# If you want to update existing repository:
% make full-pull-repository
# If you want to save space and do a shallow clone:
% make shallow-update-repository
# Update timestamp of repository work folder files to match last commit timestamp (important!):
% make update-repository-timestamps
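Git does not store file modification times, so a fresh checkout stamps every file with the checkout time. Tools that decide what to rebuild from timestamps (make and snakemake both do) would then treat the whole corpus as changed, which is presumably why the timestamp step is marked as important. The Makefile target's actual implementation is not shown here; the following is only a sketch of the general technique, assuming GNU coreutils (`touch -d @<epoch>`), file names without newlines or special characters, and a hypothetical function name.

```shell
# Hypothetical sketch (not the Makefile's implementation): set each tracked
# file's modification time to the committer date of the last commit that
# touched it. Assumes GNU touch, which accepts "-d @<epoch seconds>".
sync_mtimes_to_commits() {
    local repo="$1" file ts
    git -C "$repo" ls-files | while IFS= read -r file; do
        ts=$(git -C "$repo" log -1 --format=%ct -- "$file")
        if [ -n "$ts" ]; then
            touch -d "@${ts}" "${repo}/${file}"
        fi
    done
}

# Usage (the path is illustrative):
# sync_mtimes_to_commits /data/riksdagen_corpus_data/riksdagen-corpus
```

Running one `git log` per file is slow on a large repository and will not give meaningful results on a shallow clone, where most history is missing.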
Update / tag a new version of RIKSPROT:
Prerequisites:
- Pull latest version of welfare-state-analytics/pyriksprot_tagger
- Update configuration (see above)
If you want to use snakemake:
- Edit options (target name) in workflow/config/config.yml
- Run make annotate (approx. 10 hours run time)
If you want to use the tag-it script (preferred, faster):
- Run PYTHONPATH=. nohup ./tag-it.sh > tag-it.version.log &
Create metadata database:
- Pull or clone latest version of welfare-state-analytics/pyriksprot
- Update the configuration (specify the tag to use) in pyriksprot/.env
- Run make metadata
Create speech corpus
- Pull or clone latest version of welfare-state-analytics/pyriksprot
- Update the configuration (specify the tag to use) in pyriksprot/.env
- Run make extract-speeches-to-feather
How to annotate protocols using snakemake (not recommended)
- Annotate using default settings.
make annotate
- Update a single year (and set cpu count).
make annotate YEAR=1960 CPU_COUNT=1
- Run in the background, either through make or by calling snakemake directly:
$ nohup make annotate PROCESSES_COUNT=4 >& run.log &
or
$ nohup poetry run snakemake -j4 --keep-going --keep-target-files &
nohup poetry run snakemake --config -j4 --keep-going --keep-target-files &
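The single-year form above can be repeated over a range of years. The sketch below wraps the documented `make annotate YEAR=... CPU_COUNT=...` call in a loop; the function name `annotate_years` and the `DRY_RUN` switch are hypothetical, and it assumes that years can be annotated independently of each other.

```shell
# Hypothetical wrapper (not part of pyriksprot_tagger): run "make annotate"
# once per year in an inclusive range. Set DRY_RUN=1 to print the commands
# instead of running them.
annotate_years() {
    local first="$1" last="$2" cpus="${3:-1}" year
    for year in $(seq "$first" "$last"); do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "make annotate YEAR=${year} CPU_COUNT=${cpus}"
        else
            # Stop at the first failing year so the error is not buried.
            make annotate YEAR="$year" CPU_COUNT="$cpus" || return 1
        fi
    done
}

# Usage: preview the commands for 1960 through 1962 with four CPUs.
# DRY_RUN=1 annotate_years 1960 1962 4
```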
Install from PyPI (not recommended)
Verify your current Python version (pyenv is recommended for easily switching between versions).
- Create a new Python virtual environment (sandbox):
cd /some/folder
mkdir riksprot_tagging
cd riksprot_tagging
python -m venv .venv
source .venv/bin/activate
- Install the pipeline and run the setup script.
pip install pyriksprot_tagger
setup-pipeline
To tag protocols, first activate the installed environment, then follow the steps above on how to annotate protocols using snakemake.
cd /some/folder/riksprot_tagging
source .venv/bin/activate