Pipeline that tags pyriksprot Parla-CLARIN XML files
Riksdagens Protokoll Part-Of-Speech Tagging
This package implements part-of-speech tagging of Riksdagens Protokoll
Parla-CLARIN XML files.
Update riksprot tagger system
If the pyriksprot_tagger repository folder already exists:
% cd "pyriksprot-tagger-folder"
% git pull
If the repository folder doesn't exist yet:
% cd "some-folder"
% git clone git@github.com:welfare-state-analytics/pyriksprot_tagger.git
Update configuration
Update the configuration settings in "pyriksprot-tagger-folder"/.env:
Environment variable | Description |
---|---|
RIKSPROT_DATA_FOLDER | Parent folder (location) of Riksdagens corpus data folder |
RIKSPROT_REPOSITORY_URL | https://github.com/welfare-state-analytics/riksdagen-corpus.git |
RIKSPROT_REPOSITORY_TAG | Target corpus version. Must be a valid tag in the GitHub repository |
SPARV_DATADIR | Sparv data folder |
STANZA_DATADIR | Stanza data folder |
OMP_NUM_THREADS | Number of OpenMP threads to use |
RIKSPROT_DATA_FOLDER="/data/riksdagen_corpus_data"
RIKSPROT_REPOSITORY_URL="https://github.com/welfare-state-analytics/riksdagen-corpus.git"
RIKSPROT_REPOSITORY_TAG="v0.4.5"
SPARV_DATADIR="/data/sparv"
STANZA_DATADIR="/data/sparv/models/stanza"
OMP_NUM_THREADS=10
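Before starting a long tagging run it can help to confirm that every variable from the table is actually set. The following is a minimal sketch, not part of pyriksprot_tagger: the helper names `parse_env`, `missing_keys` and `check_env_file` are hypothetical, and the parser assumes plain KEY=VALUE lines like those in the example above.

```python
from pathlib import Path

# Variable names copied from the configuration table above.
REQUIRED_KEYS = (
    "RIKSPROT_DATA_FOLDER",
    "RIKSPROT_REPOSITORY_URL",
    "RIKSPROT_REPOSITORY_TAG",
    "SPARV_DATADIR",
    "STANZA_DATADIR",
    "OMP_NUM_THREADS",
)


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes, as used in the example .env above.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


def check_env_file(path: str = ".env") -> list[str]:
    """Read an .env file (hypothetical default location) and list missing keys."""
    return missing_keys(parse_env(Path(path).read_text(encoding="utf-8")))
```

Calling `check_env_file("pyriksprot-tagger-folder/.env")` would return an empty list when all six variables are present. The sketch does not check that the folders exist or that the tag is valid.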
Create or update Riksdagens Corpus data repository
% cd "pyriksprot-tagger-folder"
# If you want to create a new clone of the repository:
% make full-clone-repository
# If you want to update an existing repository:
% make full-pull-repository
# If you want to save space and do a shallow clone instead:
% make shallow-update-repository
# Update the timestamps of the files in the repository work folder to match the last commit timestamp (important!):
% make update-repository-timestamps
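The source does not say why this step matters. One plausible reason, stated here as an assumption, is that a fresh clone or pull gives files the checkout time as their modification time, so tools that compare modification times (Snakemake does this by default) may treat every protocol as changed. The sketch below illustrates the general technique only; it is not the implementation behind the make target, and the function names are hypothetical.

```python
import os
import subprocess
from pathlib import Path


def last_commit_timestamp(repo: str, relative_path: str) -> int:
    """Unix time of the last commit that touched relative_path.

    Raises ValueError if the file has no commit history (empty git output).
    """
    result = subprocess.run(
        ["git", "-C", repo, "log", "-1", "--format=%ct", "--", relative_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return int(result.stdout.strip())


def set_mtime(path: str, timestamp: int) -> None:
    """Set both access and modification time of path to timestamp."""
    os.utime(path, (timestamp, timestamp))


def sync_timestamps(repo: str, relative_paths: list[str]) -> None:
    """Give each listed file the timestamp of its last commit."""
    for rel in relative_paths:
        set_mtime(str(Path(repo) / rel), last_commit_timestamp(repo, rel))
```

Running one `git log` per file is slow on a large corpus; it is used here only because it keeps the idea easy to follow.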
Update / tag a new version of RIKSPROT:
Prerequisites:
- Pull latest version of welfare-state-analytics/pyriksprot_tagger
- Update configuration (see above)
If you want to use snakemake:
- Edit options (target name) in workflow/config/config.yml
- Run make annotate (approx. 10 hours run time)
If you want to use the tag-it script (preferred, faster):
- Run PYTHONPATH=. nohup ./tag-it.sh > tag-it.version.log &
Create metadata database:
- Pull or clone latest version of welfare-state-analytics/pyriksprot
- Update the configuration in pyriksprot/.env (specify the tag to use)
- Run make metadata
Create speech corpus
- Pull or clone latest version of welfare-state-analytics/pyriksprot
- Update the configuration in pyriksprot/.env (specify the tag to use)
- Run make extract-speeches-to-feather
How to annotate protocols using snakemake (not recommended)
- Annotate using default settings.
make annotate
- Update a single year (and set cpu count).
make annotate YEAR=1960 CPU_COUNT=1
- Run in the background, either through make or by calling snakemake directly:
$ nohup make annotate PROCESSES_COUNT=4 >& run.log &
or
$ nohup poetry run snakemake -j4 --keep-going --keep-target-files &
- Pass workflow configuration overrides with --config (the key=value pairs are not given here):
$ nohup poetry run snakemake --config -j4 --keep-going --keep-target-files &
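When several years need to be re-annotated, the single-year command above can be repeated in a loop. This is a minimal sketch under the assumption that `make annotate YEAR=... CPU_COUNT=...` behaves as shown above; the helpers `annotate_command` and `annotate_years` are hypothetical and not part of the package. By default it only prints the commands.

```python
import subprocess
from typing import Iterable


def annotate_command(year: int, cpu_count: int = 1) -> list[str]:
    """Build the make invocation for one year (variable names from the text above)."""
    return ["make", "annotate", f"YEAR={year}", f"CPU_COUNT={cpu_count}"]


def annotate_years(
    years: Iterable[int], cpu_count: int = 1, dry_run: bool = True
) -> list[list[str]]:
    """Print, or run when dry_run is False, one make call per year, in order."""
    commands = [annotate_command(year, cpu_count) for year in years]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # Stop at the first failing year instead of continuing silently.
            subprocess.run(command, check=True)
    return commands
```

For example, `annotate_years(range(1960, 1963), cpu_count=4)` lists the three commands for 1960 to 1962 without executing them.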
Install from PyPI (not recommended)
- Verify the current Python version (pyenv is recommended for switching easily between versions).
- Create a new Python virtual environment (sandbox):
cd /some/folder
mkdir riksprot_tagging
cd riksprot_tagging
python -m venv .venv
source .venv/bin/activate
- Install the pipeline and run setup script.
pip install pyriksprot_tagger
setup-pipeline
To tag protocols, first activate the installed environment, then follow the steps above on how to annotate protocols using snakemake.
cd /some/folder/riksprot_tagging
source .venv/bin/activate
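A quick way to confirm that the environment is active and the package is installed is to query the interpreter itself. The sketch below uses only the Python standard library; the function names are hypothetical, and it reports the installed version without checking that the setup script has been run.

```python
import sys
from importlib import metadata
from typing import Optional


def installed_version(distribution: str = "pyriksprot_tagger") -> Optional[str]:
    """Return the installed version of the distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a venv-style virtual environment."""
    return sys.prefix != sys.base_prefix


def report() -> str:
    """Summarise Python version, virtual environment status and package version."""
    python = ".".join(str(part) for part in sys.version_info[:3])
    return (
        f"python={python} "
        f"virtualenv={in_virtualenv()} "
        f"pyriksprot_tagger={installed_version()}"
    )
```

Printing `report()` after `source .venv/bin/activate` should show `virtualenv=True` and a version string rather than `None`.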