Pipeline that tags pyriksprot Parla-CLARIN XML files
Riksdagens Protokoll Part-Of-Speech Tagging
This package implements part-of-speech tagging of Riksdagens Protokoll
Parla-CLARIN XML files.
Update riksprot tagger system
If the pyriksprot_tagger repository folder already exists:
% cd "pyriksprot-tagger-folder"
% git pull
If the repository folder doesn't exist:
% cd "some-folder"
% git clone git@github.com:welfare-state-analytics/pyriksprot_tagger.git
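The two cases above can be folded into one helper. This is a minimal sketch, not part of pyriksprot_tagger: the function names `repo_action` and `sync_tagger_repo` are hypothetical, and the target folder is whatever you pass in.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not shipped with pyriksprot_tagger).

# Print "pull" if the folder already holds a Git checkout, else "clone".
repo_action() {
    if [ -d "$1/.git" ]; then
        echo "pull"
    else
        echo "clone"
    fi
}

# Update an existing checkout, or clone a fresh one into the target folder.
sync_tagger_repo() {
    local target_dir="$1"
    local repo_url="git@github.com:welfare-state-analytics/pyriksprot_tagger.git"

    if [ "$(repo_action "${target_dir}")" = "pull" ]; then
        git -C "${target_dir}" pull
    else
        git clone "${repo_url}" "${target_dir}"
    fi
}

# Usage (the path is a placeholder):
#   sync_tagger_repo "some-folder/pyriksprot_tagger"
```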
Update configuration
Update the configuration settings in "pyriksprot-tagger-folder"/.env:
Environment variable | Description |
---|---|
RIKSPROT_DATA_FOLDER | Parent folder (location) of Riksdagens corpus data folder |
RIKSPROT_REPOSITORY_URL | https://github.com/welfare-state-analytics/riksdagen-corpus.git |
RIKSPROT_REPOSITORY_TAG | Target corpus version. Must be a valid GitHub tag |
SPARV_DATADIR | Sparv data folder |
STANZA_DATADIR | Stanza data folder |
OMP_NUM_THREADS | Number of threads to use |
RIKSPROT_DATA_FOLDER="/data/riksdagen_corpus_data"
RIKSPROT_REPOSITORY_URL="https://github.com/welfare-state-analytics/riksdagen-corpus.git"
RIKSPROT_REPOSITORY_TAG="v0.4.5"
SPARV_DATADIR="/data/sparv"
STANZA_DATADIR="/data/sparv/models/stanza"
OMP_NUM_THREADS=10
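Before starting a long tagging run it can help to confirm that every setting in the table is present. The sketch below assumes the .env file uses plain `NAME=value` lines as in the example above; `check_env_file` is a hypothetical helper, not something the package provides.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list expected settings that are absent from an env file.
# Returns 0 if all settings are present, 1 otherwise.

check_env_file() {
    local env_file="$1"
    local missing=0
    local name
    for name in RIKSPROT_DATA_FOLDER RIKSPROT_REPOSITORY_URL \
                RIKSPROT_REPOSITORY_TAG SPARV_DATADIR STANZA_DATADIR \
                OMP_NUM_THREADS; do
        # Present means: a line starts with NAME= followed by at least one
        # character. A quoted empty value ("") therefore still counts as set.
        if ! grep -Eq "^${name}=.+" "${env_file}"; then
            echo "missing: ${name}"
            missing=1
        fi
    done
    return "${missing}"
}

# Usage:
#   check_env_file .env && echo "configuration looks complete"
```

The check only looks at names, so it will not catch a wrong path or a tag that does not exist in the corpus repository.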
Create or update Riksdagens Corpus data repository
% cd "pyriksprot-tagger-folder"
# If you want to create a new clone of the repository:
% make full-clone-repository
# If you want to update an existing repository:
% make full-pull-repository
# If you want to save space and do a shallow clone:
% make shallow-update-repository
# Update timestamp of repository work folder files to match last commit timestamp (important!):
% make update-repository-timestamps
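The Makefile target itself is not reproduced here, so the following is only a sketch of the general technique its description suggests: setting each tracked file's modification time to the time of the last commit that touched it. A fresh clone stamps every file with the checkout time, which matters for any workflow that compares modification times; whether that is the reason this step is marked important is an assumption. The function name `restore_commit_mtimes` is hypothetical, and `touch -d "@..."` assumes GNU coreutils.

```shell
#!/usr/bin/env bash
# Sketch only; the actual make target may work differently.

restore_commit_mtimes() {
    local repo_dir="$1"
    local file commit_time
    # -z / -d '' keep file names with spaces or newlines intact.
    git -C "${repo_dir}" ls-files -z | while IFS= read -r -d '' file; do
        # %ct is the committer date as a Unix timestamp.
        commit_time=$(git -C "${repo_dir}" log -1 --format=%ct -- "${file}")
        # Skip files without history (for example, staged but uncommitted).
        [ -n "${commit_time}" ] || continue
        # GNU touch accepts "@<seconds since epoch>" as a date.
        touch -d "@${commit_time}" "${repo_dir}/${file}"
    done
}

# Usage (the path is a placeholder):
#   restore_commit_mtimes /data/riksdagen_corpus_data/riksdagen-corpus
```

This runs one `git log` per file, so it is slow on a large corpus and will not recover history from a shallow clone.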
Update / tag a new version of RIKSPROT:
Prerequisites:
- Pull latest version of welfare-state-analytics/pyriksprot_tagger
- Update configuration (see above)
If you want to use snakemake:
- Edit options (target name) in workflow/config/config.yml
- Run make annotate (approx. 10 hours run time)
If you want to use tag-it script (preferred, faster):
- Run PYTHONPATH=. nohup ./tag-it.sh > tag-it.version.log &
Create metadata database:
- Pull or clone latest version of welfare-state-analytics/pyriksprot
- Update configuration (specify tag) to use in pyriksprot/.env
- Run make metadata
Create speech corpus
- Pull or clone latest version of welfare-state-analytics/pyriksprot
- Update configuration (specify tag) to use in pyriksprot/.env
- Run make extract-speeches-to-feather
How to annotate protocols using snakemake (not recommended)
- Annotate using default settings.
make annotate
- Update a single year (and set cpu count).
make annotate YEAR=1960 CPU_COUNT=1
- Run in the background, either through make or by calling snakemake directly:
$ nohup make annotate PROCESSES_COUNT=4 >& run.log &
or
$ nohup poetry run snakemake -j4 --keep-going --keep-target-files &
nohup poetry run snakemake --config -j4 --keep-going --keep-target-files &
Install from PyPI (not recommended)
Verify the current Python version (pyenv is recommended for easy switching between versions).
- Create a new Python virtual environment (sandbox):
cd /some/folder
mkdir riksprot_tagging
cd riksprot_tagging
python -m venv .venv
source .venv/bin/activate
- Install the pipeline and run setup script.
pip install pyriksprot_tagger
setup-pipeline
To tag protocols, first activate the installed environment, then follow the steps above for annotating protocols with snakemake.
cd /some/folder/riksprot_tagging
source .venv/bin/activate