Pipeline that tags pyriksprot Parla-CLARIN XML files
Riksdagens Protokoll Part-Of-Speech Tagging
This package implements part-of-speech tagging of Riksdagens Protokoll
Parla-CLARIN XML files.
Update riksprot tagger system
If the pyriksprot_tagger repository folder already exists:
% cd "pyriksprot-tagger-folder"
% git pull
If the repository folder doesn't exist:
% cd "some-folder"
% git clone git@github.com:welfare-state-analytics/pyriksprot_tagger.git
Update configuration
Update the configuration settings in "pyriksprot-tagger-folder"/.env:
Environment variable | Description |
---|---|
RIKSPROT_DATA_FOLDER | Parent folder (location) of the Riksdagens corpus data folder |
RIKSPROT_REPOSITORY_URL | URL of the Riksdagens corpus repository (https://github.com/welfare-state-analytics/riksdagen-corpus.git) |
RIKSPROT_REPOSITORY_TAG | Target corpus version. Must be a valid GitHub tag |
SPARV_DATADIR | Sparv data folder |
STANZA_DATADIR | Stanza data folder |
OMP_NUM_THREADS | Number of threads to use |
RIKSPROT_DATA_FOLDER="/data/riksdagen_corpus_data"
RIKSPROT_REPOSITORY_URL="https://github.com/welfare-state-analytics/riksdagen-corpus.git"
RIKSPROT_REPOSITORY_TAG="v0.4.5"
SPARV_DATADIR="/data/sparv"
STANZA_DATADIR="/data/sparv/models/stanza"
OMP_NUM_THREADS=10
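A typo in .env can surface hours into a tagging run, so it may help to check the file first. The following is a minimal sketch, not part of pyriksprot_tagger: the helper names `parse_env` and `check_settings` are hypothetical, and it assumes that every variable in the table above is required and that the file uses plain KEY=VALUE lines.

```python
# Hypothetical helper, not part of pyriksprot_tagger.
# Assumption: all variables from the configuration table are required.
REQUIRED_KEYS = (
    "RIKSPROT_DATA_FOLDER",
    "RIKSPROT_REPOSITORY_URL",
    "RIKSPROT_REPOSITORY_TAG",
    "SPARV_DATADIR",
    "STANZA_DATADIR",
    "OMP_NUM_THREADS",
)


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def check_settings(settings: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = [f"missing: {key}" for key in REQUIRED_KEYS if not settings.get(key)]
    threads = settings.get("OMP_NUM_THREADS", "")
    if threads and not (threads.isdigit() and int(threads) > 0):
        problems.append("OMP_NUM_THREADS must be a positive integer")
    return problems
```

To use it, read "pyriksprot-tagger-folder"/.env as text, pass the result of `parse_env` to `check_settings`, and print whatever it returns.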
Create or update Riksdagens Corpus data repository
% cd "pyriksprot-tagger-folder"
# If you want to create a new clone of the repository:
% make full-clone-repository
# If you want to update existing repository:
% make full-pull-repository
# If you want to save space and do a shallow clone:
% make shallow-update-repository
# Update timestamp of repository work folder files to match last commit timestamp (important!):
% make update-repository-timestamps
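The text above does not say why the timestamp step matters. One plausible reason is that Git sets file modification times to the checkout time, while timestamp-driven tools such as make and snakemake compare modification times to decide what is out of date. The sketch below illustrates the general idea only; it is not the implementation behind `make update-repository-timestamps`, and the function names are hypothetical.

```python
import os
import subprocess
from pathlib import Path


def last_commit_time(repo: Path, relative_path: str) -> int:
    """Return the Unix time of the last commit that touched relative_path.

    Raises ValueError if the file has no commit history (empty git output).
    """
    out = subprocess.run(
        ["git", "log", "-1", "--format=%ct", "--", relative_path],
        cwd=repo, check=True, capture_output=True, text=True,
    ).stdout.strip()
    return int(out)


def set_mtime(path: Path, timestamp: int) -> None:
    """Set both access and modification time of path to timestamp."""
    os.utime(path, (timestamp, timestamp))


def sync_timestamps(repo: Path, relative_paths: list[str]) -> None:
    """Give each listed file the timestamp of its last commit."""
    for rel in relative_paths:
        set_mtime(repo / rel, last_commit_time(repo, rel))
```

Calling `git log` once per file is slow on a large corpus, so a real implementation would likely batch the lookups. The per-file version is kept here because it is easier to follow.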
Update / tag a new version of RIKSPROT:
Prerequisites:
- Pull latest version of welfare-state-analytics/pyriksprot_tagger
- Update configuration (see above)
If you want to use snakemake:
- Edit options (target name) in workflow/config/config.yml
- Run make annotate (run time approx. 10 hours)
If you want to use tag-it script (preferred, faster):
- Run PYTHONPATH=. nohup ./tag-it.sh > tag-it.version.log &
Create metadata database:
- Pull or clone latest version of welfare-state-analytics/pyriksprot
- Update the configuration in pyriksprot/.env (specify the tag to use)
- Run make metadata
Create speech corpus:
- Pull or clone latest version of welfare-state-analytics/pyriksprot
- Update the configuration in pyriksprot/.env (specify the tag to use)
- Run make extract-speeches-to-feather
How to annotate protocols using snakemake (not recommended)
- Annotate using default settings.
make annotate
- Update a single year (and set cpu count).
make annotate YEAR=1960 CPU_COUNT=1
- Run in the background with nohup, either through make or by calling snakemake directly:
$ nohup make annotate PROCESSES_COUNT=4 >& run.log &
or
$ nohup poetry run snakemake -j4 --keep-going --keep-target-files &
nohup poetry run snakemake --config -j4 --keep-going --keep-target-files &
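When several years need re-annotating, the per-year invocation shown above can be scripted. This is a minimal sketch that only uses the `YEAR` and `CPU_COUNT` make variables from the example above. The function names are hypothetical, and it assumes the script is run from "pyriksprot-tagger-folder".

```python
import subprocess


def annotate_command(year=None, cpu_count=None):
    """Build the argv for `make annotate`, adding YEAR / CPU_COUNT when given."""
    argv = ["make", "annotate"]
    if year is not None:
        argv.append(f"YEAR={year}")
    if cpu_count is not None:
        argv.append(f"CPU_COUNT={cpu_count}")
    return argv


def annotate_years(years, cpu_count=None, dry_run=True):
    """Build one make invocation per year; execute them unless dry_run is set."""
    commands = [annotate_command(year, cpu_count) for year in years]
    if not dry_run:
        for argv in commands:
            # check=True stops at the first failing year.
            subprocess.run(argv, check=True)
    return commands
```

With the default `dry_run=True` nothing is executed and the function only returns the commands it would run, which makes it easy to inspect them before starting a long job.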
Install from PyPI (not recommended)
Verify the current Python version (pyenv is recommended for easy switching between versions).
- Create a new Python virtual environment (sandbox):
cd /some/folder
mkdir riksprot_tagging
cd riksprot_tagging
python -m venv .venv
source .venv/bin/activate
- Install the pipeline and run setup script.
pip install pyriksprot_tagger
setup-pipeline
To tag protocols, first activate the installed environment, then follow the steps above on how to annotate protocols using snakemake.
cd /some/folder/riksprot_tagging
source .venv/bin/activate