Pipeline that tags pyriksprot Parla-Clarin XML files
Riksdagens Protokoll Part-Of-Speech Tagging
This package implements part-of-speech tagging of Riksdagens Protokoll
Parla-CLARIN (https://clarin-eric.github.io/parla-clarin/) files.
Prerequisites
- The workflow makes use of GNU make, git, pyenv and poetry.
- Latest version of welfare-state-analytics/pyriksprot.
- Latest version of welfare-state-analytics/pyriksprot_tagger.
- NLP models for Sparv and Stanza installed.
- A local copy of riksdagen-corpus Github repository.
Tagging Workflow
- Update the pyriksprot-tagger configuration file (.env).
- Update the riksdagen-corpus repository.
- Run the tag.sh script:
PYTHONPATH=. nohup ./tag.sh --target-folder /path/to/output/data > tag-it.version.log &
You can also execute a predefined make recipe:
make tag-it
If you run tag.sh without parameters, the values found in .env will be used. You can also specify parameters as command-line options:
usage: ./tag.sh [--data-folder folder] [--source-pattern pattern] --target-folder folder --tag tag [--force] [--update] [--max-procs n]
Creates new database using source as template. Source defaults to production.
--data-folder source root folder
--source-pattern source folder pattern
--target-folder target folder
--tag source corpus tag
--force drop target if exists
--update update target if exists
--max-procs max number of parallel jobs
Note that tag.sh will raise an error if the tag checked out in the Git repository and the tag specified in .env (or as a parameter) do not match.
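The consistency check described above can be sketched in Python. This is only an illustration of the idea, not the logic used by tag.sh itself: the function names `checked_out_tag` and `assert_tag_matches` are hypothetical, and the sketch assumes the corpus repository is checked out at an exact tag, so that `git describe --tags --exact-match` succeeds.

```python
import subprocess


def checked_out_tag(repository_folder: str) -> str:
    """Return the tag HEAD points at (hypothetical helper, not part of tag.sh).

    Raises subprocess.CalledProcessError if HEAD is not exactly at a tag.
    """
    result = subprocess.run(
        ["git", "-C", repository_folder, "describe", "--tags", "--exact-match"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def assert_tag_matches(actual_tag: str, expected_tag: str) -> None:
    """Raise ValueError if the checked-out tag differs from the configured one."""
    if actual_tag != expected_tag:
        raise ValueError(
            f"checked-out tag {actual_tag!r} does not match configured tag {expected_tag!r}"
        )
```

A possible use, with a placeholder path and tag: `assert_tag_matches(checked_out_tag("/path/to/riksdagen-corpus"), "vx.y.z")`.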
Metadata Workflow
This workflow processes the corpus metadata and generates an SQLite relational database. This database is used by the Westac Notebooks when filtering and pivoting data based on speaker, party, etc. Use welfare-state-analytics/pyriksprot to create or update the metadata:
- Update pyriksprot/.env and set the current tag.
- Run make metadata to create a metadata database for the current tag.
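To get a quick overview of what the generated database contains, the tables can be listed with Python's standard sqlite3 module. This is a minimal sketch: the function name `list_tables` is hypothetical, and the database path is whatever location your configuration writes to (not specified here).

```python
import sqlite3
from contextlib import closing


def list_tables(database_path: str) -> list[str]:
    """Return the names of all tables in an SQLite database, sorted by name.

    Note: sqlite3.connect creates an empty database file if the path does not
    exist, in which case the result is an empty list.
    """
    with closing(sqlite3.connect(database_path)) as connection:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]
```

Comparing the table lists of two database versions is a cheap first check for schema changes before looking at individual columns.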
Detailed workflow
Due to potentially breaking changes in the metadata, we need to find the differences between the new and old versions of the metadata. If fields or coded values have been added or changed, or any other breaking change has been made, then the scripts that process the metadata most likely need to be updated. Data updates are made using both SQL scripts and Python scripts.
- Identify breaking changes.
  - Download the previous and current metadata into two separate folders:
    metadata2db download v0.9.0 ./tmp/metadata/v0.9.0
    metadata2db download v0.10.0 ./tmp/metadata/v0.10.0
    💡 Alt: python pyriksprot/scripts/metadata2db.py download v0.10.0 ./tmp/metadata/v0.10.0
    💡 Use moshfeu.compare-folders to compare folders in vscode.
  - If you find structural differences then you need to file an issue and request that the system be updated to deal with the changes. Module pyriksprot.sql contains SQL scripts for the metadata schema and (some) updates. Furthermore, some schema changes need to be handled in the pyriksprot.module module (e.g. pyriksprot.module.config). Changes may of course also affect the penelope corpus pipeline.
- Create a metadata database using welfare-state-analytics/pyriksprot for the given tag:
  - Update pyriksprot/.env (e.g. tag).
  - Run the metadata recipe:
    make metadata
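As an alternative to a graphical folder comparison, the two downloaded metadata folders from the first step can be compared with Python's standard filecmp module. This is a minimal sketch, not part of pyriksprot: the function name `summarize_differences` is hypothetical, and only the top level of each folder is compared.

```python
import filecmp


def summarize_differences(old_folder: str, new_folder: str) -> dict[str, list[str]]:
    """Summarize top-level differences between two metadata folders.

    dircmp uses a shallow comparison: files whose type, size and modification
    time are all identical are treated as equal without reading their contents.
    """
    comparison = filecmp.dircmp(old_folder, new_folder)
    return {
        "removed": sorted(comparison.left_only),   # only in the old version
        "added": sorted(comparison.right_only),    # only in the new version
        "changed": sorted(comparison.diff_files),  # present in both, but different
    }
```

For example, `summarize_differences("./tmp/metadata/v0.9.0", "./tmp/metadata/v0.10.0")` lists the files that were removed, added or changed between the two tags. Added or removed files are a strong hint of a structural change; changed files still need to be inspected for new columns or coded values.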
Speech Corpus Workflow
- Create a default speech corpus using welfare-state-analytics/pyriksprot_tagger for the given tag:
  - Run the recipe extract-speeches-to-feather:
    make extract-speeches-to-feather
See the appendix below if you instead want to use snakemake for updating the repository and tagging.
Install pyriksprot tagger
Easiest way is to clone the GitHub repository:
cd /path/to/any/folder
git clone git@github.com:welfare-state-analytics/pyriksprot_tagger
cd pyriksprot_tagger
pyenv local 3.11.3
poetry shell
pip install torch
poetry install
You can also install the tagger in an isolated Python virtual environment. This method requires you to manually download certain scripts depending on your specific workflow.
Install Sparv and Stanza models
Use the stanza-models.sh script to download Stanza files. Note that the target folder specified in the script must be the same as the folder specified by the STANZA_DATADIR environment variable (in .env).
Optional: Use penelope/scripts/install-spacy-models.sh to install relevant spaCy models.
Update configuration
Update or create a dotenv file (.env) in the pyriksprot_tagger folder with the following variable definitions:
Environment variable | Description |
---|---|
RIKSPROT_DATA_FOLDER | Parent folder (location) of Riksdagens corpus data folder |
RIKSPROT_REPOSITORY_URL | https://github.com/welfare-state-analytics/riksdagen-corpus.git |
RIKSPROT_REPOSITORY_TAG | Target corpus version. Must be a valid Git tag in the corpus repository |
SPARV_DATADIR | Sparv data folder |
STANZA_DATADIR | Stanza data folder |
RIKSPROT_DATA_FOLDER="/path/to/data/folder"
RIKSPROT_REPOSITORY_URL="https://github.com/welfare-state-analytics/riksdagen-corpus.git"
RIKSPROT_REPOSITORY_TAG="vx.y.z"
SPARV_DATADIR="/path/to/sparv_datadir"
STANZA_DATADIR="/path/to/stanza_datadir"
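Before starting a long tagging run it can be useful to verify that all variables listed above are defined. The following is a minimal sketch under stated assumptions: `parse_dotenv` and `missing_variables` are hypothetical helpers (not part of pyriksprot_tagger), and the parser only handles simple `KEY="value"` lines, not `export` prefixes, inline comments or multi-line values.

```python
REQUIRED_VARIABLES = (
    "RIKSPROT_DATA_FOLDER",
    "RIKSPROT_REPOSITORY_URL",
    "RIKSPROT_REPOSITORY_TAG",
    "SPARV_DATADIR",
    "STANZA_DATADIR",
)


def parse_dotenv(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines; blank lines and # comments are skipped."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_variables(values: dict[str, str]) -> list[str]:
    """Return the required variables that are absent or empty."""
    return [name for name in REQUIRED_VARIABLES if not values.get(name)]
```

A possible use: read the file with `open(".env").read()`, pass the text to `parse_dotenv`, and stop with an error message if `missing_variables` returns a non-empty list.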
Appendix
Set up a local copy of the riksdagen-corpus GitHub repository
If the riksdagen-corpus repository folder already exists, then do an update:
cd /path/to/git/repository
git pull
If the repository folder doesn't exist:
cd /path/to/parent-folder
git clone https://github.com/welfare-state-analytics/riksdagen-corpus.git
You need to check out the specific tag that you want to process:
cd /path/to/git/repository
git checkout vx.y.z
Make sure to update file timestamps to latest commit timestamp!
cd /path/to/pyriksprot-tagger
./pyriksprot_tagger/scripts/update-timestamps
Install pyriksprot-tagger from PyPI
Verify the current Python version (pyenv is recommended for easy switching between versions).
Create a new Python virtual environment (sandbox):
cd /some/folder
mkdir riksprot_tagging
cd riksprot_tagging
python -m venv .venv
source .venv/bin/activate
Install the pipeline and run setup script.
pip install pyriksprot_tagger
setup-pipeline
To tag protocols you first need to activate the installed environment, and then follow the steps below on how to tag protocols using snakemake.
cd /some/folder/riksprot_tagging
source .venv/bin/activate
Create or update the repository using snakemake (not recommended)
This is an alternative way of updating the corpus repository.
% cd /path/to/pyriksprot-tagger/folder
If you want to create a new clone of the repository:
% make full-clone-repository
If you want to update an existing repository:
% make full-pull-repository
If you want to save space and do a shallow clone:
% make shallow-update-repository
Update timestamp of repository work folder files to match last commit timestamp. Important! This is required if you use Snakemake when tagging:
% make update-repository-timestamps
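The timestamp update matters because Snakemake decides which outputs are out of date by comparing file modification times, and a fresh clone or checkout sets every file's modification time to the time of the checkout rather than the time of its last commit. The idea can be sketched in Python as follows. This is an illustration only, not the update-timestamps script shipped with the tagger, and the function names are hypothetical.

```python
import os
import subprocess


def last_commit_timestamp(repository_folder: str, relative_path: str) -> int:
    """Return the Unix time of the last commit that touched the given file.

    Raises ValueError for untracked files, where git prints nothing.
    """
    result = subprocess.run(
        ["git", "-C", repository_folder, "log", "-1", "--format=%ct", "--", relative_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


def set_modification_time(path: str, timestamp: int) -> None:
    """Set both access and modification time of a file to the given Unix time."""
    os.utime(path, (timestamp, timestamp))
```

Calling `git log` once per file is slow on a large corpus, so a real script would more likely collect all timestamps in a single pass over the history.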
How to annotate protocols using snakemake (not recommended)
Annotate using default settings:
make annotate
Annotate a single year (and set cpu count).
make annotate YEAR=1960 CPU_COUNT=1
Run the annotation in the background with make:
nohup make annotate PROCESSES_COUNT=4 >& run.log &
or call snakemake directly:
nohup poetry run snakemake --config -j4 --keep-going --keep-target-files &