Pipeline that tags pyriksprot Parla-CLARIN XML files

Project description

Riksdagens Protokoll Part-Of-Speech Tagging

This package implements part-of-speech tagging of Riksdagens Protokoll Parla-CLARIN XML files.

Update riksprot tagger system

If the pyriksprot_tagger repository folder already exists:

% cd "pyriksprot-tagger-folder"
% git pull

If the repository folder doesn't exist yet:

% cd "some-folder"
% git clone git@github.com:welfare-state-analytics/pyriksprot_tagger.git
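
The two cases above can be folded into a single helper. The following is a minimal sketch, not part of pyriksprot_tagger: the function name update_or_clone and its arguments are hypothetical, and it assumes git is available on the PATH.

```shell
# Hypothetical helper: pull if the target folder is already a git work tree,
# otherwise clone the repository into it.
update_or_clone() {
    repo_url="$1"
    target="$2"
    if [ -d "$target/.git" ]; then
        # Existing clone: fetch and merge the latest changes.
        git -C "$target" pull
    else
        # No clone yet: create one.
        git clone "$repo_url" "$target"
    fi
}

# Example call (the target folder name is an assumption):
# update_or_clone git@github.com:welfare-state-analytics/pyriksprot_tagger.git pyriksprot_tagger
```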

Update configuration

Update the configuration settings in "pyriksprot-tagger-folder"/.env:

Environment variable       Description
RIKSPROT_DATA_FOLDER       Parent folder (location) of Riksdagens corpus data folder
RIKSPROT_REPOSITORY_URL    https://github.com/welfare-state-analytics/riksdagen-corpus.git
RIKSPROT_REPOSITORY_TAG    Target corpus version. Must be a valid GitHub tag
SPARV_DATADIR              Sparv data folder
STANZA_DATADIR             Stanza data folder
OMP_NUM_THREADS            Number of threads to use

Example:

RIKSPROT_DATA_FOLDER="/data/riksdagen_corpus_data"
RIKSPROT_REPOSITORY_URL="https://github.com/welfare-state-analytics/riksdagen-corpus.git"
RIKSPROT_REPOSITORY_TAG="v0.4.5"
SPARV_DATADIR="/data/sparv"
STANZA_DATADIR="/data/sparv/models/stanza"
OMP_NUM_THREADS=10
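
Before starting a long tagging run it can help to confirm that every variable listed above is actually set. The sketch below is an illustration only: check_env is a hypothetical helper, not something the package provides, and it assumes the .env file uses shell-compatible KEY=value lines as in the example. Note that it sources the file, so it should only be pointed at a .env you trust.

```shell
# Variable names taken from the configuration table.
REQUIRED_VARS="RIKSPROT_DATA_FOLDER RIKSPROT_REPOSITORY_URL RIKSPROT_REPOSITORY_TAG SPARV_DATADIR STANZA_DATADIR OMP_NUM_THREADS"

# Hypothetical helper: returns 0 if all variables are set and non-empty,
# 1 if any is missing or empty, 2 if the file does not exist.
check_env() {
    env_file="${1:-./.env}"
    if [ ! -f "$env_file" ]; then
        echo "not found: $env_file" >&2
        return 2
    fi
    (
        # Subshell: nothing sourced here leaks into the calling shell.
        # Clear inherited values first so only the file's contents count.
        unset $REQUIRED_VARS
        . "$env_file"
        status=0
        for name in $REQUIRED_VARS; do
            # Indirect lookup of the variable named by $name.
            eval "value=\${$name:-}"
            if [ -z "$value" ]; then
                echo "missing or empty: $name" >&2
                status=1
            fi
        done
        exit "$status"
    )
}

# Example call:
# check_env "pyriksprot-tagger-folder/.env" && echo "configuration looks complete"
```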

Create or update Riksdagens Corpus data repository

% cd "pyriksprot-tagger-folder"
# If you want to create a new clone of the repository:
% make full-clone-repository
# If you want to update an existing repository:
% make full-pull-repository
# If you want to save space and do a shallow clone:
% make shallow-update-repository
# Update timestamp of repository work folder files to match last commit timestamp (important!):
% make update-repository-timestamps
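
A fresh clone or checkout stamps files with the checkout time rather than the commit time, which can mislead timestamp-based workflows such as snakemake; this is presumably why the last step is flagged as important. The sketch below illustrates the general technique only. It is not the implementation behind the make target: sync_mtimes_to_commits is a hypothetical name, it assumes GNU touch (for the "@epoch" date syntax) and file names without newlines or unusual characters, and it runs one git log per file, so it is slow on a large corpus.

```shell
# Hypothetical helper: set each tracked file's modification time to the
# committer timestamp of the last commit that touched it.
sync_mtimes_to_commits() {
    repo_dir="${1:-.}"
    (
        cd "$repo_dir" || exit 1
        git ls-files | while IFS= read -r path; do
            # %ct = committer date as a Unix timestamp.
            epoch="$(git log -1 --format=%ct -- "$path")"
            if [ -n "$epoch" ]; then
                touch -d "@$epoch" "$path"
            fi
        done
    )
}

# Example call (the path is an assumption based on the example .env):
# sync_mtimes_to_commits /data/riksdagen_corpus_data/riksdagen-corpus
```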

Update / tag a new version of RIKSPROT:

Prerequisites:

  • Pull latest version of welfare-state-analytics/pyriksprot_tagger
  • Update configuration (see above)

If you want to use snakemake:

  • Edit options (target name) in workflow/config/config.yml
  • Run make annotate (approx. 10 hours run time)

If you want to use the tag-it script (preferred, faster):

  • Run PYTHONPATH=. nohup ./tag-it.sh > tag-it.version.log &

Create metadata database:

  • Pull or clone latest version of welfare-state-analytics/pyriksprot
  • Update the configuration in pyriksprot/.env (specify the tag to use)
  • Run make metadata

Create speech corpus

  • Pull or clone latest version of welfare-state-analytics/pyriksprot
  • Update the configuration in pyriksprot/.env (specify the tag to use)
  • Run make extract-speeches-to-feather

How to annotate protocols using snakemake (not recommended)

  • Annotate using default settings.
make annotate
  • Update a single year (and set the CPU count).
make annotate YEAR=1960 CPU_COUNT=1
  • Run in the background, either via make or by calling snakemake directly:
$ nohup make annotate PROCESSES_COUNT=4 >& run.log &
or
$ nohup poetry run snakemake -j4 --keep-going --keep-target-files &
nohup poetry run snakemake --config -j4 --keep-going --keep-target-files &

Install from PyPI (not recommended)

Verify the current Python version (pyenv is recommended for easily switching between versions).

  • Create a new Python virtual environment (sandbox):
cd /some/folder
mkdir riksprot_tagging
cd riksprot_tagging
python -m venv .venv
source .venv/bin/activate
  • Install the pipeline and run setup script.
pip install pyriksprot_tagger
setup-pipeline

To tag protocols, first activate the installed environment, then follow the steps above on how to tag protocols using snakemake.

cd /some/folder/riksprot_tagging
source .venv/bin/activate

Download files

Source Distribution

pyriksprot_tagger-2023.4.2.tar.gz (23.4 kB)

Uploaded Source

Built Distribution

pyriksprot_tagger-2023.4.2-py3-none-any.whl (29.3 kB)

Uploaded Python 3

File details

Details for the file pyriksprot_tagger-2023.4.2.tar.gz.

File metadata

  • Download URL: pyriksprot_tagger-2023.4.2.tar.gz
  • Size: 23.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.4.2 CPython/3.11.1 Linux/5.4.0-144-generic

File hashes

Hashes for pyriksprot_tagger-2023.4.2.tar.gz
Algorithm      Hash digest
SHA256         11e42843f2d81ecfdefb9976e752f16789b845302d8dfc2c66414953bd128441
MD5            229302035ce53bf77ae70dc14a2c2f02
BLAKE2b-256    a5bf8a31318e53afe6b1dfba7ff7942186c4930225ad27ebd91d483adddecb4e

File details

Details for the file pyriksprot_tagger-2023.4.2-py3-none-any.whl.

File metadata

  • Download URL: pyriksprot_tagger-2023.4.2-py3-none-any.whl
  • Size: 29.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.4.2 CPython/3.11.1 Linux/5.4.0-144-generic

File hashes

Hashes for pyriksprot_tagger-2023.4.2-py3-none-any.whl
Algorithm      Hash digest
SHA256         a7cfe8363a3a62ee6a5a9df9c0bbfc5ad055c6bc1a0495345c4f84a1a40fcbd0
MD5            d5b00c02a1e0e27395dcbe79f8c609da
BLAKE2b-256    ff703aa2291baab6564c7c45bd328ab64c92e35ed17ece7cccd46b3f9c372f03
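
The digests listed above can be used to check a downloaded file before installing it. The following is a minimal sketch under stated assumptions: verify_sha256 is a hypothetical helper name, and it relies on sha256sum from GNU coreutils (on macOS, shasum -a 256 is the usual substitute).

```shell
# Hypothetical helper: compare a file's SHA256 digest with an expected value.
# Returns 0 on a match, 1 otherwise.
verify_sha256() {
    file="$1"
    expected="$2"
    # sha256sum prints "<digest>  <file name>"; keep only the digest.
    actual="$(sha256sum "$file" | cut -d ' ' -f 1)"
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file" >&2
        return 1
    fi
}

# Example calls, using the SHA256 values listed for each file:
# verify_sha256 pyriksprot_tagger-2023.4.2.tar.gz 11e42843f2d81ecfdefb9976e752f16789b845302d8dfc2c66414953bd128441
# verify_sha256 pyriksprot_tagger-2023.4.2-py3-none-any.whl a7cfe8363a3a62ee6a5a9df9c0bbfc5ad055c6bc1a0495345c4f84a1a40fcbd0
```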
