
Pipeline that tags pyriksprot Parla-Clarin XML files

Riksdagens Protokoll Part-Of-Speech Tagging (Parla-Clarin Workflow)

This package implements Stanza part-of-speech annotation of Riksdagens Protokoll Parla-Clarin XML files.

Prerequisites

  • A bash-enabled environment (Linux, or Git Bash on Windows)
  • Git
  • Python 3.8.5 or later
  • GNU make (install i)

Parla-Clarin to penelope pipeline

How to install

How to configure

How to set up data

Riksdagens corpus

Create a shallow clone (no history) of the repository:

make init-repository

Sync the shallow clone with changes on origin (GitHub):

make update-repository

Update the modification time of the repository files. This is necessary because the pipeline uses the last commit date of each XML file to determine which files are outdated, whereas git clone sets every file's modification time to the current time.

$ make update-repository-timestamps
or
$ scripts/git_update_mtime.sh path-to-repository
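
Why the timestamps matter can be illustrated with a minimal Python sketch of a modification-time staleness check. This is not the pipeline's actual code: `is_outdated`, `last_commit_epoch` and `apply_commit_time` are hypothetical helpers, and the sketch assumes that a tagged output is regenerated whenever its source XML file is newer than the output.

```python
import os
import subprocess


def is_outdated(source_path: str, target_path: str) -> bool:
    """Return True if the target is missing or older than the source."""
    if not os.path.exists(target_path):
        return True
    return os.path.getmtime(source_path) > os.path.getmtime(target_path)


def last_commit_epoch(repository: str, relative_path: str) -> int:
    """Ask git for the committer time (Unix epoch) of the last commit touching a file."""
    result = subprocess.run(
        ["git", "-C", repository, "log", "-1", "--format=%ct", "--", relative_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return int(result.stdout.strip())


def apply_commit_time(path: str, epoch: int) -> None:
    """Set both the access and modification time of a file to the given epoch."""
    os.utime(path, (epoch, epoch))
```

With a fresh clone, every source file carries the clone time, so a check like `is_outdated` would flag all outputs as stale. Resetting each file's modification time to its commit time avoids that.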

How to annotate speeches

make annotate
or
$ nohup poetry run snakemake -j4 --keep-going --keep-target-files &
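
If the run needs to be started from a script rather than a shell, the same invocation can be assembled in Python. The sketch below is an illustration only: `build_snakemake_command` and `run_detached` are hypothetical helper names, the flags are the ones shown in the command above, and writing the log to `nohup.out` is an assumption made to mimic `nohup`.

```python
import subprocess
from typing import List


def build_snakemake_command(jobs: int = 4) -> List[str]:
    """Assemble the snakemake invocation shown above for a given job count."""
    if jobs < 1:
        raise ValueError("jobs must be at least 1")
    return [
        "poetry", "run", "snakemake",
        f"-j{jobs}",
        "--keep-going",
        "--keep-target-files",
    ]


def run_detached(command: List[str], log_path: str = "nohup.out") -> subprocess.Popen:
    """Start the command in its own session and append its output to a log file."""
    with open(log_path, "ab") as log_file:
        return subprocess.Popen(
            command,
            stdout=log_file,
            stderr=subprocess.STDOUT,
            stdin=subprocess.DEVNULL,
            start_new_session=True,  # POSIX only; a rough stand-in for nohup
        )
```

Calling `run_detached(build_snakemake_command(jobs=4))` from the project folder would then start the workflow in the background.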

Windows:

poetry shell
bash
nohup poetry run snakemake -j4 --keep-going --keep-target-files &

Run a specific year:

poetry shell
bash
nohup poetry run snakemake --config -j4 --keep-going --keep-target-files &

Install

(This workflow will be simplified)

Verify the current Python version (pyenv is recommended for switching easily between versions).
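
A quick way to verify the version is a small Python check against the minimum listed in the prerequisites. This is a minimal sketch; `meets_minimum` and `MINIMUM_VERSION` are hypothetical names, not part of the package.

```python
import sys
from typing import Sequence, Tuple

# Minimum version taken from the prerequisites list.
MINIMUM_VERSION: Tuple[int, int, int] = (3, 8, 5)


def meets_minimum(
    version: Sequence[int],
    minimum: Tuple[int, int, int] = MINIMUM_VERSION,
) -> bool:
    """Compare the first three version components against the minimum."""
    return tuple(version[:3]) >= minimum


if __name__ == "__main__":
    current = sys.version_info
    status = "OK" if meets_minimum(current) else "too old"
    print(f"Python {current.major}.{current.minor}.{current.micro}: {status}")
```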

Create a new Python virtual environment (sandbox):

cd /some/folder
mkdir westac_parlaclarin_pipeline
cd westac_parlaclarin_pipeline
python -m venv .venv
source .venv/bin/activate

Install the pipeline and run the setup script:

pip install westac_parlaclarin_pipeline
setup-pipeline

Initialize local clone of Parla-CLARIN repository

Run PoS tagging

Move to the sandbox and activate the virtual environment:

cd /some/folder/westac_parlaclarin_pipeline
source .venv/bin/activate

Update repository:

make update-repository
make update-repository-timestamps

Update all (changed) annotations:

make annotate

Update a single year (and set the CPU count):

make annotate YEAR=1960 CPU_COUNT=1

Configuration

work_folders: !work_folders &work_folders
  data_folder: /data/riksdagen_corpus_data

parla_clarin: !parla_clarin &parla_clarin
  repository_folder: /data/riksdagen_corpus_data/riksdagen-corpus
  repository_url: https://github.com/welfare-state-analytics/riksdagen-corpus.git
  repository_branch: main
  folder: /data/riksdagen_corpus_data/riksdagen-corpus/corpus

extract_speeches: !extract_speeches &extract_speeches
  folder: /data/riksdagen_corpus_data/riksdagen-corpus-exports/speech_xml
  template: speeches.cdata.xml
  extension: xml

word_frequency: !word_frequency &word_frequency
  <<: *work_folders
  filename: riksdagen-corpus-term-frequencies.pkl

dehyphen: !dehyphen &dehyphen
  <<: *work_folders
  whitelist_filename: dehyphen_whitelist.txt.gz
  whitelist_log_filename: dehyphen_whitelist_log.pkl
  unresolved_filename: dehyphen_unresolved.txt.gz

config: !config
    work_folders: *work_folders
    parla_clarin: *parla_clarin
    extract_speeches: *extract_speeches
    word_frequency: *word_frequency
    dehyphen: *dehyphen
    annotated_folder: /data/riksdagen_corpus_data/annotated
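
The `<<: *work_folders` lines are YAML merge keys: they copy the entries of the anchored `work_folders` mapping into the current section, so `word_frequency` and `dehyphen` both inherit `data_folder`. The effect can be sketched with plain Python dictionaries. The sketch does not load the file itself (the custom tags such as `!config` need the package's own loader), and `resolve` is a hypothetical helper that assumes filenames are resolved relative to `data_folder`.

```python
import posixpath

# Values copied from the configuration above.
work_folders = {"data_folder": "/data/riksdagen_corpus_data"}

# `<<: *work_folders` merges the anchored mapping into the section;
# keys defined in the section itself take precedence over merged keys.
word_frequency = {
    **work_folders,
    "filename": "riksdagen-corpus-term-frequencies.pkl",
}
dehyphen = {
    **work_folders,
    "whitelist_filename": "dehyphen_whitelist.txt.gz",
    "whitelist_log_filename": "dehyphen_whitelist_log.pkl",
    "unresolved_filename": "dehyphen_unresolved.txt.gz",
}


def resolve(section: dict, key: str) -> str:
    """Hypothetical helper: join a section's data_folder with one of its filenames."""
    return posixpath.join(section["data_folder"], section[key])


print(resolve(word_frequency, "filename"))
```

Changing `data_folder` in `work_folders` therefore moves the term-frequency and dehyphenation files together, while the repository and export folders are set separately with absolute paths.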

