Pipeline that transforms Parla-Clarin XML files

Riksdagens Protokoll Part-Of-Speech Tagging (Parla-Clarin Workflow)

This package implements Stanza part-of-speech annotation of Riksdagens Protokoll Parla-Clarin XML files.
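
As a rough illustration of what Stanza part-of-speech annotation looks like, here is a minimal sketch. It is not the package's own code: the function names `build_swedish_pipeline` and `tag_text` are hypothetical, and the choice of the `tokenize,pos,lemma` processors is an assumption. It only relies on Stanza's documented `Pipeline`, `sentences`, `words`, `text`, `lemma` and `upos` attributes.

```python
from typing import Any, Callable, List, Tuple


def build_swedish_pipeline() -> Any:
    """Create a Stanza pipeline for Swedish (hypothetical helper).

    Requires `pip install stanza`; the Swedish models must be available
    locally (for example via `stanza.download("sv")`).
    """
    import stanza  # imported lazily so tag_text can be used without Stanza installed

    return stanza.Pipeline(lang="sv", processors="tokenize,pos,lemma")


def tag_text(nlp: Callable[[str], Any], text: str) -> List[Tuple[str, str, str]]:
    """Return (token, lemma, universal PoS tag) triples for every word in `text`."""
    doc = nlp(text)
    return [
        (word.text, word.lemma, word.upos)
        for sentence in doc.sentences
        for word in sentence.words
    ]
```

Typical use would be `tag_text(build_swedish_pipeline(), "Herr talman!")`. Because `tag_text` accepts any callable that returns a Stanza-like document, it can also be exercised with a stub in tests.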

Prerequisites

  • A bash-enabled environment (Linux, or Git Bash on Windows)
  • Git
  • Python 3.8.5 or later
  • GNU make (install i)

Parla-Clarin to penelope pipeline

How to install

How to configure

How to setup data

Riksdagens corpus

Create a shallow clone (no history) of repository:

make init-repository

Sync shallow clone with changes on origin (GitHub):

make update-repository

Update the modification dates of the repository files. This is necessary because the pipeline uses the last commit date of each XML file to determine which files are outdated, whereas git clone sets each file's modification time to the time of cloning.

$ make update-repository-timestamps
or
$ scripts/git_update_mtime.sh path-to-repository
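
The idea behind the timestamp update can be sketched in Python as follows. This is an illustration only, not the content of `scripts/git_update_mtime.sh`: the function names are hypothetical, and the "outdated" rule shown (source newer than target, or target missing) is an assumption about how the pipeline compares files.

```python
import os
import subprocess
from pathlib import Path


def last_commit_epoch(repository: Path, filename: Path) -> int:
    """Return the committer date (Unix time) of the last commit touching `filename`.

    Raises ValueError if the file has no commit history (git prints nothing).
    """
    result = subprocess.run(
        ["git", "-C", str(repository), "log", "-1", "--format=%ct", "--", str(filename)],
        check=True,
        capture_output=True,
        text=True,
    )
    return int(result.stdout.strip())


def set_mtime(filename: Path, epoch: int) -> None:
    """Set both access and modification time of `filename` to `epoch`."""
    os.utime(filename, (epoch, epoch))


def is_outdated(source: Path, target: Path) -> bool:
    """Assumed rule: a target is outdated if it is missing or older than its source."""
    return not target.exists() or source.stat().st_mtime > target.stat().st_mtime
```

Calling `set_mtime(path, last_commit_epoch(repo, path))` for each XML file gives every file a modification time that reflects its last commit rather than the time of cloning.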

How to annotate speeches

make annotate
or
$ nohup poetry run snakemake -j4 --keep-going --keep-target-files &

Windows:

poetry shell
bash
nohup poetry run snakemake -j4 --keep-going --keep-target-files &

Run a specific year:

poetry shell
bash
nohup poetry run snakemake --config -j4 --keep-going --keep-target-files &

Install

(This workflow will be simplified)

Verify the current Python version (pyenv is recommended for switching easily between versions).

Create a new Python virtual environment (sandbox):

cd /some/folder
mkdir westac_parlaclarin_pipeline
cd westac_parlaclarin_pipeline
python -m venv .venv
source .venv/bin/activate

Install the pipeline and run the setup script:

pip install westac_parlaclarin_pipeline
setup-pipeline

Initialize local clone of Parla-CLARIN repository

Run PoS tagging

Move to sandbox and activate virtual environment:

cd /some/folder/westac_parlaclarin_pipeline
source .venv/bin/activate

Update repository:

make update-repository
make update-repository-timestamps

Update all (changed) annotations:

make annotate

Update a single year (and set cpu count):

make annotate YEAR=1960 CPU_COUNT=1

Configuration

work_folders: !work_folders &work_folders
  data_folder: /data/riksdagen_corpus_data

parla_clarin: !parla_clarin &parla_clarin
  repository_folder: /data/riksdagen_corpus_data/riksdagen-corpus
  repository_url: https://github.com/welfare-state-analytics/riksdagen-corpus.git
  repository_branch: dev
  folder: /data/riksdagen_corpus_data/riksdagen-corpus/corpus

extract_speeches: !extract_speeches &extract_speeches
  folder: /data/riksdagen_corpus_data/riksdagen-corpus-exports/speech_xml
  template: speeches.cdata.xml
  extension: xml

word_frequency: !word_frequency &word_frequency
  <<: *work_folders
  filename: riksdagen-corpus-term-frequencies.pkl

dehyphen: !dehyphen &dehyphen
  <<: *work_folders
  whitelist_filename: dehyphen_whitelist.txt.gz
  whitelist_log_filename: dehyphen_whitelist_log.pkl
  unresolved_filename: dehyphen_unresolved.txt.gz

config: !config
    work_folders: *work_folders
    parla_clarin: *parla_clarin
    extract_speeches: *extract_speeches
    word_frequency: *word_frequency
    dehyphen: *dehyphen
    annotated_folder: /data/riksdagen_corpus_data/annotated
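
The `<<: *work_folders` entries above are YAML merge keys: they copy the keys of the `work_folders` mapping (here `data_folder`) into the `word_frequency` and `dehyphen` sections. The sketch below mirrors that behaviour with plain Python dictionaries. The `resolve` helper is hypothetical, and joining `data_folder` with a filename is an assumption about how the pipeline uses these values.

```python
from pathlib import PurePosixPath

# Mirrors the `work_folders` section of the configuration.
work_folders = {"data_folder": "/data/riksdagen_corpus_data"}

# Equivalent of `<<: *work_folders`: the shared keys are merged in first,
# and keys given explicitly in the section take precedence.
word_frequency = {
    **work_folders,
    "filename": "riksdagen-corpus-term-frequencies.pkl",
}
dehyphen = {
    **work_folders,
    "whitelist_filename": "dehyphen_whitelist.txt.gz",
    "whitelist_log_filename": "dehyphen_whitelist_log.pkl",
    "unresolved_filename": "dehyphen_unresolved.txt.gz",
}


def resolve(section: dict, key: str) -> PurePosixPath:
    """Hypothetical helper: join the section's data_folder with one of its filenames."""
    return PurePosixPath(section["data_folder"]) / section[key]
```

Under that assumption, `resolve(word_frequency, "filename")` points at the term-frequency file inside the shared data folder, so changing `data_folder` in one place relocates every derived file.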
