Pipeline that tags pyriksprot Parla-Clarin XML files

Riksdagens Protokoll Part-Of-Speech Tagging (Parla-Clarin Workflow)

This package implements Stanza part-of-speech annotation of Riksdagens Protokoll Parla-Clarin XML files.
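As a rough illustration of what Stanza part-of-speech annotation involves, the sketch below tags a piece of Swedish text and serialises the result as tab-separated lines. It is not the package's own code: the function names `tag_text` and `to_tsv` and the TSV layout are hypothetical, and it assumes that Stanza is installed and that the Swedish model has been downloaded once with `stanza.download("sv")`.

```python
from typing import Iterable, List, Tuple

TaggedToken = Tuple[str, str, str]  # (token, lemma, universal POS tag)


def tag_text(text: str, lang: str = "sv") -> List[TaggedToken]:
    """Tag `text` with Stanza (hypothetical helper, not part of this package)."""
    import stanza  # imported lazily so that to_tsv() works without Stanza

    # Assumes the model for `lang` is already downloaded.
    nlp = stanza.Pipeline(lang=lang, processors="tokenize,pos,lemma")
    doc = nlp(text)
    return [
        (word.text, word.lemma, word.upos)
        for sentence in doc.sentences
        for word in sentence.words
    ]


def to_tsv(tokens: Iterable[TaggedToken]) -> str:
    """Serialise tagged tokens as tab-separated lines with a header row."""
    rows = ["token\tlemma\tpos"]
    rows.extend("\t".join(token) for token in tokens)
    return "\n".join(rows)
```

Calling `to_tsv(tag_text("Herr talman!"))` would then give one line per token; the actual output format written by the pipeline may differ.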

Prerequisites

A bash-enabled environment (Linux, or Git Bash on Windows)
Git
Python 3.8.5 or later
GNU make (install i)

Parla-Clarin to penelope pipeline

How to install

How to configure

How to set up data

Riksdagens corpus

Create a shallow clone (no history) of repository:

make init-repository

Sync shallow clone with changes on origin (Github):

make update-repository

Update repository timestamps

Update the modification time of the repository files. This is necessary because the pipeline uses the last commit date of each XML file to determine which files are outdated, whereas git clone sets the modification time of every file to the current time.

make update-repository-timestamps
or
scripts/git_update_mtime.sh path-to-repository
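The idea behind the timestamp update can be sketched in Python as follows. This is a minimal sketch and not the contents of `scripts/git_update_mtime.sh`: the function names are hypothetical, and it assumes that `git` is on the path and that the XML files are tracked in the repository.

```python
import os
import subprocess
from pathlib import Path
from typing import Optional


def last_commit_timestamp(repository: Path, filename: Path) -> Optional[int]:
    """Return the Unix time of the last commit touching `filename`, if any."""
    output = subprocess.run(
        ["git", "log", "-1", "--format=%ct", "--", str(filename)],
        cwd=repository,
        check=True,
        capture_output=True,
        text=True,
    ).stdout.strip()
    # git prints nothing for files that have never been committed.
    return int(output) if output else None


def set_mtime(filename: Path, timestamp: int) -> None:
    """Set both access and modification time of `filename`."""
    os.utime(filename, (timestamp, timestamp))


def update_timestamps(repository: Path, pattern: str = "**/*.xml") -> None:
    """Give each matching file the date of its last commit."""
    for path in repository.glob(pattern):
        timestamp = last_commit_timestamp(repository, path.relative_to(repository))
        if timestamp is not None:
            set_mtime(path, timestamp)
```

Running one `git log` per file is slow on a large corpus, so a real script would probably batch the lookups; the sketch favours readability.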

How to annotate speeches

make annotate
or
nohup poetry run snakemake -j4 --keep-going --keep-target-files &

Windows:

poetry shell
bash
nohup poetry run snakemake -j4 --keep-going --keep-target-files &

Run a specific year:

poetry shell
bash
nohup poetry run snakemake --config -j4 --keep-going --keep-target-files &

Install

(This workflow will be simplified)

Verify the current Python version (pyenv is recommended for switching easily between versions).

Create a new Python virtual environment (sandbox):

cd /some/folder
mkdir westac_parlaclarin_pipeline
cd westac_parlaclarin_pipeline
python -m venv .venv
source .venv/bin/activate

Install the pipeline and run the setup script:

pip install westac_parlaclarin_pipeline
setup-pipeline

Initialize local clone of Parla-CLARIN repository

Run PoS tagging

Move to sandbox and activate virtual environment:

cd /some/folder/westac_parlaclarin_pipeline
source .venv/bin/activate

Update repository:

make update-repository
make update-repository-timestamps

Update all (changed) annotations:

make annotate
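Which annotations count as "changed" follows from file modification times: a make-style tool such as Snakemake re-runs a step when the output is missing or older than its input. The sketch below shows that rule in isolation. The function names, the flat target folder layout and the `.zip` suffix are assumptions made for illustration, not the pipeline's actual naming.

```python
from pathlib import Path
from typing import List


def is_outdated(source: Path, target: Path) -> bool:
    """True if `target` is missing or older than `source`."""
    if not target.exists():
        return True
    return target.stat().st_mtime < source.stat().st_mtime


def outdated_sources(
    source_folder: Path, target_folder: Path, suffix: str = ".zip"
) -> List[Path]:
    """List XML files whose (hypothetically named) target needs re-tagging."""
    return sorted(
        source
        for source in source_folder.glob("**/*.xml")
        if is_outdated(source, target_folder / (source.stem + suffix))
    )
```

This is also why the repository timestamps matter: right after a fresh clone every XML file carries the clone time, so every annotation would look outdated.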

Update a single year (and set the CPU count):

make annotate YEAR=1960 CPU_COUNT=1

Configuration

work_folders: !work_folders &work_folders
  data_folder: /data/riksdagen_corpus_data

parla_clarin: !parla_clarin &parla_clarin
  repository_folder: /data/riksdagen_corpus_data/riksdagen-corpus
  repository_url: https://github.com/welfare-state-analytics/riksdagen-corpus.git
  repository_branch: main
  folder: /data/riksdagen_corpus_data/riksdagen-corpus/corpus

extract_speeches: !extract_speeches &extract_speeches
  folder: /data/riksdagen_corpus_data/riksdagen-corpus-exports/speech_xml
  template: speeches.cdata.xml
  extension: xml

word_frequency: !word_frequency &word_frequency
  <<: *work_folders
  filename: riksdagen-corpus-term-frequencies.pkl

dehyphen: !dehyphen &dehyphen
  <<: *work_folders
  whitelist_filename: dehyphen_whitelist.txt.gz
  whitelist_log_filename: dehyphen_whitelist_log.pkl
  unresolved_filename: dehyphen_unresolved.txt.gz

config: !config
    work_folders: *work_folders
    parla_clarin: *parla_clarin
    extract_speeches: *extract_speeches
    word_frequency: *word_frequency
    dehyphen: *dehyphen
    annotated_folder: /data/riksdagen_corpus_data/annotated
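The configuration relies on YAML anchors (`&work_folders`), aliases (`*work_folders`) and merge keys (`<<:`). A merge key copies the keys of the anchored mapping into the current section, and keys written in the section itself take precedence. The plain-Python sketch below mimics that for the `word_frequency` section. The `resolve` helper, and the idea that filenames are resolved relative to `data_folder`, are assumptions for illustration only.

```python
import posixpath
from typing import Dict

# Equivalent of the anchored `work_folders` mapping.
work_folders: Dict[str, str] = {"data_folder": "/data/riksdagen_corpus_data"}

# `<<: *work_folders` behaves like unpacking the anchored mapping first,
# so explicit keys in the section override merged ones.
word_frequency: Dict[str, str] = {
    **work_folders,
    "filename": "riksdagen-corpus-term-frequencies.pkl",
}


def resolve(section: Dict[str, str], key: str = "filename") -> str:
    """Join a section's file name with its data folder (hypothetical helper)."""
    return posixpath.join(section["data_folder"], section[key])


print(resolve(word_frequency))
```

The custom tags (`!work_folders`, `!config` and so on) are a separate mechanism: a YAML loader only accepts them if matching constructors have been registered, so a generic safe loader would be expected to reject this file.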

Release history

2023.12.2 (Dec 4, 2023)
2023.8.2 (Aug 28, 2023)
2023.4.4 (Aug 24, 2023)
2023.4.3 (May 16, 2023)
2023.4.2 (Apr 17, 2023)
2023.4.1 (Apr 13, 2023)
2023.3.1 (Mar 27, 2023)
2021.12.2 (Dec 16, 2021), this version

Download files

Source distribution: pyriksprot_tagger-2021.12.2.tar.gz (22.6 kB), uploaded Dec 16, 2021
Built distribution: pyriksprot_tagger-2021.12.2-py3-none-any.whl (27.4 kB), uploaded Dec 16, 2021, Python 3

Hashes for pyriksprot_tagger-2021.12.2.tar.gz
Algorithm	Hash digest
SHA256	`424727e5ab216fc097c1b050939f2194fc3acf392d9603d09c4982b00f4fcd0b`
MD5	`bda6eeff74c562bb3d8b5177e72ea0e4`
BLAKE2b-256	`22e20bf376085380576a629022b753bcb46d2e3bc9653297a5d2a3b599c1ecce`

Hashes for pyriksprot_tagger-2021.12.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`64d7560e2cb8dd6e4e4cb0d343eaa18c9ca970a479d943ccb61963179ded2ca7`
MD5	`40b98307a2e0aa2dabb5bd9a4d639198`
BLAKE2b-256	`5bf6295c9e6e7ead189e1e6dbde4ce9467828b6a4d023bd6aebad5fab485ec5b`