Pipeline that tags pyriksprot Parla-Clarin XML files
Riksdagens Protokoll Part-Of-Speech Tagging (Parla-Clarin Workflow)
This package implements Stanza part-of-speech annotation of Riksdagens Protokoll
Parla-Clarin XML files.
Prerequisites
- A bash-enabled environment (Linux, or Git Bash on Windows)
- Git
- Python 3.8.5 or later
- GNU make (install i)
Parla-Clarin to penelope pipeline
How to install
How to configure
How to set up data
Riksdagens corpus
Create a shallow clone (no history) of the repository:
make init-repository
Sync the shallow clone with changes on origin (GitHub):
make update-repository
Update repository timestamps
Update the modification date of the repository files. This is necessary because the pipeline uses the last commit date of each XML file to determine which files are outdated, whereas git clone sets each file's modification date to the current time.
$ make update-repository-timestamps
or
$ scripts/git_update_mtime.sh path-to-repository
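Why the timestamps matter can be illustrated with a small Python sketch. This is not the project's `scripts/git_update_mtime.sh`: the function names are hypothetical, and `is_outdated` is only a simplified model of how a file-based workflow tool compares a source file against its output. The one function that queries Git assumes `git` is available on the PATH.

```python
import os
import subprocess
from pathlib import Path


def last_commit_epoch(repository: Path, relative_path: str) -> int:
    """Return the committer date (Unix epoch) of the last commit touching a file.

    Raises ValueError if the file has no commits (git prints nothing).
    """
    result = subprocess.run(
        ["git", "-C", str(repository), "log", "-1", "--format=%ct", "--", relative_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return int(result.stdout.strip())


def set_mtime(path: Path, epoch: int) -> None:
    """Set both the access and modification time of a file to the given epoch."""
    os.utime(path, (epoch, epoch))


def is_outdated(source: Path, target: Path) -> bool:
    """Simplified rule: a target is outdated if it is missing or older than its source."""
    if not target.exists():
        return True
    return target.stat().st_mtime < source.stat().st_mtime
```

Directly after a clone, every XML file carries the clone time, so a check like `is_outdated` would flag every previously annotated file as stale. Resetting each file with `set_mtime(path, last_commit_epoch(repository, relative_path))` makes the comparison meaningful again.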
How to annotate speeches
make annotate
or
$ nohup poetry run snakemake -j4 --keep-going --keep-target-files &
Windows:
poetry shell
bash
nohup poetry run snakemake -j4 --keep-going --keep-target-files &
Run a specific year:
poetry shell
bash
nohup poetry run snakemake --config -j4 --keep-going --keep-target-files &
Install
(This workflow will be simplified)
Verify the current Python version (pyenv is recommended for easily switching between versions).
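One way to check the interpreter from within Python is sketched below. The minimum of 3.8.5 is taken from the prerequisites listed earlier; the helper name `meets_minimum` is hypothetical and not part of the package.

```python
import sys

MINIMUM_VERSION = (3, 8, 5)  # taken from the prerequisites in this README


def meets_minimum(version=None, minimum=MINIMUM_VERSION) -> bool:
    """Return True if a (major, minor, micro) version is at least the minimum."""
    if version is None:
        version = sys.version_info[:3]
    # Tuples compare element by element, so (3, 10, 0) >= (3, 8, 5) is True.
    return tuple(version[:3]) >= tuple(minimum)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if meets_minimum() else "too old"
    print(f"Python {current}: {status}")
```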
Create a new Python virtual environment (sandbox):
cd /some/folder
mkdir westac_parlaclarin_pipeline
cd westac_parlaclarin_pipeline
python -m venv .venv
source .venv/bin/activate
Install the pipeline and run the setup script:
pip install westac_parlaclarin_pipeline
setup-pipeline
Initialize local clone of Parla-CLARIN repository
Run PoS tagging
Move to sandbox and activate virtual environment:
cd /some/folder/westac_parlaclarin_pipeline
source .venv/bin/activate
Update repository:
make update-repository
make update-repository-timestamps
Update all (changed) annotations:
make annotate
Update a single year (and set cpu count):
make annotate YEAR=1960 CPU_COUNT=1
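To re-annotate several years in sequence, the `make annotate` call above can be wrapped in a short script. The sketch below is a convenience under assumptions, not part of the package: the function names are hypothetical, and it relies only on the `YEAR` and `CPU_COUNT` variables shown above. It should be run from the sandbox folder with the virtual environment active.

```python
import subprocess
from typing import Iterable, List


def annotate_command(year: int, cpu_count: int = 1) -> List[str]:
    """Build the argument list for annotating a single year."""
    return ["make", "annotate", f"YEAR={year}", f"CPU_COUNT={cpu_count}"]


def annotate_years(
    years: Iterable[int], cpu_count: int = 1, dry_run: bool = False
) -> List[List[str]]:
    """Run `make annotate` once per year, stopping at the first failure.

    With dry_run=True the commands are printed instead of executed.
    """
    commands = [annotate_command(year, cpu_count) for year in years]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True raises CalledProcessError if make exits with an error.
            subprocess.run(command, check=True)
    return commands
```

Calling `annotate_years(range(1960, 1963), cpu_count=4, dry_run=True)` prints the three commands without running them, which is a cheap way to confirm the year range before starting a long job.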
Configuration
work_folders: !work_folders &work_folders
  data_folder: /data/riksdagen_corpus_data

parla_clarin: !parla_clarin &parla_clarin
  repository_folder: /data/riksdagen_corpus_data/riksdagen-corpus
  repository_url: https://github.com/welfare-state-analytics/riksdagen-corpus.git
  repository_branch: main
  folder: /data/riksdagen_corpus_data/riksdagen-corpus/corpus

extract_speeches: !extract_speeches &extract_speeches
  folder: /data/riksdagen_corpus_data/riksdagen-corpus-exports/speech_xml
  template: speeches.cdata.xml
  extension: xml

word_frequency: !word_frequency &word_frequency
  <<: *work_folders
  filename: riksdagen-corpus-term-frequencies.pkl

dehyphen: !dehyphen &dehyphen
  <<: *work_folders
  whitelist_filename: dehyphen_whitelist.txt.gz
  whitelist_log_filename: dehyphen_whitelist_log.pkl
  unresolved_filename: dehyphen_unresolved.txt.gz

config: !config
  work_folders: *work_folders
  parla_clarin: *parla_clarin
  extract_speeches: *extract_speeches
  word_frequency: *word_frequency
  dehyphen: *dehyphen
  annotated_folder: /data/riksdagen_corpus_data/annotated
Download files
Hashes for pyriksprot_tagger-2021.12.2.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 424727e5ab216fc097c1b050939f2194fc3acf392d9603d09c4982b00f4fcd0b
MD5 | bda6eeff74c562bb3d8b5177e72ea0e4
BLAKE2b-256 | 22e20bf376085380576a629022b753bcb46d2e3bc9653297a5d2a3b599c1ecce
Hashes for pyriksprot_tagger-2021.12.2-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 64d7560e2cb8dd6e4e4cb0d343eaa18c9ca970a479d943ccb61963179ded2ca7
MD5 | 40b98307a2e0aa2dabb5bd9a4d639198
BLAKE2b-256 | 5bf6295c9e6e7ead189e1e6dbde4ce9467828b6a4d023bd6aebad5fab485ec5b