Pipeline that transforms Parla-Clarin XML files
Project description
Riksdagens Protokoll Part-Of-Speech Tagging (Parla-Clarin Workflow)
This package implements Stanza part-of-speech annotation of Riksdagens Protokoll
Parla-Clarin XML files.
Prerequisites
- A bash-enabled environment (Linux or Git Bash on Windows)
- Git
- Python 3.8.5 or later
- GNU make (install i)
Parla-Clarin to penelope pipeline
How to install
How to configure
How to set up data
Riksdagens corpus
Create a shallow clone (no history) of the repository:
make init-repository
Sync shallow clone with changes on origin (Github):
make update-repository
Update repository timestamps
Set the modification time of each repository file to its last commit date. This is necessary because the pipeline uses the last commit date of each XML file to determine which files are outdated, whereas git clone sets the modification time to the current time.
$ make update-repository-timestamps
or
$ scripts/git_update_mtime.sh path-to-repository
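The idea behind the timestamp update can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the contents of scripts/git_update_mtime.sh: the helper names last_commit_epoch and set_mtime are hypothetical, and git is assumed to be on the PATH.

```python
import os
import subprocess


def last_commit_epoch(repo_dir: str, rel_path: str) -> int:
    """Return the Unix time of the last commit touching rel_path.

    Hypothetical helper: asks git for the committer date (%ct) of the
    most recent commit that changed the file.
    """
    result = subprocess.run(
        ["git", "log", "-1", "--format=%ct", "--", rel_path],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return int(result.stdout.strip())


def set_mtime(path: str, epoch: int) -> None:
    """Set the modification time of path, leaving its access time unchanged."""
    atime = os.stat(path).st_atime
    os.utime(path, (atime, epoch))
```

A caller would loop over the XML files in the clone and apply set_mtime(path, last_commit_epoch(repo_dir, rel_path)) to each one.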
How to annotate speeches
make annotate
or
$ nohup poetry run snakemake -j4 --keep-going --keep-target-files &
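For readers who prefer to start the run from Python, the following sketch launches the same command in its own session, which is roughly comparable to nohup with a trailing &. The function names build_command and run_detached and the log file name are assumptions for illustration; only the snakemake flags are taken from the command above.

```python
import subprocess


def build_command(jobs: int = 4) -> list:
    """Assemble the snakemake invocation shown above as an argument list."""
    return [
        "poetry", "run", "snakemake",
        f"-j{jobs}",
        "--keep-going",
        "--keep-target-files",
    ]


def run_detached(command, log_path="nohup.out"):
    """Start command in a new session and append its output to log_path.

    start_new_session only has an effect on POSIX systems; the child
    keeps its own copy of the log file handle after the parent closes it.
    """
    with open(log_path, "ab") as log:
        return subprocess.Popen(
            command,
            stdout=log,
            stderr=subprocess.STDOUT,
            stdin=subprocess.DEVNULL,
            start_new_session=True,
        )
```

Calling run_detached(build_command(4)) returns immediately with a Popen handle, and the output accumulates in the log file.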
Windows:
poetry shell
bash
nohup poetry run snakemake -j4 --keep-going --keep-target-files &
Run a specific year:
poetry shell
bash
nohup poetry run snakemake --config -j4 --keep-going --keep-target-files &
Install
(This workflow will be simplified)
Verify the current Python version (pyenv is recommended for easy switching between versions).
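A quick way to check the interpreter against the minimum version named in the prerequisites is sketched below. The function name python_is_recent_enough is hypothetical, and the check only compares version numbers.

```python
import sys

# Minimum version taken from the prerequisites list.
MINIMUM = (3, 8, 5)


def python_is_recent_enough(version=None, minimum=MINIMUM) -> bool:
    """Compare a (major, minor, micro) tuple against the minimum version."""
    if version is None:
        version = sys.version_info[:3]
    return tuple(version) >= tuple(minimum)


if __name__ == "__main__":
    status = "OK" if python_is_recent_enough() else "too old"
    print(sys.version.split()[0], status)
```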
Create a new Python virtual environment (sandbox):
cd /some/folder
mkdir westac_parlaclarin_pipeline
cd westac_parlaclarin_pipeline
python -m venv .venv
source .venv/bin/activate
Install the pipeline and run the setup script:
pip install westac_parlaclarin_pipeline
setup-pipeline
Initialize local clone of Parla-CLARIN repository
Run PoS tagging
Move to sandbox and activate virtual environment:
cd /some/folder/westac_parlaclarin_pipeline
source .venv/bin/activate
Update repository:
make update-repository
make update-repository-timestamps
Update all (changed) annotations:
make annotate
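Whether an annotation counts as changed depends on file timestamps, which is why the repository timestamps are updated first. The principle can be sketched as a generic make-style comparison; this is an assumption about the general mechanism, not the pipeline's actual rule, and is_outdated is a hypothetical name.

```python
import os


def is_outdated(source: str, target: str) -> bool:
    """Return True if target is missing or older than source."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(source) > os.path.getmtime(target)
```

With this rule, a source file whose modification time was reset to "now" by git clone would always look newer than its annotation, so every file would be processed again.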
Update a single year (and set cpu count):
make annotate YEAR=1960 CPU_COUNT=1
Configuration
work_folders: !work_folders &work_folders
  data_folder: /data/riksdagen_corpus_data
parla_clarin: !parla_clarin &parla_clarin
  repository_folder: /data/riksdagen_corpus_data/riksdagen-corpus
  repository_url: https://github.com/welfare-state-analytics/riksdagen-corpus.git
  repository_branch: dev
  folder: /data/riksdagen_corpus_data/riksdagen-corpus/corpus
extract_speeches: !extract_speeches &extract_speeches
  folder: /data/riksdagen_corpus_data/riksdagen-corpus-exports/speech_xml
  template: speeches.cdata.xml
  extension: xml
word_frequency: !word_frequency &word_frequency
  <<: *work_folders
  filename: riksdagen-corpus-term-frequencies.pkl
dehyphen: !dehyphen &dehyphen
  <<: *work_folders
  whitelist_filename: dehyphen_whitelist.txt.gz
  whitelist_log_filename: dehyphen_whitelist_log.pkl
  unresolved_filename: dehyphen_unresolved.txt.gz
config: !config
  work_folders: *work_folders
  parla_clarin: *parla_clarin
  extract_speeches: *extract_speeches
  word_frequency: *word_frequency
  dehyphen: *dehyphen
  annotated_folder: /data/riksdagen_corpus_data/annotated
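The "<<: *work_folders" entries are YAML merge keys: a section that uses one inherits the keys of work_folders (here data_folder) in addition to its own. The effect can be mimicked with plain dictionaries, as in the sketch below. It copies values from the configuration above, does not parse YAML or handle the custom !tags, and the full_path helper (joining data_folder and a file name) is an assumption made for illustration only.

```python
import posixpath

work_folders = {"data_folder": "/data/riksdagen_corpus_data"}

# Equivalent of "<<: *work_folders" followed by the section's own keys.
word_frequency = {
    **work_folders,
    "filename": "riksdagen-corpus-term-frequencies.pkl",
}
dehyphen = {
    **work_folders,
    "whitelist_filename": "dehyphen_whitelist.txt.gz",
    "whitelist_log_filename": "dehyphen_whitelist_log.pkl",
    "unresolved_filename": "dehyphen_unresolved.txt.gz",
}


def full_path(section: dict, key: str) -> str:
    """Hypothetical helper: resolve a file name relative to data_folder."""
    return posixpath.join(section["data_folder"], section[key])


if __name__ == "__main__":
    print(full_path(word_frequency, "filename"))
```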
Download files
Source Distribution
Built Distribution
File details
Details for the file westac-parlaclarin-pipeline-2021.11.3.tar.gz.
File metadata
- Download URL: westac-parlaclarin-pipeline-2021.11.3.tar.gz
- Upload date:
- Size: 22.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.1.12 CPython/3.8.5 Linux/5.10.60.1-microsoft-standard-WSL2
File hashes
Algorithm | Hash digest
---|---
SHA256 | 7d2eac2f042a40999b64e330eaa1eec526754775de9bb7e6771032eacfb78df3
MD5 | 4819b85446ee51854c616fe13851d3f2
BLAKE2b-256 | bb72325fff0b606f9ea1cd5bd8c532e8acf42f91b5078e31fd908d6ce6215966
File details
Details for the file westac_parlaclarin_pipeline-2021.11.3-py3-none-any.whl.
File metadata
- Download URL: westac_parlaclarin_pipeline-2021.11.3-py3-none-any.whl
- Upload date:
- Size: 27.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.1.12 CPython/3.8.5 Linux/5.10.60.1-microsoft-standard-WSL2
File hashes
Algorithm | Hash digest
---|---
SHA256 | c17772fcae4387525d89b31b45a9ca1c7585f88dc8aaf34e3a59c699645c10d7
MD5 | 4c8917601b1a24974c87cfa6259b772b
BLAKE2b-256 | 99c2ec1cdba1b1abbc50f6daf996bce6cf695c94baea0089d130360f2219aa61