Pipeline that transforms Parla-Clarin XML files
Parla-Clarin Workflow
This package annotates Parla-Clarin XML files with part-of-speech tags using Stanza.
Prerequisites
- Git
- Python 3.8.5 or later
- GNU make
- Poetry
Install
Clone this repository:
- cd a-project-directory-of-your-choosing
- git clone git@github.com:welfare-state-analytics/westac_parlaclarin_pipeline
- cd westac_parlaclarin_pipeline
Or install the Python package with Poetry:
- poetry init --python 3.8.5
- poetry add westac_parlaclarin_pipeline
Run annotation
Update repository:
make update-repository
make update-repository-timestamps
Update all (changed) annotations:
make annotate
Update a single year (and set the CPU count):
make annotate YEAR=1960 CPU_COUNT=1
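To annotate several years in a row, the make target above can be driven from a small script. The following is a minimal sketch, not part of the package: the helper names `build_annotate_command` and `annotate_years` are hypothetical, and it assumes that it is run from the cloned repository, where the Makefile lives.

```python
import subprocess
from typing import Iterable, List, Optional


def build_annotate_command(year: Optional[int] = None, cpu_count: Optional[int] = None) -> List[str]:
    """Build the `make annotate` argument list (hypothetical helper).

    YEAR and CPU_COUNT are the make variables shown in this README;
    each is added only when a value is given.
    """
    command = ["make", "annotate"]
    if year is not None:
        command.append(f"YEAR={year}")
    if cpu_count is not None:
        command.append(f"CPU_COUNT={cpu_count}")
    return command


def annotate_years(years: Iterable[int], cpu_count: int = 1, repository_dir: str = ".") -> None:
    """Run `make annotate` once per year, stopping at the first failure."""
    for year in years:
        subprocess.run(build_annotate_command(year, cpu_count), cwd=repository_dir, check=True)


# Example (not executed here): annotate_years(range(1960, 1963), cpu_count=1)
```

Running one year per invocation keeps a failure in one year from hiding behind the others, at the cost of restarting make each time.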
Configuration
work_folders: !work_folders &work_folders
    data_folder: /data/riksdagen_corpus_data

parla_clarin: !parla_clarin &parla_clarin
    repository_folder: /data/riksdagen_corpus_data/riksdagen-corpus
    repository_url: https://github.com/welfare-state-analytics/riksdagen-corpus.git
    repository_branch: dev
    folder: /data/riksdagen_corpus_data/riksdagen-corpus/corpus

extract_speeches: !extract_speeches &extract_speeches
    folder: /data/riksdagen_corpus_data/riksdagen-corpus-exports/speech_xml
    template: speeches.cdata.xml
    extension: xml

word_frequency: !word_frequency &word_frequency
    <<: *work_folders
    filename: riksdagen-corpus-term-frequencies.pkl

dehyphen: !dehyphen &dehyphen
    <<: *work_folders
    whitelist_filename: dehyphen_whitelist.txt.gz
    whitelist_log_filename: dehyphen_whitelist_log.pkl
    unresolved_filename: dehyphen_unresolved.txt.gz

config: !config
    work_folders: *work_folders
    parla_clarin: *parla_clarin
    extract_speeches: *extract_speeches
    word_frequency: *word_frequency
    dehyphen: *dehyphen
    annotated_folder: /data/riksdagen_corpus_data/annotated
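The configuration relies on YAML anchors (`&work_folders`) and merge keys (`<<: *work_folders`). A merge key copies the keys of the anchored mapping into the section that uses it, and keys written directly in the section take precedence. The sketch below mimics that effect with plain Python dictionaries. It only illustrates the merge semantics and is not how the package itself loads the file.

```python
# Anchored mapping: `work_folders: !work_folders &work_folders`
work_folders = {"data_folder": "/data/riksdagen_corpus_data"}

# `<<: *work_folders` followed by the section's own keys.
# Unpacking the base mapping first lets the section's own keys override it,
# which matches the precedence rule of YAML merge keys.
word_frequency = {
    **work_folders,
    "filename": "riksdagen-corpus-term-frequencies.pkl",
}

dehyphen = {
    **work_folders,
    "whitelist_filename": "dehyphen_whitelist.txt.gz",
    "whitelist_log_filename": "dehyphen_whitelist_log.pkl",
    "unresolved_filename": "dehyphen_unresolved.txt.gz",
}

# Both sections now carry the shared data_folder without repeating it.
print(word_frequency["data_folder"])
print(sorted(dehyphen))
```

Changing `data_folder` in `work_folders` therefore changes it for every section that merges the anchor, which is why the value is defined only once.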