Python API for Riksdagens Protokoll

Python package for reading and tagging Riksdagens Protokoll

Batteries (tagger) not included.

Overview

This package is intended to cover the following use cases:

Extract "text documents" from the Parla-CLARIN XML files

Text can be extracted from the XML files at different levels of granularity (paragraph, utterance, speech, who, protocol). The text can be grouped (combined) into larger temporal blocks (year, lustrum, decade or custom periods). Within each of these blocks, the text can in turn be grouped by speaker attributes (who, party, gender).
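The two-level grouping described above can be illustrated with a small, self-contained sketch. It does not use pyriksprot; `temporal_key` and `group_texts` are hypothetical helpers, the sample segments are made up, and the alignment of lustrum and decade blocks to years divisible by 5 and 10 is an assumption.

```python
from collections import defaultdict


def temporal_key(year: int, period: str = "year") -> str:
    """Map a year to the label of the temporal block it falls in (hypothetical helper)."""
    if period == "year":
        return str(year)
    if period == "lustrum":  # assumption: five-year blocks starting at years divisible by 5
        start = year - year % 5
        return f"{start}-{start + 4}"
    if period == "decade":  # assumption: ten-year blocks starting at years divisible by 10
        start = year - year % 10
        return f"{start}-{start + 9}"
    raise ValueError(f"unknown period: {period}")


def group_texts(segments, period: str = "decade") -> dict:
    """Group (year, who, text) tuples by temporal block, then by speaker."""
    blocks = defaultdict(lambda: defaultdict(list))
    for year, who, text in segments:
        blocks[temporal_key(year, period)][who].append(text)
    return {
        block: {who: " ".join(texts) for who, texts in speakers.items()}
        for block, speakers in blocks.items()
    }


# Made-up sample data: (year, speaker id, text)
segments = [
    (1955, "speaker-a", "Herr talman!"),
    (1958, "speaker-a", "Jag yrkar bifall."),
    (1958, "speaker-b", "Jag instämmer."),
    (1961, "speaker-b", "Tack."),
]
grouped = group_texts(segments, period="decade")
```

Here `grouped` has one entry per decade, and each entry holds one combined document per speaker.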

The text extraction can be done using the riksprot2text utility, a command-line tool installed with the package, or in Python code using the API that this package exposes. The Python API exposes both streaming (SAX-based) methods and a domain model API (i.e. Python classes representing protocols, speeches and utterances).

Both the CLI and the API support dehyphenation using the method described in Anföranden: Annotated and Augmented Parliamentary Debates from Sweden, Stian Rødven Eide, 2020. The API also supports user-defined text transformations.
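To give a feel for what dehyphenation involves, here is a deliberately simplified sketch. It is not the method from the cited paper and not pyriksprot's implementation: it only rejoins words split across a line break and uses a small, made-up vocabulary to decide whether the hyphen should be kept. The function name and the fallback rule (join when unsure) are assumptions for illustration.

```python
import re

# A word fragment, a hyphen, a line break, then the rest of the word.
HYPHEN_BREAK = re.compile(r"(\w+)-\s*\n\s*(\w+)")


def dehyphen(text: str, vocabulary: set[str]) -> str:
    """Rejoin words hyphenated across line breaks (toy illustration)."""

    def merge(match: re.Match) -> str:
        left, right = match.group(1), match.group(2)
        joined = left + right
        hyphenated = f"{left}-{right}"
        if joined.lower() in vocabulary:
            return joined  # "riks-\ndagen" -> "riksdagen"
        if hyphenated.lower() in vocabulary:
            return hyphenated  # a genuine compound hyphen is kept
        return joined  # assumption: join when neither form is known

    return HYPHEN_BREAK.sub(merge, text)


vocabulary = {"riksdagen", "nord-sydlig"}  # made-up vocabulary
fixed = dehyphen("riks-\ndagen beslutade om en nord-\nsydlig linje", vocabulary)
```

A real implementation needs a far better source of evidence than a hand-written set, for example word frequencies from the corpus itself.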

Extract PoS-tagged versions of the Parla-CLARIN XML files

Part-of-speech tagged versions of the protocols can be extracted with the same granularity and aggregation as described above for the raw text. The returned documents are tab-separated files with fields for text, baseform and PoS tag (UPOS, XPOS). Note that the actual part-of-speech tagging is done using tools found in the pyriksprot_tagging repository.
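A tab-separated document of this kind can be read with the standard library alone. The sketch below is an illustration under assumptions: the header names (`text`, `baseform`, `upos`, `xpos`) and the sample rows are invented here, so check the column names in your own tagged files before reusing it.

```python
import csv
import io
from collections import Counter

# Made-up sample; real files may use other column names and richer XPOS tags.
SAMPLE = (
    "text\tbaseform\tupos\txpos\n"
    "Herr\therr\tNOUN\tNN\n"
    "talman\ttalman\tNOUN\tNN\n"
    "!\t!\tPUNCT\tMAD\n"
)


def read_tagged(stream) -> list[dict]:
    """Read a tab-separated tagged document into a list of row dicts."""
    # QUOTE_NONE: quotation marks are ordinary tokens, not CSV quoting.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


tokens = read_tagged(io.StringIO(SAMPLE))
upos_counts = Counter(token["upos"] for token in tokens)
lemmas = [token["baseform"] for token in tokens if token["upos"] != "PUNCT"]
```

To read from disk, pass a file opened with `open(path, encoding="utf-8", newline="")` instead of the `io.StringIO` object.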

Currently there are no open-source tagged versions of the corpus available. The tagging is done using Stanza with Swedish language models produced and made publicly available by Språkbanken Text.

Store extracted text

The extracted text can be stored as optionally compressed plain text files on disk, or in a ZIP-archive.
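The storage options can be sketched with Python's standard compression modules. This is an independent illustration, not pyriksprot's own writer: `store_documents`, the file suffixes and the sample documents are assumptions, and only the mode names mirror the `--mode` choices of riksprot2text.

```python
import bz2
import gzip
import lzma
import tempfile
import zipfile
from pathlib import Path

OPENERS = {"plain": open, "gzip": gzip.open, "bz2": bz2.open, "lzma": lzma.open}
SUFFIXES = {"plain": ".txt", "gzip": ".txt.gz", "bz2": ".txt.bz2", "lzma": ".txt.xz"}


def store_documents(documents: dict[str, str], target: str, mode: str = "plain") -> None:
    """Write named documents to a ZIP archive or to (compressed) files in a folder."""
    if mode == "zip":
        with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as archive:
            for name, text in documents.items():
                archive.writestr(f"{name}.txt", text)
        return
    folder = Path(target)
    folder.mkdir(parents=True, exist_ok=True)
    for name, text in documents.items():
        path = folder / f"{name}{SUFFIXES[mode]}"
        with OPENERS[mode](path, "wt", encoding="utf-8") as fp:
            fp.write(text)


# Made-up documents, written to a temporary folder and read back.
documents = {"1955_speaker-a": "Herr talman!", "1955_speaker-b": "Jag instämmer."}
with tempfile.TemporaryDirectory() as tmp:
    store_documents(documents, str(Path(tmp) / "corpus.zip"), mode="zip")
    with zipfile.ZipFile(Path(tmp) / "corpus.zip") as archive:
        zip_names = sorted(archive.namelist())

    store_documents(documents, str(Path(tmp) / "gz"), mode="gzip")
    with gzip.open(Path(tmp) / "gz" / "1955_speaker-a.txt.gz", "rt", encoding="utf-8") as fp:
        roundtrip = fp.read()
```

A single ZIP archive keeps a large number of small documents manageable, while per-file compression keeps each document individually streamable.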

Pre-requisites

Python >=3.11
A folder containing a clone of the Riksdagens Protokoll (parliamentary protocols) GitHub repository. Replace "tag" with the corpus version you want to use:

cd some-folder
git clone --branch "tag" --depth 1 https://github.com/welfare-state-analytics/riksdagen-corpus.git
cd riksdagen-corpus
git config core.quotepath off

Installation (Linux)

Create a new isolated virtual environment for pyriksprot:

mkdir /path/to/new/pyriksprot-folder
cd /path/to/new/pyriksprot-folder
python -m venv .venv

Activate the environment:

cd /path/to/new/pyriksprot-folder
source .venv/bin/activate

Install pyriksprot in the activated virtual environment:

pip install pyriksprot

CLI riksprot2text: Extract aggregated text corpus from Parla-CLARIN XML files

λ riksprot2text --help

Usage: riksprot2text [OPTIONS] SOURCE_FOLDER TARGET

Options:
  -m, --mode [plain|zip|gzip|bz2|lzma]
                                  Target type
  -t, --temporal-key TEXT         Temporal partition key(s)
  -y, --years TEXT                Years to include in output
  -g, --group-key TEXT            Partition key(s)
  -p, --processes INTEGER RANGE   Number of processes to use
  -l, --segment-level [protocol|speech|utterance|paragraph|who]
                                  Protocol extract segment level
  -e, --keep-order                Keep output in filename order (slower, multiproc)

  -s, --skip-size INTEGER RANGE   Skip blocks of char length less than
  -d, --dedent                    Remove indentation
  -k, --dehyphen                  Dehyphen text
  --help                          Show this message and exit.

λ metadata2db --help
Usage: metadata2db.py [OPTIONS] COMMAND [ARGS]...

  CLI tool to manage riksprot metadata

Options:
  --help  Show this message and exit.

Commands:
  columns
  database
  download
  filenames
  index

λ metadata2db.py database --help
Usage: metadata2db.py database [OPTIONS] TARGET

Options:
  --tag TEXT             Metadata version
  --source-folder TEXT
  --force                Force overwrite
  --load-index           Load utterance index
  --scripts-folder TEXT  Apply scripts in specified folder to DB. If not
                         specified the scripts are loaded from SQL-module.
  --skip-scripts         Skip loading SQL scripts
  --help                 Show this message and exit.


λ metadata2db index --help
Usage: metadata2db.py index [OPTIONS] CORPUS_FOLDER TARGET_FOLDER

Options:
  --help  Show this message and exit.

Examples CLI

Aggregate text per year grouped by speaker. Store the result in a single zip. Skip documents shorter than 50 characters.

riksprot2text /path/to/corpus output.zip -m zip -t year -l protocol -g who --skip-size 50

Aggregate text per decade grouped by speaker. Store result in a single zip. Remove indentations and hyphenations.

riksprot2text /path/to/corpus output.zip -m zip -t decade -l who -g who --dedent --dehyphen

Aggregate text using customized temporal periods and grouped by party.

riksprot2text /path/to/corpus output.zip -m zip -t "1920-1938,1939-1945,1946-1989,1990-2020" -l who -g party

Aggregate text per document and group by gender and party.

riksprot2text /path/to/corpus output.zip -m zip -t protocol -l who -g party -g gender

Aggregate text per year grouped by gender and party and include only 1946-1989.

riksprot2text /path/to/corpus output.zip -m zip -t year -l who -g party -g gender -y 1946-1989

Python API - Iterate XML protocols

Aggregate text per protocol grouped by party and gender for the years 1955-1965. Store the result in a single zip. Remove indentation.

import pyriksprot

# Assumption: the enums are importable from pyriksprot.interface in your installed version.
from pyriksprot.interface import GroupingKey, SegmentLevel, TemporalKey

target_filename: str = 'output.zip'
opts = {
    'source_folder': '/path/to/corpus',
    'target': target_filename,
    'target_type': 'files-in-zip',
    'segment_level': SegmentLevel.Who,
    'dedent': True,
    'dehyphen': False,
    'years': '1955-1965',
    'temporal_key': TemporalKey.Protocol,
    'group_keys': (GroupingKey.Party, GroupingKey.Gender),
}

pyriksprot.extract_corpus_text(**opts)

Iterate over protocol and speaker:

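from typing import Iterable

from pyriksprot import interface, iterators
from pyriksprot.interface import SegmentLevel  # assumption: enum location may differ between versions

# filenames: a list of paths to the Parla-CLARIN XML protocol files to read
items: Iterable[interface.ProtocolSegment] = iterators.XmlProtocolTextIterator(
    filenames=filenames, segment_level=SegmentLevel.Who, segment_skip_size=0, processes=4
)

for item in items:
    print(item.who, len(item.text))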

Iterate over protocol and speech, skip empty:

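from typing import Iterable

from pyriksprot import interface, iterators
from pyriksprot.interface import SegmentLevel  # assumption: enum location may differ between versions

# segment_skip_size=1 skips segments with no text at all
items: Iterable[interface.ProtocolSegment] = iterators.XmlProtocolTextIterator(
    filenames=filenames, segment_level=SegmentLevel.Speech, segment_skip_size=1, processes=4
)

for item in items:
    print(item.who, len(item.text))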
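The loops above only print each segment. A common next step is to filter and aggregate the stream. The sketch below does this with a stand-in class so that it runs without a corpus: `Segment` is a hypothetical substitute exposing only the `who` and `text` attributes used in the examples, and `skip_short` mimics the effect of a skip size on made-up data.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Segment:
    """Hypothetical stand-in for a protocol segment (only `who` and `text`)."""

    who: str
    text: str


def skip_short(items: Iterable[Segment], skip_size: int) -> Iterator[Segment]:
    """Yield only segments whose text is at least skip_size characters long."""
    for item in items:
        if len(item.text) >= skip_size:
            yield item


def chars_per_speaker(items: Iterable[Segment]) -> Counter:
    """Sum the text length per speaker."""
    totals: Counter = Counter()
    for item in items:
        totals[item.who] += len(item.text)
    return totals


# Made-up segments; with pyriksprot these would come from the iterator.
segments = [
    Segment("speaker-a", "Herr talman!"),
    Segment("speaker-b", ""),
    Segment("speaker-a", "Tack."),
]
totals = chars_per_speaker(skip_short(segments, skip_size=1))
```

Because both helpers work on any iterable, the same code can consume a streaming iterator without loading every segment into memory.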

Iterate over protocol and speech, apply preprocess function(s):

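from typing import Callable, Iterable

import ftfy  # pip install ftfy
import unidecode  # pip install unidecode

import pyriksprot
from pyriksprot import interface, iterators
from pyriksprot.interface import SegmentLevel  # assumption: enum location may differ between versions

# Chain the text transformations into a single callable
fix_text: Callable[[str], str] = pyriksprot.compose(
    [str.lower, pyriksprot.dedent, ftfy.fixes.fix_character_width, unidecode.unidecode]
)
items: Iterable[interface.ProtocolSegment] = iterators.XmlProtocolTextIterator(
    filenames=filenames, segment_level=SegmentLevel.Speech, segment_skip_size=1, processes=4, preprocessor=fix_text,
)

for item in items:
    print(item.who, len(item.text))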
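The preprocessor in the last example is a chain of plain `str -> str` functions. The sketch below shows how such a chain can be built with the standard library only. It is not pyriksprot's `compose`; in particular, applying the functions left to right is an assumption made here, and `collapse_whitespace` is a hypothetical transformation.

```python
import textwrap
from functools import reduce
from typing import Callable, Sequence


def compose(functions: Sequence[Callable[[str], str]]) -> Callable[[str], str]:
    """Chain text transformations, applied left to right (assumed order)."""

    def composed(text: str) -> str:
        return reduce(lambda value, function: function(value), functions, text)

    return composed


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace, including line breaks, with one space."""
    return " ".join(text.split())


fix_text = compose([str.lower, textwrap.dedent, collapse_whitespace])
result = fix_text("   Herr   TALMAN!\n   Jag yrkar bifall.")
```

Any function that takes a string and returns a string can be added to the list, which keeps each transformation small and separately testable.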
