Package to extract parliamentary protocols from the German Bundestag in structured form.

Bundestag Protocol Extractor

Extract and structure data from the German Bundestag's parliamentary protocols using the official DIP API.

🚀 Quick Start

Installation

pip install bundestag-protocol-extractor

Command Line Usage

# Extract protocols from the 20th legislative period, limit to 5
bpe --period 20 --limit 5 --output-dir ./data

# Export to both CSV and JSON format
bpe --period 20 --limit 5 --format both

# Use a specific API key (optional, package includes a public key)
bpe --api-key YOUR_API_KEY --period 20 --limit 5

🔍 Overview

This package gives researchers, journalists, and political analysts access to German parliamentary protocols (Plenarprotokolle) in a structured format suitable for analysis. It extracts speeches, speaker metadata, topics, and related information from the Bundestag's official DIP API.

✨ Features

Extract Protocols: Access Plenarprotokolle from all legislative periods
Structure Content: Extract individual speeches with rich metadata
Speaker Metadata: Get information about speakers (name, party, role)
Topic Analysis: Access topic information and related proceedings
Multiple Export Formats: Export to CSV and JSON for easy analysis
Automatic Rate Limiting: Robust handling of API limits with exponential backoff
Progress Tracking: Resume long-running extractions if interrupted
Flexible Configuration: Fine-tune extraction parameters based on your needs
Multi-strategy Extraction: Tiered extraction approach with automatic fallbacks
Quality Tracking: Detailed extraction metadata for research transparency
XML Caching: Efficient storage and retrieval of previously downloaded documents
Pattern Recognition: Sophisticated text pattern matching for speech extraction
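
The exponential backoff mentioned under Automatic Rate Limiting can be pictured with a generic retry loop. The following is a minimal sketch of the technique, not the package's actual implementation; `RateLimitError`, `call_with_backoff`, and all of its parameters are hypothetical names chosen for illustration.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical error raised when the API signals 'too many requests'."""


def call_with_backoff(func, max_retries=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call func(), retrying on RateLimitError with exponentially growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retries exhausted, let the caller handle it
            # Delay doubles with each attempt (1s, 2s, 4s, ...), capped at max_delay
            delay = min(base_delay * 2 ** attempt, max_delay)
            # A little random jitter keeps parallel clients from retrying in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter exists only so the waiting behavior can be replaced in tests; in normal use the default `time.sleep` applies.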

📋 Detailed Usage

Command Line Interface

# Basic usage
bpe --period 20 --limit 5 --output-dir ./data

# List help and all available options
bpe --help

# Extract all protocols from the 20th legislative period
bpe --period 20 --output-dir ./data

# Enable XML caching for faster subsequent runs (default)
bpe --period 20 --enable-xml-cache

# Disable XML caching
bpe --period 20 --disable-xml-cache

# Specify a custom cache directory
bpe --period 20 --cache-dir /path/to/cache/dir

# Enable automatic repair of malformed XML (default)
bpe --period 20 --repair-xml

# Disable XML repair
bpe --period 20 --no-repair-xml

Control Output Format

# Export to CSV (default)
bpe --period 20 --format csv

# Export to JSON
bpe --period 20 --format json

# Export to both formats
bpe --period 20 --format both

# Exclude full speech text to reduce file size
bpe --period 20 --exclude-speech-text

# Include full protocol text (large files)
bpe --period 20 --include-full-protocols

Logging Options

# Enable verbose output (INFO to console, DEBUG to log file)
bpe --period 20 --verbose

# Enable full debug logging (DEBUG to both console and log file)
bpe --period 20 --debug

# Quiet mode (WARNING to console, INFO to log file)
bpe --period 20 --quiet

# Specify custom log file
bpe --period 20 --log-file /path/to/custom/log/file.log
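
When working through the Python API rather than the CLI, the level split described for --verbose (INFO on the console, DEBUG in the log file) can be reproduced with the standard `logging` module. This is a sketch under assumptions: the logger name `bundestag_protocol_extractor` and the helper `configure_verbose_logging` are hypothetical, and the package may set up its own logging differently.

```python
import logging


def configure_verbose_logging(log_file, logger_name="bundestag_protocol_extractor"):
    """Send INFO and above to the console and DEBUG and above to log_file."""
    logger = logging.getLogger(logger_name)  # logger name is an assumption
    logger.setLevel(logging.DEBUG)  # pass everything on; handlers do the filtering

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)

    file_handler = logging.FileHandler(log_file, encoding="utf-8")
    file_handler.setLevel(logging.DEBUG)

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    for handler in (console, file_handler):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```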

Progress Tracking & Resumption

# List available progress files
bpe --list-progress

# Resume from a specific progress file
bpe --resume /path/to/progress_file.json

# Resume from a specific protocol
bpe --resume-from "20/123"

# Skip first N protocols
bpe --offset 25
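
For batch jobs it can be convenient to assemble these invocations from Python. The sketch below uses only flags that appear in this README (`--period`, `--limit`, `--offset`, `--output-dir`, `--format`); the helper names `build_bpe_command` and `run_bpe` are hypothetical, and whether every flag combination is accepted should be confirmed with `bpe --help`.

```python
import subprocess


def build_bpe_command(period, limit=None, offset=None, output_dir=None, fmt=None):
    """Assemble a `bpe` argument list from the options shown in this README."""
    cmd = ["bpe", "--period", str(period)]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    if offset is not None:
        cmd += ["--offset", str(offset)]
    if output_dir is not None:
        cmd += ["--output-dir", str(output_dir)]
    if fmt is not None:
        if fmt not in ("csv", "json", "both"):
            raise ValueError(f"unsupported format: {fmt}")
        cmd += ["--format", fmt]
    return cmd


def run_bpe(**options):
    """Run the CLI; raises subprocess.CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_bpe_command(**options), check=True)
```

A call such as `run_bpe(period=20, limit=5, offset=25, output_dir="./data")` would then process one batch; passing the arguments as a list avoids shell quoting problems.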

Python API

You can also use the package directly in your Python code:

from bundestag_protocol_extractor import BundestagExtractor
import logging

# Initialize the extractor (uses default public API key)
extractor = BundestagExtractor()

# Or with your own API key and XML options
# extractor = BundestagExtractor(
#     api_key="YOUR_API_KEY",
#     enable_xml_cache=True,
#     cache_dir="./cache",
#     repair_xml=True
# )

# Fetch protocols for a specific legislative period (20th Bundestag)
protocols = extractor.get_protocols(period=20, limit=5)

# Export to CSV (creates separate files for protocols, speeches, etc.)
exported_files = extractor.export_to_csv(
    protocols,
    output_dir="./data",
    include_speech_text=True
)

# Export to JSON (creates a single JSON file with all data)
json_path = extractor.export_to_json(protocols, output_dir="./data")

📊 Data Structure

The extracted data is organized in a relational format with multiple CSV files:

Core Files

protocols.csv: Basic protocol metadata (date, title, etc.)
speeches.csv: Individual speeches with speaker references
persons.csv: Speaker information (name, party, role)
proceedings.csv: Related parliamentary proceedings
speech_topics.csv: Topics associated with each speech

Detailed Files (XML-based)

paragraphs.csv: Individual paragraphs for detailed text analysis
comments.csv: Comments and interjections
agenda_items.csv: Agenda items for each session
toc.csv: Table of contents with document structure
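
Because the files are relational, a typical first step is to join speeches with speaker information. The sketch below does this with the standard `csv` module on small in-memory stand-ins. All column names (`id`, `person_id`, `name`, `party`, and so on) and all sample rows are made-up placeholders, not the package's documented schema; check the header row of your exported files and adjust accordingly.

```python
import csv
import io

# Stand-ins for persons.csv and speeches.csv; column names and rows are assumptions.
PERSONS_CSV = """id,name,party
p1,Example Speaker A,Party X
p2,Example Speaker B,Party Y
"""

SPEECHES_CSV = """id,protocol_id,person_id,title
s1,20/1,p1,Opening remarks
s2,20/1,p2,Reply
s3,20/1,p1,Closing remarks
"""


def join_speeches_with_persons(speeches_file, persons_file):
    """Attach speaker name and party to each speech row via person_id."""
    # Index persons by their id so each speech lookup is a dict access
    persons = {row["id"]: row for row in csv.DictReader(persons_file)}
    joined = []
    for speech in csv.DictReader(speeches_file):
        person = persons.get(speech["person_id"], {})
        joined.append({
            **speech,
            "speaker_name": person.get("name", ""),
            "speaker_party": person.get("party", ""),
        })
    return joined


rows = join_speeches_with_persons(io.StringIO(SPEECHES_CSV), io.StringIO(PERSONS_CSV))
```

With real exports, the `io.StringIO` objects would be replaced by files opened with `open(path, newline="", encoding="utf-8")`.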

Extraction Quality Metadata

Each extracted speech includes quality metadata fields:

extraction_method: The method used to extract the speech text:
- xml: Extracted from structured XML (highest quality)
- pattern: Extracted using pattern matching from text
- page: Extracted from page references only (lower quality)
- none: No text extraction was possible
extraction_status: The status of the extraction:
- complete: Successfully extracted full text
- partial: Only partial text was extracted
- failed: Extraction failed
extraction_confidence: A confidence score from 0.0 to 1.0:
- 1.0: High confidence (XML extraction)
- 0.6-0.8: Medium confidence (pattern matching)
- 0.1-0.5: Low confidence (page-based extraction)
- 0.0: No confidence (extraction failed)

These fields allow researchers to filter speeches based on extraction quality for their analyses.
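
Such a filter might look like the following sketch. The field names `extraction_status` and `extraction_confidence` and the confidence ranges come from the list above; the helper names, default thresholds, and the sample rows are hypothetical.

```python
def confidence_tier(score):
    """Map a confidence score to the tiers listed above."""
    if score >= 1.0:
        return "high"
    if 0.6 <= score <= 0.8:
        return "medium"
    if 0.1 <= score <= 0.5:
        return "low"
    if score == 0.0:
        return "none"
    return "unlisted"  # value falls between the documented ranges


def filter_speeches(speeches, min_confidence=0.6, statuses=("complete",)):
    """Keep speeches at or above min_confidence whose status is in statuses."""
    return [
        s for s in speeches
        # float() also handles values read as strings from a CSV file
        if float(s["extraction_confidence"]) >= min_confidence
        and s["extraction_status"] in statuses
    ]


# Made-up rows for illustration only
sample = [
    {"id": "s1", "extraction_method": "xml", "extraction_status": "complete", "extraction_confidence": "1.0"},
    {"id": "s2", "extraction_method": "pattern", "extraction_status": "partial", "extraction_confidence": "0.7"},
    {"id": "s3", "extraction_method": "page", "extraction_status": "complete", "extraction_confidence": "0.3"},
    {"id": "s4", "extraction_method": "none", "extraction_status": "failed", "extraction_confidence": "0.0"},
]
reliable = filter_speeches(sample)
```

Here `reliable` keeps only the XML-extracted row; loosening `statuses` to `("complete", "partial")` would also admit the pattern-matched one.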

Data Science Integration

The package includes tools specifically designed for data science workflows:

Quality Reports: Comprehensive reports on extraction quality with detailed statistics
Interactive Visualizations: Charts and graphs for analyzing extraction quality
Pandas Integration: Helper functions for working with extracted data in pandas
Jupyter Notebook Example: Example workflow for analyzing extraction data

Each export automatically includes:

HTML Quality Report: Interactive report with visualizations
JSON Quality Data: Machine-readable quality statistics
Quality Visualizations: PNG charts showing extraction distributions
Helper Columns: Boolean fields for easy filtering in pandas

See the examples/data_science_workflow.ipynb notebook for a detailed demonstration of how to work with the data in a research context.

🔑 API Key

The package ships with a public API key that has a limited request allowance. For extensive use, request your own key for the Bundestag DIP API:

Visit the API documentation of the Dokumentations- und Informationssystem für Parlamentsmaterialien (DIP).

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Development Setup

Clone the repository:

git clone https://github.com/maxboettinger/bundestag-protocol-extractor.git
cd bundestag-protocol-extractor

Create a conda environment:

conda env create -f environment.yml
conda activate bundestag-protocol-extractor

Install the package in development mode:
```
pip install -e ".[dev]"
```
Run tests:
```
pytest
```

Making a Release

The package includes a comprehensive release script that verifies package integrity:

python scripts/release.py [major|minor|patch]

The release process:

Runs all tests including import verification
Builds distribution packages
Verifies the built package in a virtual environment
Ensures critical modules like utils.logging are included
Uploads to PyPI (with confirmation)

📄 License

This project is licensed under the MIT License; see the LICENSE file for details.

Release history

0.3.3 (this version): Mar 25, 2025
0.3.1: Mar 25, 2025
0.3.0: Mar 25, 2025
0.1.12: Mar 25, 2025
0.1.5: Mar 23, 2025
0.1.4: Mar 19, 2025
0.1.3: Mar 19, 2025
0.1.2: Mar 17, 2025

Download files

Source Distribution

bundestag_protocol_extractor-0.3.3.tar.gz (348.6 kB)

Uploaded Mar 25, 2025 Source

Built Distribution

bundestag_protocol_extractor-0.3.3-py3-none-any.whl (66.9 kB)

Uploaded Mar 25, 2025 Python 3

File details

Details for the file bundestag_protocol_extractor-0.3.3.tar.gz.

File metadata

Download URL: bundestag_protocol_extractor-0.3.3.tar.gz
Upload date: Mar 25, 2025
Size: 348.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for bundestag_protocol_extractor-0.3.3.tar.gz
Algorithm	Hash digest
SHA256	`906da93198d3606b259ba00d2569b2c84d6e8155d5034c0a1aab11631f5aa4b8`
MD5	`f9d2291f06d1a44dbc244f08d886fc25`
BLAKE2b-256	`96d4f62cced32b97331fb994dfe82819e1661cb869e140c1b088d61a057a15d2`

File details

Details for the file bundestag_protocol_extractor-0.3.3-py3-none-any.whl.

File metadata

Download URL: bundestag_protocol_extractor-0.3.3-py3-none-any.whl
Upload date: Mar 25, 2025
Size: 66.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for bundestag_protocol_extractor-0.3.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`411c23ff60dddb10d985e8591ec1ea2a15db9e792efbf1552b694da6ce6a00bd`
MD5	`1b7144e9f9aad31c8c1bbe562077dc7c`
BLAKE2b-256	`50a95a7382cef3553dd160b38910456e100cde0d918180676ee874fd15c0b89d`
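
The published digests can be checked against a downloaded file with the standard `hashlib` module. This is a minimal sketch; the helper names `sha256_of_file` and `verify_download` are hypothetical, and the file is assumed to sit in the current directory under its original name.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b""
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Return True if the file's SHA-256 digest matches the published value."""
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

For the wheel listed above, the call would be `verify_download("bundestag_protocol_extractor-0.3.3-py3-none-any.whl", "411c23ff60dddb10d985e8591ec1ea2a15db9e792efbf1552b694da6ce6a00bd")`.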
