A package that extracts and structures information from the German Bundestag's open data API

Bundestag Protocol Extractor

Extract and structure data from the German Bundestag's parliamentary protocols using the official DIP API.

🚀 Quick Start

Installation

pip install bundestag-protocol-extractor

Command Line Usage

# Extract protocols from the 20th legislative period, limit to 5
bpe --period 20 --limit 5 --output-dir ./data

# Export to both CSV and JSON format
bpe --period 20 --limit 5 --format both

# Use a specific API key (optional, package includes a public key)
bpe --api-key YOUR_API_KEY --period 20 --limit 5

🔍 Overview

This package gives researchers, journalists, and political analysts access to German parliamentary protocols (Plenarprotokolle) in a structured format suitable for analysis. It extracts speeches, speaker metadata, topics, and related information from the Bundestag's official API.

✨ Features

  • Extract Protocols: Access Plenarprotokolle from all legislative periods
  • Structure Content: Extract individual speeches with rich metadata
  • Speaker Metadata: Get information about speakers (name, party, role)
  • Topic Analysis: Access topic information and related proceedings
  • Multiple Export Formats: Export to CSV and JSON for easy analysis
  • Automatic Rate Limiting: Robust handling of API limits with exponential backoff
  • Progress Tracking: Resume long-running extractions if interrupted
  • Flexible Configuration: Fine-tune extraction parameters based on your needs
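
The rate-limiting feature above mentions exponential backoff. As a rough illustration of the general technique (not the package's actual implementation), the sketch below retries a failing call with delays that double on each attempt. The function name `call_with_backoff` and all of its parameters are hypothetical.

```python
import random
import time


def call_with_backoff(func, max_retries=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call func(), retrying with exponentially growing delays on failure.

    Hypothetical helper for illustration; a real client would usually
    retry only on rate-limit or transient network errors.
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Delay doubles each attempt (1s, 2s, 4s, ...), capped at max_delay.
            delay = min(base_delay * 2 ** attempt, max_delay)
            # A little random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

Passing `sleep` as a parameter is only a convenience so the delays can be observed without actually waiting.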

📋 Detailed Usage

Command Line Interface

# Basic usage
bpe --period 20 --limit 5 --output-dir ./data

# List help and all available options
bpe --help

# Extract all protocols from the 20th legislative period (no limit)
bpe --period 20 --output-dir ./data

# Use XML parsing for more detailed extraction (default)
bpe --period 20 --use-xml

# Disable XML parsing (faster but less detailed)
bpe --period 20 --no-xml

Control Output Format

# Export to CSV (default)
bpe --period 20 --format csv

# Export to JSON
bpe --period 20 --format json

# Export to both formats
bpe --period 20 --format both

# Exclude full speech text to reduce file size
bpe --period 20 --exclude-speech-text

# Include full protocol text (large files)
bpe --period 20 --include-full-protocols

Logging Options

# Enable verbose output (INFO to console, DEBUG to log file)
bpe --period 20 --verbose

# Enable full debug logging (DEBUG to both console and log file)
bpe --period 20 --debug

# Quiet mode (WARNING to console, INFO to log file)
bpe --period 20 --quiet

# Specify custom log file
bpe --period 20 --log-file /path/to/custom/log/file.log
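
When using the Python API rather than the CLI, a similar split between console and log file levels can be set up with the standard `logging` module. This is a minimal sketch that mirrors the `--verbose` behaviour described above (INFO to console, DEBUG to file); the logger name `bpe_example` and the function `configure_logging` are assumptions, not part of the package.

```python
import logging


def configure_logging(log_file, console_level=logging.INFO,
                      file_level=logging.DEBUG):
    """Send messages to the console and to a log file at different levels."""
    logger = logging.getLogger("bpe_example")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)  # let the handlers do the filtering
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls

    console = logging.StreamHandler()
    console.setLevel(console_level)

    file_handler = logging.FileHandler(log_file, encoding="utf-8")
    file_handler.setLevel(file_level)

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for handler in (console, file_handler):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

Calling `configure_logging("run.log", console_level=logging.WARNING, file_level=logging.INFO)` would correspond to the quiet mode listed above.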

Progress Tracking & Resumption

# List available progress files
bpe --list-progress

# Resume from a specific progress file
bpe --resume /path/to/progress_file.json

# Resume from a specific protocol
bpe --resume-from "20/123"

# Skip first N protocols
bpe --offset 25
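
The idea behind resumable extraction can be sketched as follows: record each finished protocol in a small JSON file and skip those entries on the next run. The file layout (a `completed` list) and the function names here are hypothetical; the package's own progress files may be structured differently.

```python
import json
from pathlib import Path


def load_done(progress_path):
    """Return the set of protocol IDs already recorded as finished."""
    path = Path(progress_path)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8"))["completed"])


def process_all(protocol_ids, progress_path, handle):
    """Process each ID once, saving progress after every item."""
    done = load_done(progress_path)
    for pid in protocol_ids:
        if pid in done:
            continue  # finished in an earlier run
        handle(pid)
        done.add(pid)
        # Write after each item so an interruption loses at most one protocol.
        Path(progress_path).write_text(
            json.dumps({"completed": sorted(done)}), encoding="utf-8"
        )
```

Saving after every item trades a little I/O for the guarantee that a crash never repeats more than the protocol that was in flight.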

Python API

You can also use the package directly in your Python code:

from bundestag_protocol_extractor import BundestagExtractor

# Initialize the extractor (uses default public API key)
extractor = BundestagExtractor()

# Or with your own API key
# extractor = BundestagExtractor(api_key="YOUR_API_KEY")

# Fetch protocols for a specific legislative period (20th Bundestag)
protocols = extractor.get_protocols(period=20, limit=5)

# Export to CSV (creates separate files for protocols, speeches, etc.)
exported_files = extractor.export_to_csv(
    protocols,
    output_dir="./data",
    include_speech_text=True
)

# Export to JSON (creates a single JSON file with all data)
json_path = extractor.export_to_json(protocols, output_dir="./data")

📊 Data Structure

The extracted data is organized in a relational format with multiple CSV files:

Core Files

  1. protocols.csv: Basic protocol metadata (date, title, etc.)
  2. speeches.csv: Individual speeches with speaker references
  3. persons.csv: Speaker information (name, party, role)
  4. proceedings.csv: Related parliamentary proceedings
  5. speech_topics.csv: Topics associated with each speech

Detailed Files (XML-based)

  1. paragraphs.csv: Individual paragraphs for detailed text analysis
  2. comments.csv: Comments and interjections
  3. agenda_items.csv: Agenda items for each session
  4. toc.csv: Table of contents with document structure
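
Because the files are relational, a typical analysis joins speeches.csv to persons.csv. The sketch below counts speeches per party using only the standard library. The column names (`id` and `party` in persons.csv, `speaker_id` in speeches.csv) are assumptions for illustration; check the headers of your exported files and adjust them.

```python
import csv
from collections import Counter


def speeches_per_party(speeches_path, persons_path):
    """Count speeches per party by joining speeches to persons."""
    with open(persons_path, newline="", encoding="utf-8") as f:
        # Assumed columns in persons.csv: "id" and "party".
        party_by_person = {row["id"]: row["party"] for row in csv.DictReader(f)}

    counts = Counter()
    with open(speeches_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Assumed column in speeches.csv: "speaker_id", referencing "id".
            party = party_by_person.get(row["speaker_id"], "unknown")
            counts[party] += 1
    return counts
```

For example, `speeches_per_party("./data/speeches.csv", "./data/persons.csv").most_common(5)` would list the five parties with the most speeches, provided the assumed columns exist.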

🔑 API Key

The package ships with a public API key that has a limited rate allowance. For extensive use, request your own API key for the Bundestag DIP API:

Visit: Dokumentations- und Informationssystem für Parlamentsmaterialien (DIP) API Documentation
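
To keep a personal key out of source code, one option is to read it from an environment variable and fall back to the bundled public key when none is set. This is a minimal sketch; the variable name `BUNDESTAG_API_KEY` and the helper `resolve_api_key` are hypothetical and not something the package itself defines.

```python
import os


def resolve_api_key(explicit_key=None, env_var="BUNDESTAG_API_KEY"):
    """Return an explicit key, else the environment variable, else None."""
    if explicit_key:
        return explicit_key
    # An unset or empty variable is treated as "no personal key".
    return os.environ.get(env_var) or None


# Usage with the package (requires bundestag-protocol-extractor installed):
# from bundestag_protocol_extractor import BundestagExtractor
# key = resolve_api_key()
# extractor = BundestagExtractor(api_key=key) if key else BundestagExtractor()
```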

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
