Swedish National Library newspaper trend scraper

Project description

KB-Trend

A CLI tool for scraping historical newspaper trend data from the Swedish National Library (Kungliga biblioteket).

Features

  • Modern CLI built with Typer and Rich for a great user experience
  • Flexible keyword loading from .txt, .csv, or .tsv files
  • Proximity search support with customizable markers
  • Configuration validation via SHA256 hashing to prevent data corruption
  • SQLite database with SQLAlchemy ORM for reliable data storage
  • Type-safe with full type hints and mypy validation

Installation

Using pipx (Recommended)

pipx install kb-trend

Using pip

python -m pip install kb-trend

Development Installation

git clone https://github.com/DigitalHistory-Lund/kb-trend
cd kb-trend
pip install -e ".[dev]"

Quick Start

1. Initialize Configuration

Run the interactive setup wizard:

kb-trend init

Or use non-interactive mode with defaults:

kb-trend init --non-interactive

This creates:

  • settings.yaml - Configuration file
  • kb_trend.sqlite3 - SQLite database
  • A wildcard query used for baseline measurements

2. Add Keywords

Load keywords from a file:

# From plain text file (one keyword per line)
kb-trend add-keywords keywords.txt

# From CSV file
kb-trend add-keywords keywords.csv

# From TSV file
kb-trend add-keywords keywords.tsv

Example CSV format:

title,gender,category
gosse,male,youth
flicka,female,youth

All columns are stored as metadata. The keyword_column setting in settings.yaml specifies which column holds the keyword.
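To illustrate how a keyword column can be separated from the rest of a row, here is a minimal sketch using only the standard library. The function name load_keywords and its return shape are assumptions made for illustration; they are not taken from KB-Trend's source.

```python
import csv
import io


def load_keywords(text, keyword_column):
    """Return (keyword, metadata) pairs from CSV text.

    Hypothetical helper: every column of the row is kept as metadata,
    and keyword_column selects which value is treated as the keyword.
    """
    pairs = []
    for row in csv.DictReader(io.StringIO(text)):
        keyword = row[keyword_column].strip()
        if not keyword:
            continue  # skip rows that have no keyword
        pairs.append((keyword, dict(row)))
    return pairs


# The example CSV from this README.
SAMPLE = "title,gender,category\ngosse,male,youth\nflicka,female,youth\n"
keywords = load_keywords(SAMPLE, keyword_column="title")

for keyword, metadata in keywords:
    print(keyword, metadata)
```

For a .tsv file the same approach would work by passing delimiter="\t" to csv.DictReader.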

3. Run the Scraper

Execute the scraping queue:

kb-trend run

Options:

  • --limit N - Process only N items
  • --resume/--restart - Resume from last run or restart
  • --config PATH - Use alternate config file

4. Calculate Relative Frequencies

Normalize counts against baseline:

kb-trend process
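As a rough sketch of what normalizing against a baseline means, the snippet below divides each year's keyword hits by the hits of the baseline query for the same year. The function name, the dictionary layout, and the numbers are illustrative assumptions, not KB-Trend's actual implementation or real data.

```python
def relative_frequencies(keyword_counts, baseline_counts):
    """Divide each year's keyword hits by that year's baseline hits.

    Hypothetical helper: both arguments map year -> hit count.
    Years with no baseline hits are left out to avoid dividing by zero.
    """
    result = {}
    for year, hits in keyword_counts.items():
        baseline = baseline_counts.get(year, 0)
        if baseline > 0:
            result[year] = hits / baseline
    return result


# Made-up counts, for illustration only.
keyword_hits = {1900: 5, 1901: 10, 1902: 3}
baseline_hits = {1900: 100, 1901: 400, 1902: 0}
freqs = relative_frequencies(keyword_hits, baseline_hits)
print(freqs)
```

Normalizing this way makes years comparable even when the amount of digitized newspaper text varies from year to year.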

5. Check Status

View database statistics:

kb-trend status

Configuration

The settings.yaml file controls all aspects of the scraper:

db_path: kb_trend.sqlite3
min_year: 1820                   # Optional: filter start year
max_year: 2020                   # Optional: filter end year
journals:                         # List of newspapers
  - "None"                        # "None" searches all journals
  - "DAGENS NYHETER"
sleep_timer: 1.0                  # Seconds between requests
request_timeout: 30               # HTTP timeout
keyword_column: "title"           # Which CSV column is the keyword
marker_templates:                 # Empty = plain search
  - "SÖKES"
  - "PLATS"
  - "ERHÅLLES"
proximity_distance: 5             # Proximity search window

Configuration Hash Validation

KB-Trend calculates a SHA256 hash of your configuration and stores it in the database. This prevents accidental data corruption if settings change after the database is created.

If you modify settings.yaml after the database is created, the hash no longer matches and you'll need to either:

  1. Restore the original settings, or
  2. Create a new database with kb-trend init --force

Validate your configuration:

kb-trend validate
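The idea behind the hash check can be sketched as follows. How KB-Trend actually serializes the settings before hashing is not documented here, so the canonical JSON form and the function name config_hash are assumptions.

```python
import hashlib
import json


def config_hash(settings):
    """Return a SHA256 hex digest of a settings dictionary.

    Assumption: settings are serialized as JSON with sorted keys, so the
    digest does not depend on the order in which keys were written.
    """
    canonical = json.dumps(
        settings, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


original = {"min_year": 1820, "max_year": 2020, "proximity_distance": 5}
stored = config_hash(original)  # what would be saved alongside the data

# Same values in a different key order give the same digest.
reordered = {"proximity_distance": 5, "max_year": 2020, "min_year": 1820}
print(config_hash(reordered) == stored)

# Any changed value gives a different digest, which is what validation detects.
changed = dict(original, proximity_distance=10)
print(config_hash(changed) == stored)
```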

Query Types

Plain Keyword Search

When marker_templates is empty:

Query: "gosse"

Proximity Search

When markers are configured:

Query: "gosse SÖKES"~5 OR "gosse PLATS"~5 OR "gosse ERHÅLLES"~5

This matches "gosse" within 5 words of any one of the markers. The window size comes from proximity_distance.
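The two query forms shown above can be reproduced with a short function. The name build_query is hypothetical; only the resulting query strings come from this README.

```python
def build_query(keyword, markers, distance):
    """Build a plain or proximity query string for one keyword.

    Hypothetical helper: with no markers the keyword is quoted as-is;
    otherwise one proximity clause per marker is joined with OR.
    """
    if not markers:
        return f'"{keyword}"'
    clauses = [f'"{keyword} {marker}"~{distance}' for marker in markers]
    return " OR ".join(clauses)


plain = build_query("gosse", [], 5)
proximity = build_query("gosse", ["SÖKES", "PLATS", "ERHÅLLES"], 5)
print(plain)
print(proximity)
```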

API

KB-Trend uses the new KB.se data API:

https://data.kb.se/search/?q=PHRASE&searchGranularity=part&from=YYYY-MM-DD&to=YYYY-MM-DD&isPartOf=JOURNAL

This replaces the old Selenium-based scraping of the tidningar.kb.se interface, providing:

  • Faster, more reliable scraping
  • JSON responses instead of HTML parsing
  • No browser dependencies
  • Better error handling
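A request URL following the pattern above could be assembled like this. The parameter names are the ones shown in the URL template; the function name build_search_url and the choice of one calendar year per request are assumptions. The sketch only builds the URL and does not contact the API.

```python
from urllib.parse import urlencode

BASE_URL = "https://data.kb.se/search/"


def build_search_url(phrase, year, journal=None):
    """Build a search URL covering one calendar year.

    Hypothetical helper: isPartOf is omitted when no journal is given,
    on the assumption that this searches all journals.
    """
    params = {
        "q": phrase,
        "searchGranularity": "part",
        "from": f"{year:04d}-01-01",
        "to": f"{year:04d}-12-31",
    }
    if journal is not None:
        params["isPartOf"] = journal
    # urlencode percent-encodes quotes, spaces, and non-ASCII characters.
    return f"{BASE_URL}?{urlencode(params)}"


url = build_search_url('"gosse SÖKES"~5', 1900, "DAGENS NYHETER")
print(url)
```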

Database Schema

  • metadata: Configuration hash, schema version
  • query: Search queries with metadata from CSV
  • journal: Newspaper definitions
  • counts: Hit counts by year/query/journal
  • queue: Processing queue with status tracking

CLI Commands

Command                        Description
kb-trend init                  Run configuration wizard
kb-trend add-keywords <file>   Load keywords from file
kb-trend run                   Execute scraping queue
kb-trend process               Calculate relative frequencies
kb-trend status                Show database statistics
kb-trend validate              Validate configuration hash
kb-trend reset                 Reset queue to pending

Development

Running Tests

# Run all tests with coverage
pytest

# Run with verbose output
pytest -v

# Run specific test file
pytest tests/test_keywords/test_loader.py

Type Checking

mypy src/kb_trend

Linting

ruff check src/kb_trend tests

Migration from Old Version

The original KB_TrendScraper used Selenium to scrape the tidningar.kb.se interface. This new version:

  1. Uses the official KB data API (faster, more reliable)
  2. Provides a proper CLI with subcommands
  3. Supports flexible keyword file formats
  4. Validates configuration to prevent errors
  5. Has comprehensive test coverage

No automatic migration is provided. To migrate:

  1. Export your old data if needed
  2. Run kb-trend init to create new configuration
  3. Load your keywords with kb-trend add-keywords
  4. Run the scraper

License

CC BY-NC 4.0

Credits

Based on the original KB_TrendScraper project, modernized with:

  • Typer for CLI
  • httpx for HTTP requests
  • SQLAlchemy for database
  • Pydantic for configuration validation
  • pytest for comprehensive testing

Citing this tool

If you use KB-Trend in your research, please cite it as:

@software{johansson2025kbtrend,
  author = {Johansson, Mathias},
  title = {{KB-Trend: Swedish National Library newspaper trend scraper}},
  year = {2025},
  version = {1.0.0},
  url = {https://github.com/DigitalHistory-Lund/kb-trend},
  license = {CC-BY-NC-4.0}
}

Or in APA format:

Johansson, M. (2025). KB-Trend: Swedish National Library newspaper trend scraper (Version 1.0.0) [Computer software]. https://github.com/DigitalHistory-Lund/kb-trend
