Swedish National Library newspaper trend scraper
KB-Trend
A CLI tool for scraping historical newspaper trend data from the Swedish National Library (Kungliga biblioteket).
Features
- Modern CLI built with Typer and Rich for a great user experience
- Flexible keyword loading from .txt, .csv, or .tsv files
- Proximity search support with customizable markers
- Configuration validation via SHA256 hashing to prevent data corruption
- SQLite database with SQLAlchemy ORM for reliable data storage
- Type-safe with full type hints and mypy validation
Installation
Using pipx (Recommended)
pipx install kb-trend
Using pip
python -m pip install kb-trend
Development Installation
git clone https://github.com/matjoha/kb-trend
cd kb-trend
pip install -e ".[dev]"
Quick Start
1. Initialize Configuration
Run the interactive setup wizard:
kb-trend init
Or use non-interactive mode with defaults:
kb-trend init --non-interactive
This creates:
- settings.yaml: Configuration file
- kb_trend.sqlite3: SQLite database
- Wildcard query for baseline measurements
2. Add Keywords
Load keywords from a file:
# From plain text file (one keyword per line)
kb-trend add-keywords keywords.txt
# From CSV file
kb-trend add-keywords keywords.csv
# From TSV file
kb-trend add-keywords keywords.tsv
Example CSV format:
title,gender,category
gosse,male,youth
flicka,female,youth
All columns are stored as metadata, and you specify which column is the keyword
in settings.yaml.
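To illustrate how a keyword column and the remaining metadata might be separated, here is a minimal sketch using only the standard library. The function name `load_keywords` and the record layout are hypothetical and are not KB-Trend's actual loader; only the column names and the `keyword_column` setting come from the example above.

```python
import csv
import io


def load_keywords(text: str, keyword_column: str = "title") -> list[dict]:
    """Parse CSV text into records holding a keyword plus all columns as metadata.

    Hypothetical helper for illustration; not part of the kb-trend package.
    """
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames is None or keyword_column not in reader.fieldnames:
        raise ValueError(f"column {keyword_column!r} not found in header")

    records = []
    for row in reader:
        # A short row yields None for missing fields, so fall back to "".
        keyword = (row[keyword_column] or "").strip()
        if not keyword:
            continue  # skip rows without a keyword
        records.append({"keyword": keyword, "metadata": dict(row)})
    return records


SAMPLE = "title,gender,category\ngosse,male,youth\nflicka,female,youth\n"

for record in load_keywords(SAMPLE):
    print(record["keyword"], record["metadata"])
```

Changing `keyword_column` to another header name selects a different column as the search term while keeping every column in the metadata.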
3. Run the Scraper
Execute the scraping queue:
kb-trend run
Options:
- --limit N: Process only N items
- --resume/--restart: Resume from last run or restart
- --config PATH: Use alternate config file
4. Calculate Relative Frequencies
Normalize counts against baseline:
kb-trend process
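The normalization idea can be sketched as a per-year division of keyword hits by baseline hits. This is an assumption about what the step computes, with the baseline taken to be the wildcard query created by init; the function name and the dict-based data layout are hypothetical and do not reflect the tool's database code.

```python
def relative_frequencies(
    counts: dict[int, int], baseline: dict[int, int]
) -> dict[int, float]:
    """Divide each year's hit count by that year's baseline count.

    Years with a missing or zero baseline are skipped rather than
    reported as zero, since no ratio can be computed for them.
    """
    result: dict[int, float] = {}
    for year, hits in counts.items():
        total = baseline.get(year, 0)
        if total > 0:
            result[year] = hits / total
    return result


keyword_hits = {1900: 5, 1901: 10, 1902: 3}
wildcard_hits = {1900: 100, 1901: 400}  # no baseline for 1902

print(relative_frequencies(keyword_hits, wildcard_hits))
```

Dividing by a baseline makes years comparable even when the volume of digitized newspaper text differs strongly between them.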
5. Check Status
View database statistics:
kb-trend status
Configuration
The settings.yaml file controls all aspects of the scraper:
db_path: kb_trend.sqlite3
min_year: 1820 # Optional: filter start year
max_year: 2020 # Optional: filter end year
journals: # List of newspapers
- "None" # "None" searches all journals
- "DAGENS NYHETER"
sleep_timer: 1.0 # Seconds between requests
request_timeout: 30 # HTTP timeout
keyword_column: "title" # Which CSV column is the keyword
marker_templates: # Empty = plain search
- "SÖKES"
- "PLATS"
- "ERHÅLLES"
proximity_distance: 5 # Proximity search window
Configuration Hash Validation
KB-Trend calculates a SHA256 hash of your configuration and stores it in the database. This prevents accidental data corruption if settings change after the database is created.
If you modify settings.yaml, you'll need to:
- Restore the original settings, or
- Create a new database with kb-trend init --force
Validate your configuration:
kb-trend validate
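As a rough illustration of configuration hashing, the sketch below serializes a settings dictionary to canonical JSON and hashes it with SHA256. Which fields KB-Trend actually includes in its hash, and how it serializes them, is not stated here, so `config_hash` and the canonical-JSON approach are assumptions.

```python
import hashlib
import json


def config_hash(settings: dict) -> str:
    """Return a SHA256 hex digest of a settings dict.

    Keys are sorted so that reordering the file does not change the hash;
    only a change in values does. Hypothetical helper for illustration.
    """
    canonical = json.dumps(
        settings, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


original = {"min_year": 1820, "max_year": 2020, "proximity_distance": 5}
reordered = {"proximity_distance": 5, "max_year": 2020, "min_year": 1820}
modified = {**original, "proximity_distance": 10}

stored = config_hash(original)
print(stored == config_hash(reordered))  # same values, different key order
print(stored == config_hash(modified))   # one value changed
```

Comparing a stored digest with a freshly computed one is enough to detect that settings changed, without needing to know which setting it was.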
Query Types
Plain Keyword Search
When marker_templates is empty:
Query: "gosse"
Proximity Search
When markers are configured:
Query: "gosse SÖKES"~5 OR "gosse PLATS"~5 OR "gosse ERHÅLLES"~5
This finds "gosse" within 5 words of the markers.
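The two query forms above can be reproduced with a small string-building function. This is a sketch under the assumption that each marker is simply appended to the keyword inside a quoted phrase; `build_query` is a hypothetical name, and the real marker templates may support more than plain words.

```python
def build_query(keyword: str, markers: list[str], distance: int) -> str:
    """Build a plain or proximity query string for one keyword.

    With no markers, the keyword is searched as a quoted phrase.
    With markers, one quoted proximity clause per marker is OR-ed together.
    """
    if not markers:
        return f'"{keyword}"'
    clauses = [f'"{keyword} {marker}"~{distance}' for marker in markers]
    return " OR ".join(clauses)


print(build_query("gosse", [], 5))
print(build_query("gosse", ["SÖKES", "PLATS", "ERHÅLLES"], 5))
```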
API
KB-Trend uses the new KB.se data API:
https://data.kb.se/search/?q=PHRASE&searchGranularity=part&from=YYYY-MM-DD&to=YYYY-MM-DD&isPartOf=JOURNAL
This replaces the old Selenium-based scraping of the tidningar.kb.se interface, providing:
- Faster, more reliable scraping
- JSON responses instead of HTML parsing
- No browser dependencies
- Better error handling
Database Schema
- metadata: Configuration hash, schema version
- query: Search queries with metadata from CSV
- journal: Newspaper definitions
- counts: Hit counts by year/query/journal
- queue: Processing queue with status tracking
CLI Commands
| Command | Description |
|---|---|
| kb-trend init | Run configuration wizard |
| kb-trend add-keywords <file> | Load keywords from file |
| kb-trend run | Execute scraping queue |
| kb-trend process | Calculate relative frequencies |
| kb-trend status | Show database statistics |
| kb-trend validate | Validate configuration hash |
| kb-trend reset | Reset queue to pending |
Development
Running Tests
# Run all tests with coverage
pytest
# Run with verbose output
pytest -v
# Run specific test file
pytest tests/test_keywords/test_loader.py
Type Checking
mypy src/kb_trend
Linting
ruff check src/kb_trend tests
Migration from Old Version
The original KB_TrendScraper used Selenium to scrape the tidningar.kb.se interface. This new version:
- Uses the official KB data API (faster, more reliable)
- Provides a proper CLI with subcommands
- Supports flexible keyword file formats
- Validates configuration to prevent errors
- Has comprehensive test coverage
No automatic migration is provided. To migrate:
- Export your old data if needed
- Run kb-trend init to create new configuration
- Load your keywords with kb-trend add-keywords
- Run the scraper
License
CC BY-NC 4.0
Credits
Based on the original KB_TrendScraper project, modernized with:
- Typer for CLI
- httpx for HTTP requests
- SQLAlchemy for database
- Pydantic for configuration validation
- pytest for comprehensive testing
Citing this tool
If you use KB-Trend in your research, please cite it as:
@software{johansson2025kbtrend,
author = {Johansson, Mathias},
title = {{KB-Trend: Swedish National Library newspaper trend scraper}},
year = {2025},
version = {1.0.0},
url = {https://github.com/DigitalHistory-Lund/kb-trend},
license = {CC-BY-NC-4.0}
}
Or in APA format:
Johansson, M. (2025). KB-Trend: Swedish National Library newspaper trend scraper (Version 1.0.0) [Computer software]. https://github.com/DigitalHistory-Lund/kb-trend