
Command Line Interface tool for scraping Matricula Online https://data.matricula-online.eu.

Matricula Online Scraper


Matricula Online is a website that hosts digitized parish registers from various regions across Europe. This CLI-based tool allows you to directly download data from it.

Installation

Make sure to meet the minimum required version of Python. You can install this tool via pip:

$ pip install -U matricula-online-scraper

Build from source

If you want the latest version, or simply prefer to build from source, you can clone the repository and install it manually, preferably via uv:

$ git clone git@github.com:lsg551/matricula-online-scraper.git && cd matricula-online-scraper
$ uv venv && uv sync

Alternatively, you can always fall back to pip:

$ pip install -r requirements.txt

Usage

You can use this tool to scrape the three primary entities from Matricula:

  1. Scanned parish registers (→ images of baptism, marriage, and death records)
  2. A list of all available parishes (→ location metadata)
  3. A list for each parish with metadata about its registers, including date ranges, register type, etc.

Most users likely want to scrape the scanned parish registers (1). The additional metadata (2,3) can be useful for other purposes such as automation, filtering or searching.

Note that this tool will not format or clean the data in any way. Instead, the data is saved as-is to a file. Some data might be poorly formatted or inconsistent.

Run the following command to see the available commands and options:

$ matricula-online-scraper --help

(1) Example: Download a scanned parish register (all images of a book)

Suppose you have opened a parish register on Matricula and want to download the entire book or a single page, for example the death register of Bautzen, Germany, starting from 1661. Copy the URL of the register while you are in the image viewer. It might look like https://data.matricula-online.eu/en/deutschland/dresden/bautzen/11/?pg=1.

Then pass the URL to the following command:

$ matricula-online-scraper parish fetch "https://data.matricula-online.eu/en/deutschland/dresden/bautzen/11/?pg=1"

Run matricula-online-scraper parish fetch --help to see all available options.
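
If you handle many register URLs, it can help to split them into their parts first. The following is a minimal sketch, not part of this tool: it assumes that register URLs always follow the five-segment layout seen in the examples of this README (language, country, region, parish, register), with an optional pg query parameter for the page. RegisterRef and parse_register_url are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import parse_qs, urlsplit


@dataclass(frozen=True)
class RegisterRef:
    """Parts of a register URL (layout inferred from the README examples)."""

    language: str
    country: str
    region: str
    parish: str
    register: str
    page: Optional[int]


def parse_register_url(url: str) -> RegisterRef:
    """Split a register URL into its path segments and optional page number."""
    parts = urlsplit(url)
    # Drop empty segments caused by the leading and trailing slashes.
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) != 5:
        raise ValueError(f"expected 5 path segments, got {len(segments)}: {url}")
    language, country, region, parish, register = segments
    # "pg" is the page parameter used by the image viewer; it may be absent.
    pages = parse_qs(parts.query).get("pg")
    page = int(pages[0]) if pages else None
    return RegisterRef(language, country, region, parish, register, page)


ref = parse_register_url(
    "https://data.matricula-online.eu/en/deutschland/dresden/bautzen/11/?pg=1"
)
print(ref.parish, ref.register, ref.page)
```

A parish URL (four segments) raises ValueError here, which makes it easy to tell the two kinds of URL apart before calling the scraper.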

(2) Example: Download a huge list of all available parishes on Matricula

$ matricula-online-scraper parish list

This command will fetch all parishes from Matricula Online, effectively scraping the entire "Fonds" page. The resulting data looks like:

country    , region                          , name                 , url                                                                          , longitude         , latitude
Deutschland, "Passau, rk. Bistum"            , Arbing-bei-Neuoetting, https://data.matricula-online.eu/en/deutschland/passau/arbing-bei-neuoetting/, 12.7081934381511  , 48.32953342002908
Österreich , Oberösterreich: Rk. Diözese Linz, Eberschwang          , https://data.matricula-online.eu/en/oesterreich/oberoesterreich/eberschwang/ , 13.5620985        , 48.15550995
Polen      , "Breslau/Wroclaw, Staatsarchiv" , Hermsdorf            , https://data.matricula-online.eu/en/polen/breslau/hermsdorf/                 , 15.642741683666767, 50.84699257482722

It may take a few minutes to complete and will yield a few thousand rows. Each url value leads to the main page of a parish and can be piped into the next command (3) to fetch metadata about the parish's registers.
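
To work with this list outside the shell, the rows can be read with Python's csv module. This is a rough sketch that uses the three sample rows above as input. The column padding is copied from the sample as printed here and is probably not present in the real file, so the whitespace stripping is purely defensive. read_parishes and in_region are hypothetical helper names.

```python
import csv

SAMPLE = """\
country    , region                          , name                 , url                                                                          , longitude         , latitude
Deutschland, "Passau, rk. Bistum"            , Arbing-bei-Neuoetting, https://data.matricula-online.eu/en/deutschland/passau/arbing-bei-neuoetting/, 12.7081934381511  , 48.32953342002908
Österreich , Oberösterreich: Rk. Diözese Linz, Eberschwang          , https://data.matricula-online.eu/en/oesterreich/oberoesterreich/eberschwang/ , 13.5620985        , 48.15550995
Polen      , "Breslau/Wroclaw, Staatsarchiv" , Hermsdorf            , https://data.matricula-online.eu/en/polen/breslau/hermsdorf/                 , 15.642741683666767, 50.84699257482722
"""


def read_parishes(lines):
    """Parse parish rows into dicts, stripping padding around every value."""
    # skipinitialspace lets a quoted field start after ", " as in the sample.
    reader = csv.reader(lines, skipinitialspace=True)
    header = [h.strip() for h in next(reader)]
    rows = []
    for record in reader:
        if not record:
            continue  # skip blank lines
        row = dict(zip(header, (value.strip() for value in record)))
        row["longitude"] = float(row["longitude"])
        row["latitude"] = float(row["latitude"])
        rows.append(row)
    return rows


def in_region(rows, needle):
    """Return rows whose region contains needle (case-insensitive)."""
    return [r for r in rows if needle.lower() in r["region"].lower()]


rows = read_parishes(SAMPLE.splitlines())
for row in in_region(rows, "Passau"):
    print(row["name"], row["url"])
```

For the real file, pass an open file object (for example one from gzip.open(..., "rt", encoding="utf-8", newline="")) instead of SAMPLE.splitlines().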

[!TIP] The data changes only rarely. A GitHub workflow runs this command automatically once a week and pushes the result to cache/parishes. You can therefore download the data without running the command and waiting for it yourself, which also takes some load off the Matricula servers.

Click here to download the entire CSV: 👉 parishes.csv 👈

Or with cURL:

curl -L https://github.com/lsg551/matricula-online-scraper/raw/cache/parishes/parishes.csv.gz | gunzip > parishes.csv

Run matricula-online-scraper parish list --help to see all available options.

(3) Example: Download a list about the registers of a single parish

This command downloads a list of all available registers for a single parish, including metadata such as the type of register, the date range, and the URL of the register itself.

$ matricula-online-scraper parish show https://data.matricula-online.eu/en/deutschland/muenster/muenster-st-martini/

A sample from the output (here JSON Lines) looks like this:

{
    "name": "Taufen",
    "url": "https://data.matricula-online.eu/en/deutschland/muenster/muenster-st-martini/KB001/",
    "accession_number": "KB001",
    "date": "1715 - 1800",
    "register_type": "Taufen",
    "date_range_start": "Jan. 1, 1715",
    "date_range_end": "Dec. 31, 1800"
}

Run matricula-online-scraper parish show --help to see all available options.
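
The register metadata can be filtered before anything is downloaded. The sketch below is an illustration only: it assumes the JSON Lines output contains one object per line (the sample above is pretty-printed for readability) and that the date field has the "YYYY - YYYY" shape shown in the sample. Records with a different shape are skipped rather than guessed at. read_registers, year_span, and covering are hypothetical names.

```python
import json
import re

# Matches a range such as "1715 - 1800" (shape taken from the sample record).
YEAR_RANGE = re.compile(r"(\d{4})\s*-\s*(\d{4})")


def read_registers(lines):
    """Parse JSON Lines: one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def year_span(register):
    """Return (start_year, end_year) from the 'date' field, or None."""
    match = YEAR_RANGE.search(register.get("date") or "")
    if not match:
        return None
    return int(match.group(1)), int(match.group(2))


def covering(registers, year):
    """Return the registers whose date range includes the given year."""
    hits = []
    for register in registers:
        span = year_span(register)
        if span and span[0] <= year <= span[1]:
            hits.append(register)
    return hits


# The sample record from above, written as a single JSON Lines row.
SAMPLE_LINE = (
    '{"name": "Taufen", '
    '"url": "https://data.matricula-online.eu/en/deutschland/muenster/muenster-st-martini/KB001/", '
    '"accession_number": "KB001", "date": "1715 - 1800", "register_type": "Taufen", '
    '"date_range_start": "Jan. 1, 1715", "date_range_end": "Dec. 31, 1800"}'
)

registers = read_registers([SAMPLE_LINE])
for register in covering(registers, 1750):
    print(register["register_type"], register["url"])
```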

Example: Combine with other commands and third-party tools to download all registers within a certain region

The following command downloads the cached list of all parishes, keeps only the parishes whose region matches "Paderborn", and pipes their URLs to matricula-online-scraper parish show to get the register metadata for each parish. The resulting register URLs are then piped to matricula-online-scraper parish fetch, which downloads the images of each register.

curl -sL https://github.com/lsg551/matricula-online-scraper/raw/cache/parishes/parishes.csv.gz |
  gunzip |
  csvgrep -c region -m "Paderborn" |
  csvcut -c url |
  csvformat --skip-header |
  xargs -n 1 -P 4 matricula-online-scraper parish show -o - |
  jq -r ".url // empty" |
  matricula-online-scraper parish fetch

It uses csvkit for processing the CSV data. Make sure to install it via pip install csvkit or your package manager if you want to replicate this example. Also make sure to have jq installed, as it is used to parse and manipulate the JSON output of some commands.
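
The last two stages of that pipeline can also be driven from Python. This is a hedged sketch rather than a supported interface: it assumes, as the shell pipeline implies, that parish show -o - writes JSON Lines to standard output and that parish fetch reads register URLs from standard input. Unlike xargs -P 4, it processes the parishes one after another. show_command, register_urls, and fetch_all are hypothetical helpers.

```python
import json
import subprocess

TOOL = "matricula-online-scraper"


def show_command(parish_url):
    """Build the argv for 'parish show', writing to stdout via '-o -'."""
    return [TOOL, "parish", "show", "-o", "-", parish_url]


def register_urls(jsonl_text):
    """Collect non-empty 'url' values, roughly like jq -r '.url // empty'."""
    urls = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        url = json.loads(line).get("url")
        if url:
            urls.append(url)
    return urls


def fetch_all(parish_urls):
    """Run 'parish show' per parish and feed the register URLs to 'parish fetch'."""
    for parish_url in parish_urls:
        shown = subprocess.run(
            show_command(parish_url), capture_output=True, text=True, check=True
        )
        urls = register_urls(shown.stdout)
        if urls:
            # Assumption: 'parish fetch' accepts newline-separated URLs on stdin.
            subprocess.run(
                [TOOL, "parish", "fetch"],
                input="\n".join(urls) + "\n",
                text=True,
                check=True,
            )


if __name__ == "__main__":
    # Print the command only; call fetch_all([...]) to actually download.
    print(show_command(
        "https://data.matricula-online.eu/en/deutschland/muenster/muenster-st-martini/"
    ))
```

Calling fetch_all with the url column of the filtered parish list reproduces the behaviour of the shell pipeline without requiring csvkit or jq.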

License

This project is licensed under the MIT License - see the LICENSE file for details.

You can read more about Matricula Online's terms of use and data licenses on their page, or check their robots.txt file at data.matricula-online.eu/robots.txt for restrictions on the use of automated tools (as of March 2025, there are none).

