A Python package for analysing and restructuring the output of Automatic Text Recognition (ATR) pipelines.

Project description

archival-structures

Tools for analysing PageXML/ATR transcriptions and scan images of archival documents: detecting and splitting two-page book openings, clustering text lines and page layouts, mining cross-page document-element sequences, ink-colour and missing-transcription detection, and parsing EAD/METS archival finding-aid metadata.

Full documentation (including the per-module API reference) lives in docs/ and is built with Sphinx; see Documentation below.

Techniques and tasks

Archival images and transcriptions are organised as <institute_id>/<archive_id>/<inventory_num_id>/<scan>. The core idea behind this package is that one inventory number's worth of scans is a structured, ordered corpus, not a set of independent images -- so the analysis is built up in layers:

1. Opening detection and splitting (archival_structures.analysis.opening_detection) -- decide whether a scan is a two-page spread, split it into independent verso/recto pages, and classify a whole inventory number as a book of openings versus a mixed folder/booklet.
2. Page-layout clustering (archival_structures.analysis.page_layout_clustering) -- cluster whole pages by the spatial arrangement of their text lines, via a grid-pattern TF-IDF fingerprint.
3. Line clustering (archival_structures.analysis.line_clustering) -- cluster individual text lines by indentation/width/height into a vocabulary of recurring line types (body text, closing lines, marginalia, ...).
4. Sequence-pattern mining (archival_structures.analysis.sequence_patterns) -- order lines into a corpus-wide reading sequence and segment it into document elements, including elements that span a page break.
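
To make the "grid-pattern TF-IDF fingerprint" idea concrete, here is a minimal standard-library sketch. It is not the package's implementation: the function names (grid_tokens, tfidf, cosine), the 4x6 grid, and the smoothed IDF formula are all assumptions chosen for illustration. Each text-line box contributes one token per grid cell it overlaps, and pages are then compared by the cosine similarity of their TF-IDF vectors.

```python
import math
from collections import Counter


def grid_tokens(lines, page_w, page_h, cols=4, rows=6):
    """Return one token per (line, grid cell) overlap.

    lines: iterable of (x, y, w, h) boxes in page pixel coordinates,
    with w >= 1 and h >= 1. The grid size is an illustrative assumption.
    """
    tokens = []
    for x, y, w, h in lines:
        c0 = min(cols - 1, int(x / page_w * cols))
        c1 = min(cols - 1, int((x + w - 1) / page_w * cols))
        r0 = min(rows - 1, int(y / page_h * rows))
        r1 = min(rows - 1, int((y + h - 1) / page_h * rows))
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                tokens.append(f"r{r}c{c}")
    return tokens


def tfidf(pages):
    """pages: dict page_id -> token list. Returns L2-normalised vectors."""
    n = len(pages)
    df = Counter(t for toks in pages.values() for t in set(toks))
    vectors = {}
    for pid, toks in pages.items():
        tf = Counter(toks)
        # Smoothed IDF, so cells present on every page still get a weight.
        vec = {t: tf[t] * (math.log((1 + n) / (1 + df[t])) + 1) for t in tf}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors[pid] = {t: v / norm for t, v in vec.items()}
    return vectors


def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


# Made-up pages: two with full-width lines at the top, one with a margin note.
pages = {
    "full_a": grid_tokens([(100, 100, 800, 40), (100, 160, 800, 40)], 1000, 1500),
    "full_b": grid_tokens([(100, 110, 800, 40), (100, 170, 800, 40)], 1000, 1500),
    "margin": grid_tokens([(20, 900, 150, 40)], 1000, 1500),
}
vectors = tfidf(pages)
sim_same = cosine(vectors["full_a"], vectors["full_b"])
sim_diff = cosine(vectors["full_a"], vectors["margin"])
```

In this toy data the two full-width pages land in the same grid cells and so get near-identical fingerprints, while the margin-note page shares no cells with them. Any standard clustering algorithm could then group pages on these vectors.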

Tasks 2 and 3 both depend on splitting first (task 1) -- clustering whole two-page scans conflates the left and right page's geometry into one coordinate frame.
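
The dependency on splitting can be illustrated with a small sketch. This is not the package's opening detector: find_gutter and split_opening are hypothetical helpers, and the "widest line-free gap in the central band" heuristic is an assumption. The point it shows is that once the recto's boxes are re-based to their own page frame, equivalent lines on the left and right page get comparable coordinates.

```python
def find_gutter(lines, scan_w, central=(0.35, 0.65)):
    """Return the x position of the widest line-free vertical gap whose
    midpoint lies in the central band of the scan, or None if there is none.

    lines: list of (x, y, w, h) text-line boxes in scan pixel coordinates.
    """
    lo, hi = central[0] * scan_w, central[1] * scan_w
    spans = sorted((x, x + w) for x, y, w, h in lines)
    best = None
    covered_to = spans[0][1] if spans else 0
    for start, end in spans[1:]:
        if start > covered_to:  # a gap that no line crosses
            mid = (covered_to + start) / 2
            width = start - covered_to
            if lo <= mid <= hi and (best is None or width > best[1]):
                best = (mid, width)
        covered_to = max(covered_to, end)
    return best[0] if best else None


def split_opening(lines, scan_w):
    """Split line boxes into (verso, recto), or return None for a single page."""
    gutter = find_gutter(lines, scan_w)
    if gutter is None:
        return None
    verso = [b for b in lines if b[0] + b[2] <= gutter]
    # Re-base recto boxes so that x = 0 is the gutter: each page gets its own frame.
    recto = [(x - gutter, y, w, h) for x, y, w, h in lines if x >= gutter]
    return verso, recto


# Made-up two-page spread, 2000 px wide, with the same layout on both pages.
spread = [
    (100, 100, 800, 40), (120, 160, 780, 40),    # left page
    (1100, 100, 800, 40), (1120, 160, 780, 40),  # right page
]
verso, recto = split_opening(spread, scan_w=2000)
single = split_opening([(100, 100, 1800, 40)], scan_w=2000)
```

Before splitting, the right-hand lines start at x = 1100 and x = 1120. After splitting they start at 100 and 120, the same indentation as their left-hand counterparts, so a line or layout clustering step would no longer treat them as different shapes.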

Alongside the text-analysis pipeline:

Ink colour, multi-colour text, and missing transcriptions (archival_structures.clustering.colour_clustering) -- robust ink/paper separation via multi-Otsu thresholding plus connected-component shape analysis (resistant to small artefacts like a sticker or stain), screening pages for more than one ink colour via LAB chroma spread, and flagging untranscribed page regions whose pixels look like genuine ink rather than blank paper.
Coordinate-space bridging (archival_structures.model.image, archival_structures.image) -- converting between a scan's native pixel coordinates, a thumbnail's, and a canvas rendering of a selection, via an affine Transform; converting between PageXML Coords and this package's own Box type; ipywidgets-based interactive region drawing/tagging.
Ground-truth annotation (archival_structures.datasets.annotations) -- a multi-level namespace:type(:subtype)?(#N)? tag vocabulary (see docs/vocabulary.md) for labelling scans/pages/lines/cross-page elements, plus ipywidgets notebook apps for producing it one scan (archival_structures.datasets.annotations) or one cluster (archival_structures.datasets.bulk_tagging) at a time.
Stream analysis (archival_structures.stream_analysis) -- a separate concern from the PageXML pipeline: embeddings + UMAP/HDBSCAN clustering, layout features, optional VLM tagging, and active-learning ground-truth creation for a plain directory of document images (no PageXML required) -- see docs/stream_analysis.md.
EAD/METS parsing (archival_structures.parsers) -- a separate concern from the PageXML/image pipeline: parsing the archival finding-aid metadata (series/subseries/file structure, page manifests) that describes an archive's holdings.
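
As an illustration of the namespace:type(:subtype)?(#N)? tag shape used for ground-truth annotation, the following sketch parses such a tag with a regular expression. It assumes that N is a non-negative integer and that the name parts contain no colon, hash, or whitespace; the Tag class, parse_tag function, and example tags are hypothetical, and docs/vocabulary.md remains the authority on the actual vocabulary.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed character rules: name parts may not contain ':', '#', or whitespace.
TAG_RE = re.compile(
    r"^(?P<namespace>[^:#\s]+):(?P<type>[^:#\s]+)"
    r"(?::(?P<subtype>[^:#\s]+))?"
    r"(?:#(?P<number>\d+))?$"
)


@dataclass(frozen=True)
class Tag:
    namespace: str
    type: str
    subtype: Optional[str] = None
    number: Optional[int] = None


def parse_tag(text: str) -> Tag:
    """Parse a namespace:type(:subtype)?(#N)? tag, raising ValueError if malformed."""
    m = TAG_RE.match(text)
    if m is None:
        raise ValueError(f"not a namespace:type(:subtype)?(#N)? tag: {text!r}")
    number = m.group("number")
    return Tag(
        namespace=m.group("namespace"),
        type=m.group("type"),
        subtype=m.group("subtype"),
        number=int(number) if number is not None else None,
    )


# Made-up example tags, not taken from the real vocabulary.
full = parse_tag("line:marginalia:date#2")
short = parse_tag("page:title")
```

Keeping the optional parts as None rather than empty strings makes it easy to tell "no subtype" apart from a subtype that happens to be empty.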

See docs/findings.md for the concrete lessons learned while building this, each validated against real data -- several of the choices above (e.g. splitting before clustering, chroma spread over luminosity-class counting for multi-colour detection) turned out to matter far more than they first appeared to.
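
As a rough illustration of what a chroma-spread screen could look like, the sketch below measures how far the ink pixels' (a*, b*) values in CIELAB scatter around their mean. This is one plausible reading of "chroma spread", not the package's definition: the chroma_spread function, the RMS formula, the threshold of 10, and the sample values are all assumptions. It also assumes ink pixels have already been separated from paper and converted to LAB.

```python
import math


def chroma_spread(ink_ab):
    """RMS distance of ink pixels' (a*, b*) values from their mean.

    ink_ab: list of (a, b) pairs for pixels already classified as ink.
    """
    n = len(ink_ab)
    if n == 0:
        return 0.0
    mean_a = sum(a for a, _ in ink_ab) / n
    mean_b = sum(b for _, b in ink_ab) / n
    sq = sum((a - mean_a) ** 2 + (b - mean_b) ** 2 for a, b in ink_ab)
    return math.sqrt(sq / n)


# Made-up samples: near-neutral dark ink only, versus dark ink plus a reddish ink.
single_ink = [(2, 3), (3, 2), (2, 2), (3, 3)]
two_inks = [(2, 2), (2, 2), (50, 30), (50, 30)]

THRESHOLD = 10.0  # hypothetical cut-off, would need tuning on real scans
single_flag = chroma_spread(single_ink) > THRESHOLD
mixed_flag = chroma_spread(two_inks) > THRESHOLD
```

The intuition is that lightness varies a great deal within a single ink (pen pressure, fading), whereas a second ink colour shifts the a*/b* values themselves, so a measure in the chroma plane is less easily fooled than one that counts lightness classes.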

Demo notebooks

All in notebooks/demo/:

annotate-scans.ipynb -- ipywidgets ground-truth annotation app.
bulk-tag-annotation-demo.ipynb -- tagging many scans at once by cluster, with a structured namespace/type/subtype tag builder instead of free text.
inventory-structure-demo.ipynb -- classifying a whole inventory number as a book of openings vs a mixed folder.
opening-detection-demo.ipynb -- per-scan opening detection and splitting.
line-clustering-demo.ipynb and line-clustering-table-vs-deeds-demo.ipynb -- clustering text lines by indentation/width, and comparing that across a table-like register versus notary deeds.
page-layout-clustering-demo.ipynb and page-layout-clustering-table-vs-deeds-demo.ipynb -- clustering pages by text-line layout, and the same table-vs-deeds comparison.
pagexml-image-region-linking.ipynb -- drawing PageXML regions on a thumbnail, and converting a manually-drawn selection back into a new PageXML region.
pagexml-image-multicolour-explorer.ipynb -- screening a sample of scans for multi-colour text and missing-transcription candidates.
sequence-patterns-demo.ipynb -- mining recurring n-gram patterns and cross-page document elements, comparing the table register against the notary deeds.
stream-analysis-overview-demo.ipynb and stream-analysis-groundtruth-demo.ipynb -- embeddings + clustering, optional VLM tagging, and active-learning ground-truth creation for a plain directory of document images (no PageXML required).
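
The region-linking notebook relies on moving points between the scan, its thumbnail, and a canvas rendering of a selection. The sketch below shows the underlying idea with an axis-aligned affine map (scale plus offset). ScaleShift is a hypothetical stand-in written for this example, not the package's Transform class, and the image sizes are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleShift:
    """Axis-aligned affine map: x' = sx * x + tx, y' = sy * y + ty."""
    sx: float
    sy: float
    tx: float = 0.0
    ty: float = 0.0

    def apply(self, x, y):
        return (self.sx * x + self.tx, self.sy * y + self.ty)

    def inverse(self):
        return ScaleShift(1 / self.sx, 1 / self.sy,
                          -self.tx / self.sx, -self.ty / self.sy)

    def then(self, other):
        """Compose two maps: apply self first, then other."""
        return ScaleShift(other.sx * self.sx, other.sy * self.sy,
                          other.sx * self.tx + other.tx,
                          other.sy * self.ty + other.ty)


# A 4000 px wide scan shown as a 1000 px wide thumbnail.
scan_to_thumb = ScaleShift(0.25, 0.25)
# The canvas shows the thumbnail crop starting at (50, 100), magnified 2x.
thumb_to_canvas = ScaleShift(2, 2, -100, -200)

scan_to_canvas = scan_to_thumb.then(thumb_to_canvas)

# A point drawn on the canvas, mapped back to native scan pixels.
scan_point = scan_to_canvas.inverse().apply(60, 80)
```

Composing the two maps once and inverting the result avoids stepping through intermediate thumbnail coordinates every time a drawn selection has to be written back as a region in scan pixels.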

Demo data

The notebooks above need real PageXML/thumbnail data (~341 MB across 7 inventory numbers) that isn't committed to this repo -- only the package code is. Download demo-data.zip from the latest release and extract it at the repository root:

unzip demo-data.zip -d .

This recreates data/PageXML/, data/thumbs/, and data/annotations/ with exactly the inventory numbers the demo notebooks reference, so they run unchanged once extracted.

Installation

poetry install

Requires Python >=3.11,<3.15 -- torch's triton dependency caps out at Python <3.15, so the project's declared Python range matches that rather than the more typical <4.0.

Documentation

Built with Sphinx; requires the optional docs dependency group:

poetry install --with docs
cd docs
make html

Project details

Maintainers

marijn.koolen

Release history

0.3.0 -- Jul 3, 2026 (this version)
0.2.0 -- Jun 30, 2026
0.1.0 -- Jun 30, 2026

Download files

Source Distribution

archival_structures-0.2.0.tar.gz (142.6 kB)

Uploaded Jun 30, 2026 Source

Built Distribution

archival_structures-0.2.0-py3-none-any.whl (172.0 kB)

Uploaded Jun 30, 2026 Python 3

File details

Details for the file archival_structures-0.2.0.tar.gz.

File metadata

Download URL: archival_structures-0.2.0.tar.gz
Upload date: Jun 30, 2026
Size: 142.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for archival_structures-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`6a0347dcf2f527ed623f7c21cb8d4bf8a801a7b73f4bfd0625781f1087542c83`
MD5	`02268c24171c2c9de9750d60d31fd67b`
BLAKE2b-256	`bd67b43e057e5f357c0372eba8737b10d964506c4f7f34e46259d382720995e3`

Provenance

The following attestation bundles were made for archival_structures-0.2.0.tar.gz:

Publisher: publish.yml on Data-Scopes/archival-structures

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: archival_structures-0.2.0.tar.gz
- Subject digest: 6a0347dcf2f527ed623f7c21cb8d4bf8a801a7b73f4bfd0625781f1087542c83
- Sigstore transparency entry: 2024249594
- Sigstore integration time: Jun 30, 2026
Source repository:
- Permalink: Data-Scopes/archival-structures@6ce8005fd729be852ee5fc9825630b581c7c1fab
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/Data-Scopes
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@6ce8005fd729be852ee5fc9825630b581c7c1fab
- Trigger Event: release

File details

Details for the file archival_structures-0.2.0-py3-none-any.whl.

File metadata

Download URL: archival_structures-0.2.0-py3-none-any.whl
Upload date: Jun 30, 2026
Size: 172.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for archival_structures-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1110f4bcb9d4640c9e9d67275073edcf686b43b3ed5e5d314d3a8bfce88e44b2`
MD5	`bf76c7463b25eb0ebaade046e5d54e31`
BLAKE2b-256	`5965fb2910f6fea317a5e0f222f445151f5f77f6880f44a0324e29831b16f6c9`

Provenance

The following attestation bundles were made for archival_structures-0.2.0-py3-none-any.whl:

Publisher: publish.yml on Data-Scopes/archival-structures

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: archival_structures-0.2.0-py3-none-any.whl
- Subject digest: 1110f4bcb9d4640c9e9d67275073edcf686b43b3ed5e5d314d3a8bfce88e44b2
- Sigstore transparency entry: 2024249750
- Sigstore integration time: Jun 30, 2026
Source repository:
- Permalink: Data-Scopes/archival-structures@6ce8005fd729be852ee5fc9825630b581c7c1fab
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/Data-Scopes
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@6ce8005fd729be852ee5fc9825630b581c7c1fab
- Trigger Event: release
