A Python package for analysing and restructuring the output of Automatic Text Recognition (ATR) pipelines.
archival-structures
Tools for analysing PageXML/ATR transcriptions and scan images of archival documents: detecting and splitting two-page book openings, clustering text lines and page layouts, mining cross-page document-element sequences, ink-colour and missing-transcription detection, and parsing EAD/METS archival finding-aid metadata.
Full documentation (including the per-module API reference) lives in docs/ and is
built with Sphinx; see Documentation below.
Techniques and tasks
Archival images and transcriptions are organised as
<institute_id>/<archive_id>/<inventory_num_id>/<scan>. The core idea behind this package is
that one inventory number's worth of scans is a structured, ordered corpus, not a set of
independent images -- so the analysis is built up in layers:
- Opening detection and splitting (archival_structures.analysis.opening_detection) -- decide whether a scan is a two-page spread, split it into independent verso/recto pages, and classify a whole inventory number as a book of openings versus a mixed folder/booklet.
- Page-layout clustering (archival_structures.analysis.page_layout_clustering) -- cluster whole pages by the spatial arrangement of their text lines, via a grid-pattern TF-IDF fingerprint. A complementary fingerprint, archival_structures.analysis.relational_patterns (clustered by relational_layout_clustering), instead encodes each line's own type and its RCC-8 spatial relation to its immediate below/right neighbour -- relational line-neighbourhood patterns a pixel-pattern fingerprint can't represent.
  - Structural whitespace (archival_structures.analysis.empty_regions) -- detects and clusters significant whitespace regions within pages (computed geometrically, not from PageXML region markup) and scores which relational patterns are over-represented adjacent to those whitespace boundaries.
  - Cross-page boundaries (archival_structures.analysis.boundary_detection) -- detects blank or near-blank pages in the page sequence, and identifies which page-layout clusters systematically appear before or after them.
  - Text-extent margins (archival_structures.analysis.text_extent) -- measures how far from each page edge the first and last transcribed lines sit (relative top, bottom, left, right margins); classifies each page as full_text, late_start, early_end, or short; and characterises each inventory by its full-text page fraction -- a lightweight signal for distinguishing running-text books from sparse table registers or mixed-document archives.
- Line clustering (archival_structures.analysis.line_clustering) -- cluster individual text lines by indentation/width/height into a vocabulary of recurring line types (body text, closing lines, marginalia, ...).
- Sequence-pattern mining (archival_structures.analysis.sequence_patterns) -- order lines into a corpus-wide reading sequence and segment it into document elements, including elements that span a page break.
Page-layout clustering and line clustering (the second and third tasks above) both depend on opening splitting (the first task) being done first -- clustering whole two-page scans conflates the left and right pages' geometry into one coordinate frame.
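To make the coordinate-frame point concrete, here is a minimal sketch of re-basing line boxes after a split. It assumes the gutter position is already known and does not show how the package detects it. `LineBox`, `split_opening` and `gutter_x` are hypothetical names for illustration, not part of archival_structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineBox:
    """Axis-aligned bounding box of one text line, in scan pixel coordinates."""
    x: int
    y: int
    w: int
    h: int


def split_opening(lines, gutter_x):
    """Assign each line to the verso (left) or recto (right) page.

    Recto lines are shifted left by gutter_x so that both pages use a
    page-local origin, with x = 0 at each page's own left edge.
    """
    verso, recto = [], []
    for line in lines:
        centre_x = line.x + line.w / 2
        if centre_x < gutter_x:
            verso.append(line)
        else:
            recto.append(LineBox(line.x - gutter_x, line.y, line.w, line.h))
    return verso, recto


# Two lines with the same indentation relative to their own page.
scan_lines = [LineBox(100, 200, 800, 40), LineBox(1100, 200, 800, 40)]
verso, recto = split_opening(scan_lines, gutter_x=1000)
print(verso[0].x, recto[0].x)  # 100 100
```

In scan coordinates the two lines start at x = 100 and x = 1100 and would look like different line types; in page-local coordinates they share the same indentation.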
Alongside the text-analysis pipeline:
- Ink colour, multi-colour text, and missing transcriptions (archival_structures.clustering.colour_clustering) -- robust ink/paper separation via multiotsu + connected-component shape (resistant to small artefacts like a sticker or stain), screening pages for more than one ink colour via LAB chroma spread, and flagging untranscribed page regions whose pixels look like genuine ink rather than blank paper.
- Coordinate-space bridging (archival_structures.model.image, archival_structures.image) -- converting between a scan's native pixel coordinates, a thumbnail's, and a canvas rendering of a selection, via an affine Transform; converting between PageXML Coords and this package's own Box type; ipywidgets-based interactive region drawing/tagging.
- Ground-truth annotation (archival_structures.datasets.annotations) -- a multi-level namespace:type(:subtype)?(#N)? tag vocabulary (see docs/vocabulary.md) for labelling scans/pages/lines/cross-page elements, plus ipywidgets notebook apps for producing it one scan (archival_structures.datasets.annotations) or one cluster (archival_structures.datasets.bulk_tagging) at a time.
- Stream analysis (archival_structures.stream_analysis) -- a separate concern from the PageXML pipeline: embeddings + UMAP/HDBSCAN clustering, layout features, optional VLM tagging, and active-learning ground-truth creation for a plain directory of document images (no PageXML required) -- see docs/stream_analysis.md.
  - Sequence pattern analysis (archival_structures.stream_analysis.sequence_analysis) -- label-agnostic tools for analysing ordered sequences of cluster labels (from visual or layout clustering): run-length encoding and noise-run merging, cluster n-gram mining, tandem repeat detection (recurring cluster sub-sequences), and transition matrices.
  - Subsequence detection (archival_structures.stream_analysis.overview.subsequence_detection) -- detects visually homogeneous (book-like) subsequences within a heterogeneous scan sequence using adjacent cosine similarity between DINOv2 embeddings; threshold-based and optional change-point (ruptures) boundary detection; scores each segment by mean similarity, cluster entropy, and optional opening consistency.
- EAD/METS parsing (archival_structures.parsers) -- a separate concern from the PageXML/image pipeline: parsing the archival finding-aid metadata (series/subseries/file structure, page manifests) that describes an archive's holdings.
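The namespace:type(:subtype)?(#N)? tag shape mentioned above can be illustrated with a small standalone parser. This is a sketch under assumptions, not the package's own parser: the source gives only the overall shape, so the allowed characters per component, the reading of N as an integer, the `parse_tag` helper, and the example tags are all invented here (the real vocabulary is in docs/vocabulary.md).

```python
import re

# Assumption: each name component is a run of letters, digits, "_" or "-",
# and N is a non-negative integer. Only the overall shape comes from the README.
TAG_PATTERN = re.compile(
    r"(?P<namespace>[\w-]+):(?P<type>[\w-]+)"
    r"(?::(?P<subtype>[\w-]+))?"
    r"(?:#(?P<index>\d+))?"
)


def parse_tag(tag):
    """Split a tag string into its namespace, type, subtype and index parts."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a valid tag: {tag!r}")
    parts = match.groupdict()
    if parts["index"] is not None:
        parts["index"] = int(parts["index"])
    return parts


# Hypothetical tags, made up for this example.
print(parse_tag("page:title"))
# {'namespace': 'page', 'type': 'title', 'subtype': None, 'index': None}
print(parse_tag("line:marginalia:left#2"))
# {'namespace': 'line', 'type': 'marginalia', 'subtype': 'left', 'index': 2}
```

Parsing tags into named parts rather than treating them as free text is what makes it possible to filter or group annotations by level.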
See docs/findings.md for the concrete, validated-against-real-data lessons
learned while building this -- several of the choices above (e.g. splitting before clustering,
chroma spread over luminosity-class counting for multi-colour detection) turned out to matter a
lot more than they first appeared to.
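As a rough illustration of why chroma spread can separate cases that lightness alone cannot, the sketch below converts a few sRGB pixels to CIELAB and measures how widely their (a*, b*) values scatter. The exact spread statistic the package uses is not stated here, so the root-mean-square definition, the function names, and the sample pixel values are all assumptions.

```python
import math


def srgb_to_lab(rgb):
    """Convert one 8-bit sRGB pixel to CIELAB (D65 white point)."""
    def linearise(channel):
        c = channel / 255
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearise(c) for c in rgb)
    # Linear sRGB -> XYZ, normalised by the D65 reference white.
    x = (0.4124564 * r + 0.3575761 * g + 0.1804375 * b) / 0.95047
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = (0.0193339 * r + 0.1191920 * g + 0.9503041 * b) / 1.08883

    def f(t):
        delta = 6 / 29
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    fx, fy, fz = f(x), f(y), f(z)
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)


def chroma_spread(pixels):
    """Root-mean-square distance of the pixels' (a*, b*) values from their mean."""
    ab = [srgb_to_lab(p)[1:] for p in pixels]
    mean_a = sum(a for a, _ in ab) / len(ab)
    mean_b = sum(b for _, b in ab) / len(ab)
    return math.sqrt(sum((a - mean_a) ** 2 + (b - mean_b) ** 2 for a, b in ab) / len(ab))


# Hypothetical ink samples: one ink at two darkness levels, and a black/red mix.
one_ink = [(40, 40, 40), (120, 120, 120)]
two_inks = [(40, 40, 40), (170, 30, 30)]
print(chroma_spread(one_ink) < chroma_spread(two_inks))  # True
```

The two grey samples differ clearly in lightness, which a lightness-based class count might read as two inks, yet their chroma spread is close to zero; only the black/red pair scatters in the a*/b* plane.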
Demo notebooks
All in notebooks/demo/:
- annotate-scans.ipynb -- ipywidgets ground-truth annotation app.
- bulk-tag-annotation-demo.ipynb -- tagging many scans at once by cluster, with a structured namespace/type/subtype tag builder instead of free text.
- inventory-structure-demo.ipynb -- classifying a whole inventory number as a book of openings vs a mixed folder.
- opening-detection-demo.ipynb -- per-scan opening detection and splitting.
- line-clustering-demo.ipynb and line-clustering-table-vs-deeds-demo.ipynb -- clustering text lines by indentation/width, and comparing that across a table-like register versus notary deeds.
- page-layout-clustering-demo.ipynb and page-layout-clustering-table-vs-deeds-demo.ipynb -- clustering pages by text-line layout, and the same table-vs-deeds comparison.
- relational-layout-clustering-table-vs-deeds-demo.ipynb -- clustering pages by line-type-and-neighbour-relation fingerprint instead of raw geometry, compared against the geometric clustering above.
- empty-region-clustering-demo.ipynb -- detecting and clustering significant whitespace regions within pages; contrasting the tiny inter-cell gaps in a table register against the structural blank areas in notary deed pages.
- boundary-within-pages-demo.ipynb -- which relational line-neighbourhood patterns (RCC-8 symbols) are over-represented immediately adjacent to significant whitespace regions -- the within-page boundary markers.
- boundary-across-pages-demo.ipynb -- which page-layout clusters appear near blank pages in the page sequence -- the across-page boundary markers; contrasts the table register's front-matter blanks against the notary deeds' regular blank-recto convention.
- full-text-page-detection-demo.ipynb -- detecting full-text pages from top/bottom text-extent margins; comparing six inventories (three HaNA table registers, two HaNA letter-copy books, one notary-deeds book) by their full-text page fraction, margin distribution, and line-width/equal-extent features.
- pagexml-image-region-linking.ipynb -- drawing PageXML regions on a thumbnail, and converting a manually-drawn selection back into a new PageXML region.
- pagexml-image-multicolour-explorer.ipynb -- screening a sample of scans for multi-colour text and missing-transcription candidates.
- sequence-patterns-demo.ipynb -- mining recurring n-gram patterns and cross-page document elements, comparing the table register against the notary deeds.
- stream-analysis-overview-demo.ipynb and stream-analysis-groundtruth-demo.ipynb -- embeddings + clustering, optional VLM tagging, and active-learning ground-truth creation for a plain directory of document images (no PageXML required).
- subsequence-detection-demo.ipynb -- detecting book-like subsequences within a heterogeneous scan sequence (NL-AsdSAA_89_3.1) using adjacent DINOv2 cosine similarity; validates against a known book run and identifies additional candidates.
- cluster-sequence-analysis-demo.ipynb -- sequence pattern analysis of cluster label sequences for NL-HaNA_2.10.50_1 (visual and layout clustering) and NL-AsnDA_0114.11_1 (layout clustering); demonstrates run_length_encode, find_tandem_repeats, find_frequent_ngrams, and label_transition_matrix.
- resolution-cluster-sequence-demo.ipynb -- layout cluster sequence analysis for six resolution-book inventories from NL-HaNA_1.01.02 (3771–3823); discovers candidate section boundaries from cluster sequence patterns without using the available ground-truth section metadata.
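For readers who want a feel for the cluster-sequence techniques before opening the notebooks, the following standalone sketch shows run-length encoding, n-gram counting, and transition counting on a made-up label sequence. It uses only the standard library and deliberately different helper names (`rle`, `ngram_counts`, `transition_counts`); the signatures and behaviour of the package's own functions are not documented here and may differ.

```python
from collections import Counter
from itertools import groupby


def rle(labels):
    """Collapse consecutive repeats into (label, run_length) pairs."""
    return [(label, sum(1 for _ in run)) for label, run in groupby(labels)]


def ngram_counts(labels, n):
    """Count every contiguous n-gram of labels."""
    return Counter(tuple(labels[i:i + n]) for i in range(len(labels) - n + 1))


def transition_counts(labels):
    """Count each (current, next) pair of adjacent labels."""
    return Counter(zip(labels, labels[1:]))


# Hypothetical per-scan cluster labels for one inventory, in scan order.
labels = [0, 0, 0, 1, 2, 1, 2, 1, 2, 0, 0]
runs = rle(labels)
print(runs)
# [(0, 3), (1, 1), (2, 1), (1, 1), (2, 1), (1, 1), (2, 1), (0, 2)]
print(ngram_counts([label for label, _ in runs], 2).most_common(1))
# [((1, 2), 3)]
print(transition_counts(labels)[(2, 1)])
# 2
```

Counting n-grams over the run-length-encoded labels rather than the raw sequence keeps long runs of one cluster from dominating the counts, so the repeated 1, 2 alternation stands out.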
Demo data
The notebooks above need real PageXML/thumbnail data (~341MB across 7 inventory numbers) that
isn't committed to this repo -- only the package code is. Download demo-data.zip from the
latest release and extract it at
the repository root:
unzip demo-data.zip -d .
This recreates data/PageXML/, data/thumbs/, and data/annotations/ with exactly the
inventory numbers the demo notebooks reference, so they run unchanged once extracted.
Installation
poetry install
Requires Python >=3.11,<3.15 -- torch's triton dependency caps out at Python <3.15, so the
project's declared Python range matches that rather than the more typical <4.0.
Documentation
Built with Sphinx; requires the optional docs dependency group:
poetry install --with docs
cd docs
make html