A Python module to read and parse ALTO files

Project description

simple-alto-parser

This is a simple parser for ALTO XML files. It is designed to do three tasks separately:

1. Extract the text from the ALTO XML file with the AltoTextParser class.

The simple alto parser extracts text from ALTO XML files and is tailored to OCR-recognized text (e.g. from “Transkribus”, https://readcoop.eu/de/transkribus/). It organizes the extracted text alongside its meta-information, which the user can define or modify, for example the year of publication, the page number, and pre-defined image sections. This structure ensures that each part of the text stays associated with its corresponding image and meta-information, so it can be retrieved and displayed precisely at a later stage.
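To illustrate what "text alongside its meta-information" can look like, here is a minimal sketch that uses only the Python standard library, not the simple-alto-parser API. The sample document, the `extract_lines` helper, and the record keys are assumptions made for illustration. Only the ALTO element and attribute names (`Page`, `TextBlock`, `TextLine`, `String`, `CONTENT`) come from the ALTO schema.

```python
import xml.etree.ElementTree as ET

# A tiny ALTO v4 document, inlined so the sketch is self-contained.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page ID="p1" PHYSICAL_IMG_NR="7">
      <PrintSpace>
        <TextBlock ID="b1">
          <TextLine ID="l1">
            <String CONTENT="Meyer"/>
            <SP/>
            <String CONTENT="&amp;"/>
            <SP/>
            <String CONTENT="Cie."/>
          </TextLine>
          <TextLine ID="l2">
            <String CONTENT="Basel"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def extract_lines(alto_xml, file_name):
    """Return one record per TextLine, keeping page and block metadata.

    Hypothetical helper for illustration; not part of simple-alto-parser.
    """
    root = ET.fromstring(alto_xml)

    # ALTO files are namespaced; read the namespace from the root tag so the
    # sketch works for any ALTO version.
    namespace = root.tag.split('}')[0].strip('{') if root.tag.startswith('{') else ''

    def qualified(tag):
        return f'{{{namespace}}}{tag}' if namespace else tag

    records = []
    for page in root.iter(qualified('Page')):
        for block in page.iter(qualified('TextBlock')):
            for line in block.iter(qualified('TextLine')):
                # Each recognized word is stored in the CONTENT attribute.
                words = [s.get('CONTENT', '') for s in line.iter(qualified('String'))]
                records.append({
                    'file': file_name,
                    'page': page.get('PHYSICAL_IMG_NR'),
                    'block_id': block.get('ID'),
                    'line_id': line.get('ID'),
                    'text': ' '.join(words),
                })
    return records


records = extract_lines(SAMPLE_ALTO, 'sample.xml')
for record in records:
    print(record)
```

Each record keeps the file name, page number, and block and line identifiers next to the text, so a line can be traced back to its place in the scanned image.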

2. Extract structured information from the text with different parsing methods.

There are two fundamental parsing methods: (1) extracting relevant text segments by pattern matching with regular expressions (regex), and (2) matching against predefined dictionaries. Users can create their own dictionaries or use existing ones (e.g. for country names). Text segments recognized by either method can be categorized within the same function call and then removed from further parsing. This allows a multi-stage parsing approach with a low entry threshold, in which the still-uncategorized text segments remain visible after each parsing iteration. The simple alto parser also provides a replace function for normalizing text segments during parsing, and irrelevant text segments can be removed from the structured text. A dedicated function for manual adjustments covers cases where neither dictionary nor pattern matching is sufficient. Together, these methods allow the text to be structured comprehensively while preserving readability and retaining all meta-information. For transparency and traceability, the original text is always preserved, so every manipulation remains visible.
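The multi-stage idea can be sketched in plain Python. This is an illustration of the technique, not the library's implementation. The record layout, the helper names (`make_record`, `match_pattern`, `match_dictionary`, `replace_text`), and the `COUNTRIES` dictionary are all hypothetical.

```python
import re

# Hypothetical dictionary mapping surface forms to a normalized value.
COUNTRIES = {'Schweiz': 'CH', 'Suisse': 'CH', 'Deutschland': 'DE'}


def make_record(text):
    # 'original' is never modified, so every manipulation stays traceable.
    return {'original': text, 'text': text, 'info': {}}


def match_pattern(record, pattern, category):
    """Categorize the first regex match and remove it from the working text."""
    match = re.search(pattern, record['text'])
    if match:
        record['info'][category] = match.group(1)
        remaining = record['text'][:match.start()] + record['text'][match.end():]
        record['text'] = remaining.strip()


def match_dictionary(record, dictionary, category):
    """Categorize the first dictionary term found and remove it from the text."""
    for term, value in dictionary.items():
        term_pattern = rf'\b{re.escape(term)}\b'
        if re.search(term_pattern, record['text']):
            record['info'][category] = value
            record['text'] = re.sub(term_pattern, '', record['text']).strip(' ,')
            break


def replace_text(record, old, new):
    """Normalize a text segment in the working text."""
    record['text'] = record['text'].replace(old, new)


records = [make_record(t) for t in
           ['Meyer & Cie.', 'Basel, Schweiz', 'Bahnhofstr. 12']]

for record in records:
    match_pattern(record, r'(^.*& Cie\.$)', 'company_name')  # stage 1: regex
    match_dictionary(record, COUNTRIES, 'country')           # stage 2: dictionary
    replace_text(record, 'str.', 'strasse')                  # stage 3: normalize

# Whatever is still uncategorized stays visible for the next iteration.
remaining = [r['text'] for r in records if r['text']]
print(remaining)
```

Keeping the untouched `original` next to the shrinking `text` is what makes each stage auditable: the difference between the two fields shows what earlier stages consumed.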

3. Export the structured text and information for further processing, analysis, publication etc.

The structured texts can be exported in formats such as CSV, TSV, or JSON, with customizable export parameters. This allows tailored compilations of the structured texts, for example including all meta-information or only the final structured texts, so the data can be prepared directly for publication or further analysis in the desired format. In addition, JSON files are generated for the dictionaries created during parsing. These files can also be exported in a structured form for publication or use elsewhere.
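As a rough sketch of what a tailored export involves, the following standard-library example writes the same records once with all fields as CSV and once with a reduced field list as TSV and JSON. The record keys and the helpers `to_delimited` and `to_json` are assumptions for illustration and do not reflect the exporter's actual parameters.

```python
import csv
import io
import json

# Hypothetical structured records, as they might look after parsing.
records = [
    {'file': 'p1.xml', 'page': 7, 'text': 'Basel',
     'company_name': 'Meyer & Cie.', 'country': 'CH'},
    {'file': 'p1.xml', 'page': 7, 'text': 'Bahnhofstrasse 12',
     'company_name': '', 'country': ''},
]


def to_delimited(rows, fields, delimiter=','):
    """Serialize the selected fields as CSV (',') or TSV ('\\t') text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, delimiter=delimiter,
                            extrasaction='ignore', lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows, fields):
    """Serialize the selected fields as a JSON array of objects."""
    selected = [{field: row.get(field, '') for field in fields} for row in rows]
    return json.dumps(selected, ensure_ascii=False, indent=2)


# Full compilation: text plus all meta-information.
full_csv = to_delimited(records, ['file', 'page', 'text', 'company_name', 'country'])

# Reduced compilation: only the structured result.
slim_tsv = to_delimited(records, ['text', 'company_name'], delimiter='\t')
slim_json = to_json(records, ['text'])

print(full_csv)
print(slim_tsv)
print(slim_json)
```

Choosing the field list at export time, rather than during parsing, lets one parsing run serve both a publication-ready file and a fuller file for analysis.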

Usage

from simple_alto_parser import AltoFileParser, AltoPatternParser, AltoFileExporter

# Create a parser instance and supply your data directory
alto_parser = AltoFileParser('data')
alto_parser.parse()

# Find and categorize by patterns
pattern_parser = AltoPatternParser(alto_parser)
pattern_parser.find(r'(^.*\& Cie\.$)').categorize('company_name').remove()

# Other options: dictionary lookup, spaCy NER

# Export the data
alto_exporter = AltoFileExporter(alto_parser)
alto_exporter.save_csv('output/alto_test.csv', delimiter=',')
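To see which lines the pattern in the usage example would select, it can be tried on its own with Python's `re` module. The sample strings are invented, and the use of `search` on one line at a time is an assumption; how the library applies the pattern internally is not stated here.

```python
import re

# The same pattern as in the usage example: a whole line ending in "& Cie."
COMPANY_PATTERN = re.compile(r'(^.*\& Cie\.$)')

samples = ['Meyer & Cie.', 'Meyer & Cie. Basel', 'Meyer und Sohn']

# Map each sample line to whether the pattern matches it.
matches = {line: bool(COMPANY_PATTERN.search(line)) for line in samples}

for line, matched in matches.items():
    print(f'{line!r}: {matched}')
```

Because the pattern is anchored with `^` and `$`, only a line that ends in "& Cie." matches; a line with trailing text after "Cie." does not.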

Download files

Source Distribution

simple_alto_parser-0.0.19.tar.gz (29.2 kB)

Uploaded Source

Built Distribution

simple_alto_parser-0.0.19-py3-none-any.whl (30.8 kB)

Uploaded Python 3

File details

Details for the file simple_alto_parser-0.0.19.tar.gz.

File metadata

  • Download URL: simple_alto_parser-0.0.19.tar.gz
  • Size: 29.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.19

File hashes

Hashes for simple_alto_parser-0.0.19.tar.gz
Algorithm Hash digest
SHA256 87c151e60c462d180baf9cfc4a0a328f19471cf9f525b067f8a8c9f72a805a7f
MD5 c81321660ac63935e2acc8bc01983495
BLAKE2b-256 b14aa000a7a6bc701d441859ce2d282217df5cb4901458d3f60daf65e05a844d

File details

Details for the file simple_alto_parser-0.0.19-py3-none-any.whl.

File hashes

Hashes for simple_alto_parser-0.0.19-py3-none-any.whl
Algorithm Hash digest
SHA256 05848c57c7d782930ef00b143ce23c5391cec97454ae9ac67117cf6f972c7a46
MD5 4915000bf513509dfa21b188dc6bda52
BLAKE2b-256 e06c76e3556db89cd854a445b06e449c3dd0cb097c790667e529f54f23687fec
