A Python module to read and parse ALTO files

simple-alto-parser

This is a simple parser for ALTO XML files. It is designed to do three tasks separately:

1. Extract the text from the ALTO XML file with the AltoTextParser class.

The simple alto parser extracts text from ALTO XML files and is tailored to OCR-recognized text (e.g. produced with “Transkribus”, https://readcoop.eu/de/transkribus/). It organizes the extracted text alongside its meta-information, which the user can define or manipulate: for example the year of publication, the page number, and pre-defined image sections. This structure ensures that each part of the text stays associated with its corresponding image and meta-information, so that it can be retrieved and displayed precisely at a later stage.
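To make the idea concrete, here is a minimal sketch of line-level text extraction using only the Python standard library. It is not the implementation used by simple-alto-parser. The sample document, the helper names `local_name` and `extract_lines`, and the `year` meta field are assumptions made for illustration. Tags are matched by local name because ALTO versions use different XML namespaces.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up ALTO document used only for illustration.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page ID="p1" PHYSICAL_IMG_NR="12">
      <PrintSpace>
        <TextBlock ID="b1">
          <TextLine ID="l1" HPOS="100" VPOS="200" WIDTH="400" HEIGHT="30">
            <String CONTENT="Muster"/>
            <SP/>
            <String CONTENT="&amp;"/>
            <SP/>
            <String CONTENT="Cie."/>
          </TextLine>
          <TextLine ID="l2" HPOS="100" VPOS="240" WIDTH="300" HEIGHT="30">
            <String CONTENT="Basel"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_lines(xml_text, extra_meta=None):
    """Return one record per TextLine, with its text and meta-information."""
    root = ET.fromstring(xml_text)
    records = []
    for page in root.iter():
        if local_name(page.tag) != "Page":
            continue
        for line in page.iter():
            if local_name(line.tag) != "TextLine":
                continue
            # Each recognized word is stored in the CONTENT attribute of a String element.
            words = [el.get("CONTENT", "") for el in line if local_name(el.tag) == "String"]
            record = {
                "text": " ".join(words),
                "page": page.get("PHYSICAL_IMG_NR"),
                "line_id": line.get("ID"),
                # Position of the line on the page image (kept as strings).
                "position": tuple(line.get(k) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT")),
            }
            # User-supplied meta-information (hypothetical, e.g. the year of publication).
            record.update(extra_meta or {})
            records.append(record)
    return records


lines = extract_lines(SAMPLE_ALTO, {"year": 1923})
for rec in lines:
    print(rec["page"], rec["line_id"], rec["text"])
```

Keeping the page number and line position on every record is what allows a text segment to be traced back to its image region later on.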

2. Extract structured information from the text with different parsing methods.

There are two fundamental parsing methods: (1) extraction of relevant text segments through pattern matching with regular expressions (regex), and (2) matching against predefined dictionaries. Users can create their own dictionaries or use existing ones (e.g. for country names). Text segments recognized through these methods can be categorized in the same step and then excluded from further parsing. This enables a multi-stage parsing approach with a low entry threshold, in which uncategorized text segments remain visible after each parsing iteration.

The simple alto parser also features a replace function for normalizing text segments during parsing, and irrelevant text segments can be removed from the structured text. A dedicated function for manual adjustments covers cases where neither dictionary nor pattern matching is sufficient. Together, these methods allow the text to be structured comprehensively while preserving readability and retaining all meta-information. For transparency and traceability, the original text is always preserved, so all manipulations remain visible.
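The multi-stage approach described above can be sketched in plain Python as follows. This is an illustration of the technique, not the library's own code: the record layout and the functions `make_records`, `categorize_by_pattern`, and `categorize_by_dictionary` are hypothetical names chosen for this example, and the sample texts are made up.

```python
import re


def make_records(texts):
    """Wrap each text so the original is preserved next to the working copy."""
    return [{"original": t, "text": t, "categories": {}} for t in texts]


def _cut(text, start, end):
    """Remove text[start:end] and trim leftover spaces and commas."""
    return (text[:start] + text[end:]).strip(" ,")


def categorize_by_pattern(records, pattern, category):
    """Stage type 1: categorize the first regex match and remove it from the text."""
    regex = re.compile(pattern)
    for rec in records:
        match = regex.search(rec["text"])
        if match:
            rec["categories"][category] = match.group(0)
            rec["text"] = _cut(rec["text"], match.start(), match.end())


def categorize_by_dictionary(records, dictionary, category):
    """Stage type 2: look up known variants and store their normalized value."""
    for rec in records:
        for variant, normalized in dictionary.items():
            match = re.search(r"\b" + re.escape(variant) + r"\b", rec["text"])
            if match:
                rec["categories"][category] = normalized
                rec["text"] = _cut(rec["text"], match.start(), match.end())
                break


records = make_records(["Muster & Cie.", "Basel, Schweiz", "Tel. 1234"])

# Stage 1: company names by pattern.
categorize_by_pattern(records, r"^.*& Cie\.$", "company_name")

# Stage 2: country names by dictionary (variant -> normalized form).
categorize_by_dictionary(records, {"Schweiz": "Switzerland", "Suisse": "Switzerland"}, "country")

# Whatever is still uncategorized stays visible for the next stage.
remaining = [rec["text"] for rec in records if rec["text"]]
print(remaining)
```

Because every record keeps its `original` field untouched, each stage can be inspected or undone, and the `remaining` list shows what the next pattern or dictionary should target.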

3. Export the structured text and information for further processing, analysis, publication etc.

The structured texts can be exported in various formats such as CSV, TSV, or JSON, with customizable export parameters. This allows tailored compilations of the structured texts, for example including all meta-information or only the final structured texts, so the data can be prepared directly for publication or for further analysis in the desired format. In addition, JSON files are generated for the dictionaries created during the parsing process. These files can also be exported in a structured manner for publication or use elsewhere.
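As a rough sketch of what a customizable export can look like, the following example writes the same records as TSV, CSV, or JSON with a user-chosen set of fields, using only the standard library. The record layout and the functions `to_delimited` and `to_json` are assumptions for illustration and do not reflect the exporter's actual interface.

```python
import csv
import io
import json

# Made-up structured records, as they might look after parsing.
export_records = [
    {"original": "Basel, Schweiz", "text": "Basel", "page": 12, "country": "Switzerland"},
    {"original": "Muster & Cie.", "text": "", "page": 12, "company_name": "Muster & Cie."},
]


def to_delimited(records, fields, delimiter=","):
    """Serialize the chosen fields as CSV (',') or TSV ('\\t')."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fields,
        delimiter=delimiter,
        extrasaction="ignore",  # skip fields that were not requested
        restval="",             # fill fields a record does not have
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json(records, fields):
    """Serialize the chosen fields as a JSON array of objects."""
    selected = [{f: rec.get(f) for f in fields} for rec in records]
    return json.dumps(selected, ensure_ascii=False, indent=2)


tsv_output = to_delimited(export_records, ["original", "country"], delimiter="\t")
csv_output = to_delimited(export_records, ["original", "country"])
json_output = to_json(export_records, ["original", "page"])
print(tsv_output)
```

Selecting the fields at export time is one simple way to produce either a full compilation with all meta-information or a reduced one with only the final structured values.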

Usage

from simple_alto_parser import AltoFileParser, AltoPatternParser, AltoFileExporter

# Create a parser instance and supply your data directory
alto_parser = AltoFileParser('data')
alto_parser.parse()

# Find and categorize by patterns
pattern_parser = AltoPatternParser(alto_parser)
pattern_parser.find(r'(^.*\& Cie\.$)').categorize('company_name').remove()

# Other options are: look up in dictionaries, perform spacy NER

# Export the data
alto_exporter = AltoFileExporter(alto_parser)
alto_exporter.save_csv('output/alto_test.csv', delimiter=',')
