A Python module to read and parse ALTO files

simple-alto-parser

This is a simple parser for ALTO XML files. It is designed to do three tasks separately:

1. Extract the text from the ALTO XML file with the AltoTextParser class.

The simple alto parser extracts text from ALTO XML files and is tailored to OCR-recognized texts (e.g. produced with “Transkribus”, https://readcoop.eu/de/transkribus/). It organizes the extracted text alongside its meta-information, which the user can define or modify: for example the year of publication, the page number, and pre-defined image sections. This structure ensures that each part of the text stays associated with its corresponding image and meta-information, so that it can be retrieved and displayed precisely at a later stage.
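To illustrate what this extraction step involves, the following minimal sketch reads text lines from an ALTO document with Python's standard library only. It is not the package's implementation: the sample XML, the function name extract_lines, the record keys, and the file_meta argument are assumptions made for this example. It relies only on the fact that ALTO stores recognized words in the CONTENT attribute of String elements inside TextLine elements.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written ALTO v4 fragment (hypothetical sample data).
ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page ID="p1" PHYSICAL_IMG_NR="12">
      <PrintSpace>
        <TextBlock ID="b1">
          <TextLine ID="l1" HPOS="100" VPOS="200" WIDTH="400" HEIGHT="30">
            <String CONTENT="Meyer"/>
            <SP/>
            <String CONTENT="&amp;"/>
            <SP/>
            <String CONTENT="Cie."/>
          </TextLine>
          <TextLine ID="l2" HPOS="100" VPOS="240" WIDTH="300" HEIGHT="30">
            <String CONTENT="Basel"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""

NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}


def extract_lines(alto_xml, file_meta=None):
    """Return one record per TextLine: its text plus page and position metadata."""
    root = ET.fromstring(alto_xml)
    records = []
    for page in root.iterfind(".//alto:Page", NS):
        for line in page.iterfind(".//alto:TextLine", NS):
            # Each recognized word is stored in the CONTENT attribute of a String.
            words = [s.get("CONTENT", "") for s in line.iterfind("alto:String", NS)]
            record = {
                "text": " ".join(words),
                "page": page.get("PHYSICAL_IMG_NR"),
                "line_id": line.get("ID"),
                # Bounding box on the page image, so the text can be linked back to it.
                "bbox": tuple(float(line.get(k, 0)) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT")),
            }
            # User-supplied meta-information (e.g. year of publication) is attached to every line.
            record.update(file_meta or {})
            records.append(record)
    return records


lines = extract_lines(ALTO_SAMPLE, file_meta={"year": 1925})
for line in lines:
    print(line)
```

Keeping the bounding box and page number on every record is what allows a text line to be traced back to its image section later on.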

2. Extract structured information from the text with different parsing methods.

There are two fundamental parsing methods: pattern matching with regular expressions (regex), which extracts relevant text segments, and matching against predefined dictionaries. Users can create their own dictionaries or use existing ones (e.g. for country names). Text segments recognized by either method can be categorized in the same call and then removed from the text that remains to be parsed. This allows a multi-stage parsing approach with a low entry threshold, in which the text segments that are still uncategorized remain visible after each parsing iteration. The simple alto parser also provides a replace function for normalizing text segments during parsing, and irrelevant text segments can be removed from the structured text. A dedicated function for manual adjustments covers cases in which neither dictionary nor pattern matching is sufficient. Together, these methods allow the text to be structured comprehensively while it stays readable and keeps all its meta-information. For transparency and traceability, the original text is always preserved, so that every manipulation remains visible.
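The multi-stage idea described above can be sketched in plain Python with the re module. This is an illustration under stated assumptions, not the package's API: the record layout, the COUNTRIES dictionary, and the helper names match_pattern and match_dictionary are hypothetical. Each stage categorizes what it recognizes and removes it from the working text, while the original text is kept untouched.

```python
import re

# Hypothetical records, as they might look after text extraction.
records = [
    {"original": "Meyer & Cie.", "text": "Meyer & Cie.", "categories": {}},
    {"original": "Schweiz, gegr. 1901", "text": "Schweiz, gegr. 1901", "categories": {}},
    {"original": "Unclear entry", "text": "Unclear entry", "categories": {}},
]

# Hypothetical dictionary mapping surface forms to normalized values.
COUNTRIES = {"Schweiz": "Switzerland", "Deutschland": "Germany"}


def match_pattern(record, pattern, category):
    """Categorize the first regex match (group 1) and remove it from the working text."""
    match = re.search(pattern, record["text"])
    if match:
        record["categories"][category] = match.group(1)
        remaining = record["text"][:match.start()] + record["text"][match.end():]
        record["text"] = remaining.strip(" ,")


def match_dictionary(record, dictionary, category):
    """Categorize the first dictionary entry found and remove it from the working text."""
    for surface, normalized in dictionary.items():
        if surface in record["text"]:
            record["categories"][category] = normalized
            record["text"] = record["text"].replace(surface, "", 1).strip(" ,")
            break


for record in records:
    match_pattern(record, r"(^.*& Cie\.$)", "company_name")
    match_dictionary(record, COUNTRIES, "country")
    match_pattern(record, r"gegr\. (\d{4})", "founded")

# Whatever is still in "text" was not recognized by any stage.
leftovers = [r["text"] for r in records if r["text"]]
print(leftovers)
```

Inspecting the leftovers after each pass shows which patterns or dictionary entries are still missing, which is what makes the iterative approach practical.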

3. Export the structured text and information for further processing, analysis, publication etc.

The structured texts can be exported in various formats such as CSV, TSV, or JSON, and the export parameters can be customized to suit specific requirements. This allows tailored compilations of the structured texts, for example including all meta-information or only the final structured texts, so that the data can be prepared directly for publication or for further analysis in the desired format. In addition, JSON files are generated for the dictionaries created during the parsing process. These files can also be exported in a structured manner for publication or reuse elsewhere.
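As a rough illustration of such tailored exports, the sketch below serializes the same records once with all columns and once with only the structured fields, using the standard csv and json modules. The record layout and the helper name to_delimited are assumptions for this example and do not reflect the package's exporter.

```python
import csv
import io
import json

# Hypothetical structured records after parsing.
records = [
    {"file": "page_001.xml", "page": 1, "original": "Meyer & Cie., Schweiz",
     "company_name": "Meyer & Cie.", "country": "Switzerland"},
    {"file": "page_001.xml", "page": 1, "original": "Huber AG, Deutschland",
     "company_name": "Huber AG", "country": "Germany"},
]


def to_delimited(rows, columns, delimiter=","):
    """Serialize the selected columns as CSV (comma) or TSV (tab) text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        delimiter=delimiter,
        extrasaction="ignore",  # silently skip keys that are not exported
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Full compilation: meta-information, original text, and structured fields.
full_csv = to_delimited(records, ["file", "page", "original", "company_name", "country"])

# Reduced compilation: only the final structured fields, tab-separated.
structured_tsv = to_delimited(records, ["company_name", "country"], delimiter="\t")

# JSON keeps every key of every record.
as_json = json.dumps(records, ensure_ascii=False, indent=2)

print(structured_tsv)
```

Choosing the columns at export time, rather than while parsing, means one parsed data set can serve both a publication-ready table and a complete archive copy.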

Usage

from simple_alto_parser import AltoFileParser, AltoPatternParser, AltoFileExporter

# Create a parser instance and supply your data directory
alto_parser = AltoFileParser('data')
alto_parser.parse()

# Find and categorize by patterns
pattern_parser = AltoPatternParser(alto_parser)
pattern_parser.find(r'(^.*\& Cie\.$)').categorize('company_name').remove()

# Other options: look up terms in dictionaries or run spaCy NER

# Export the data
alto_exporter = AltoFileExporter(alto_parser)
alto_exporter.save_csv('output/alto_test.csv', delimiter=',')
