A Python module to read and parse ALTO files

simple-alto-parser

This is a simple parser for ALTO XML files. It is designed to do three tasks separately:

1. Extract the text from the ALTO XML file with the AltoFileParser class.

The simple alto parser extracts text from ALTO XML files and is tailored to OCR-recognized text (e.g. produced with “Transkribus”, https://readcoop.eu/de/transkribus/). It stores the extracted text together with its meta-information, which the user can define or manipulate: for example the year of publication, the page number, and pre-defined image sections. This structure ensures that each part of the text stays associated with its corresponding image and meta-information, so that it can be retrieved and displayed precisely at a later stage.
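To illustrate what this extraction step involves, here is a minimal sketch that uses only the Python standard library. It is not the package's implementation: the function `extract_lines`, the helper `local_name`, the record keys, and the sample document are all assumptions made for illustration. It reads the `CONTENT` attribute of each `String` element, joins the words per `TextLine`, and stores the result next to page, region, and user-supplied meta-information.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written ALTO document (illustrative data, not from the package).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page ID="p1" PHYSICAL_IMG_NR="12">
      <PrintSpace>
        <TextBlock ID="b1">
          <TextLine ID="l1" HPOS="100" VPOS="200" WIDTH="400" HEIGHT="30">
            <String CONTENT="Meyer"/><SP/><String CONTENT="&amp;"/><SP/><String CONTENT="Cie."/>
          </TextLine>
          <TextLine ID="l2" HPOS="100" VPOS="240" WIDTH="150" HEIGHT="30">
            <String CONTENT="Basel"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the XML namespace so the code works across ALTO schema versions."""
    return tag.rsplit('}', 1)[-1]


def extract_lines(alto_xml, file_meta=None):
    """Return one record per TextLine, combining text and meta-information."""
    root = ET.fromstring(alto_xml)
    records = []
    for page in root.iter():
        if local_name(page.tag) != 'Page':
            continue
        for line in page.iter():
            if local_name(line.tag) != 'TextLine':
                continue
            words = [el.get('CONTENT', '') for el in line
                     if local_name(el.tag) == 'String']
            text = ' '.join(words)
            record = dict(file_meta or {})  # user-defined meta-information
            record.update({
                'page': page.get('PHYSICAL_IMG_NR'),
                'line_id': line.get('ID'),
                # Image section of the line, kept for later display
                'region': {k: line.get(k)
                           for k in ('HPOS', 'VPOS', 'WIDTH', 'HEIGHT')},
                'original_text': text,  # never modified afterwards
                'text': text,           # working copy for later parsing
            })
            records.append(record)
    return records


lines = extract_lines(SAMPLE_ALTO, {'year': 1923})
for record in lines:
    print(record['page'], record['line_id'], record['text'])
```

Keeping `original_text` separate from the working `text` field mirrors the idea described above that every part of the text remains traceable to its source.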

2. Extract structured information from the text with different parsing methods.

There are two fundamental parsing methods: 1. Extraction of relevant text segments through pattern recognition with regular expressions (regex). 2. Matching against predefined dictionaries; users can create their own dictionaries or use pre-existing ones (e.g. for country names). Text segments recognized by either method can be categorized within the same function call and then removed from the text that remains to be parsed. This allows a multi-stage parsing approach with a low entry threshold, in which uncategorized text segments remain visible after each parsing iteration.

The simple alto parser also has a replace function for normalizing text segments during parsing, and irrelevant text segments can be removed from the structured text. A dedicated function for manual adjustments covers cases where neither dictionary nor pattern matching is sufficient. Together, these methods allow the text to be structured comprehensively while preserving readability and retaining all meta-information. For transparency and traceability, the original text is always preserved, so that all manipulations remain visible.
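The multi-stage idea can be sketched in plain Python as follows. This is an illustration of the technique only and does not use the package's API: `make_lines`, `apply_pattern`, `apply_dictionary`, and the record layout are hypothetical names chosen for this example. Each stage categorizes what it recognizes and removes it from the working text, so whatever is left stays visible for the next stage.

```python
import re


def make_lines(texts):
    """Wrap raw strings in records that keep the original text untouched."""
    return [{'original_text': t, 'text': t, 'categories': {}} for t in texts]


def apply_pattern(lines, pattern, category):
    """Regex stage: categorize the first match and remove it from the text."""
    regex = re.compile(pattern)
    for line in lines:
        match = regex.search(line['text'])
        if match:
            line['categories'][category] = match.group(0)
            remaining = line['text'][:match.start()] + line['text'][match.end():]
            line['text'] = remaining.strip()


def apply_dictionary(lines, dictionary, category):
    """Dictionary stage: map a known surface form to its normalized value."""
    for line in lines:
        for surface, normalized in dictionary.items():
            if surface in line['text']:
                line['categories'][category] = normalized
                line['text'] = line['text'].replace(surface, '', 1).strip()
                break


# Illustrative input and dictionary (assumptions, not package data).
lines = make_lines(['Meyer & Cie. Schweiz', 'Dupont Frankreich 1923'])
countries = {'Schweiz': 'Switzerland', 'Frankreich': 'France'}

apply_pattern(lines, r'\b(18|19)\d{2}\b', 'year')   # stage 1: years
apply_dictionary(lines, countries, 'country')       # stage 2: countries

for line in lines:
    # 'text' now holds only the segments that no stage has categorized yet
    print(line['original_text'], '|', line['text'], '|', line['categories'])
```

After both stages, the uncategorized remainder of each line is what a further pattern, dictionary, or manual adjustment would work on.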

3. Export the structured text and information for further processing, analysis, publication etc.

The structured texts can be exported as CSV, TSV, or JSON, and the export parameters can be customized. This allows tailored compilations of the structured texts, for example including all meta-information or only the final structured texts, so that the data can be prepared directly for publication or further analysis in the desired format. In addition, JSON files are generated for the dictionaries created during parsing. These files can also be exported in a structured manner for publication or use elsewhere.
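As a rough illustration of such a configurable export, the sketch below uses the standard `csv` and `json` modules. It is not the package's exporter: `export_delimited`, `export_json`, and the example record are hypothetical. The `fields` argument selects which columns appear, which is one simple way to switch between a full export with meta-information and a reduced one.

```python
import csv
import io
import json

# Illustrative structured record (assumption, not package output).
records = [{
    'text': 'Meyer & Cie.',
    'original_text': 'Meyer & Cie. Schweiz',
    'country': 'Switzerland',
    'page': '12',
}]


def export_delimited(records, fields, stream, delimiter=','):
    """Write the selected fields as CSV (',') or TSV ('\\t')."""
    writer = csv.DictWriter(stream, fieldnames=fields, delimiter=delimiter,
                            extrasaction='ignore', lineterminator='\n')
    writer.writeheader()
    writer.writerows(records)


def export_json(records, fields, stream):
    """Write the selected fields as a JSON array of objects."""
    selected = [{f: r.get(f) for f in fields} for r in records]
    json.dump(selected, stream, ensure_ascii=False, indent=2)


# Reduced export: only the final structured values, tab-separated
tsv_out = io.StringIO()
export_delimited(records, ['text', 'country'], tsv_out, delimiter='\t')

# Full export: all meta-information, as JSON
json_out = io.StringIO()
export_json(records, ['text', 'original_text', 'country', 'page'], json_out)

print(tsv_out.getvalue())
print(json_out.getvalue())
```

Writing to a stream rather than a path keeps the sketch free of file-system side effects; an open file object can be passed in the same way.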

Usage

from simple_alto_parser import AltoFileParser, AltoPatternParser, AltoFileExporter

# Create a parser instance and supply your data directory
alto_parser = AltoFileParser('data')
alto_parser.parse()

# Find and categorize by patterns
pattern_parser = AltoPatternParser(alto_parser)
pattern_parser.find(r'(^.*\& Cie\.$)').categorize('company_name').remove()

# Other options: look up terms in dictionaries, perform spaCy NER

# Export the data
alto_exporter = AltoFileExporter(alto_parser)
alto_exporter.save_csv('output/alto_test.csv', delimiter=',')
