
A library to scrape .docx files with 'Entscheidungsbaumdiagramm' tables into a truly machine-readable structure


ebddocx2table

[!IMPORTANT] ⚠ This is the last version using the name ebddocx2table. Both the repository and the Python package will be renamed to ebdamame.


This repository contains the source code of the Python package ebddocx2table, which extracts (scrapes) machine-readable tables that model a decision tree (Entscheidungsbaumdiagramm, EBD) from .docx files. These decision trees are part of a regulatory framework for the German energy industry and are used in the inbound validation of market communication messages. The machine-readable tables produced by this package can be converted into real graphs and diagrams with ebdtable2graph. Sample scraping results are available as .json files in the repository machine-readable_entscheidungsbaumdiagramme.

Rationale

Assume that you want to analyse or visualize the Entscheidungsbaumdiagramme (EBD) published by EDI@Energy. The website edi-energy.de, as always, only provides PDF or Word files instead of truly digitized data.

The package ebddocx2table scrapes the .docx files and returns data in a model defined in the "sister" package ebdtable2graph.

Once you have scraped the data (using this package), you can plot it with ebdtable2graph.
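
To make "machine readable" more concrete, here is a minimal sketch of how a decision tree table could be represented and traversed in code. The class names, field names and sample content are illustrative assumptions only. They are not the actual model from ebdtable2graph and the sample steps are not taken from any real EBD.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Outcome:
    """One possible answer to a check (hypothetical, simplified)."""

    answer: bool
    next_step: Optional[str] = None  # step number to continue with, if any
    result_code: Optional[str] = None  # terminal code if the tree ends here


@dataclass(frozen=True)
class Step:
    """One row of the decision table (hypothetical, simplified)."""

    number: str
    question: str
    outcomes: List[Outcome]


def walk(steps: List[Step], answers: Dict[str, bool], start: str) -> Optional[str]:
    """Follow the given answers from the start step until a result code is reached."""
    by_number = {step.number: step for step in steps}
    current: Optional[str] = start
    while current is not None:
        step = by_number[current]
        # pick the outcome that matches the answer given for this step
        outcome = next(o for o in step.outcomes if o.answer == answers[step.number])
        if outcome.result_code is not None:
            return outcome.result_code
        current = outcome.next_step
    return None


# Made-up placeholder content, only to exercise the structure.
example_steps = [
    Step("1", "Placeholder check A?", [Outcome(False, result_code="X01"), Outcome(True, next_step="2")]),
    Step("2", "Placeholder check B?", [Outcome(False, result_code="X02"), Outcome(True, result_code="OK")]),
]

first_result = walk(example_steps, {"1": False}, start="1")
second_result = walk(example_steps, {"1": True, "2": False}, start="1")
```

Once the table is available as structured data like this, questions such as "which result code does this sequence of answers lead to?" become a simple lookup, which is not possible with the Word or PDF originals.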

How to use the package

In any case, install the package from PyPI:

pip install ebddocx2table

Use as a library

import json
from pathlib import Path

import cattrs

from ebddocx2table import TableNotFoundError, get_all_ebd_keys, get_ebd_docx_tables  # type:ignore[import]
from ebddocx2table.docxtableconverter import DocxTableConverter  # type:ignore[import]

docx_file_path = Path("unittests/test_data/ebd20230629_v34.docx")
# Download this .docx file from edi-energy.de or find it in the unit tests of this repository:
# https://github.com/Hochfrequenz/ebddocx2table/blob/main/unittests/test_data/ebd20230629_v34.docx
docx_tables = get_ebd_docx_tables(docx_file_path, ebd_key="E_0003")
converter = DocxTableConverter(
    docx_tables,
    ebd_key="E_0003",
    chapter="MaBiS",
    sub_chapter="7.42.1: AD: Bestellung der Aggregationsebene der Bilanzkreissummenzeitreihe auf Ebene der Regelzone",
)
result = converter.convert_docx_tables_to_ebd_table()
with open("E_0003.json", "w", encoding="utf-8") as result_file:
    # the result file can be found here:
    # https://github.com/Hochfrequenz/machine-readable_entscheidungsbaumdiagramme/tree/main/FV2310
    json.dump(cattrs.unstructure(result), result_file, ensure_ascii=False, indent=2, sort_keys=True)
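
If you want to scrape more than one EBD from the same file, a small loop that skips keys without a table can be handy. The following is a sketch only: `scrape_many` and `fake_scrape` are hypothetical helpers, not part of ebddocx2table, and the scraping itself is passed in as a callable so that the snippet runs without a .docx file.

```python
from typing import Callable, Dict, Iterable, List, Tuple, Type, TypeVar

T = TypeVar("T")


def scrape_many(
    ebd_keys: Iterable[str],
    scrape_one: Callable[[str], T],
    skip_on: Tuple[Type[Exception], ...] = (LookupError,),
) -> Tuple[Dict[str, T], List[str]]:
    """Call scrape_one for every key; collect the results and the keys that were skipped."""
    results: Dict[str, T] = {}
    skipped: List[str] = []
    for key in ebd_keys:
        try:
            results[key] = scrape_one(key)
        except skip_on:
            # no usable table for this key: remember it and continue with the next one
            skipped.append(key)
    return results, skipped


def fake_scrape(key: str) -> str:
    """Stand-in for the real scraping step (hypothetical, for demonstration only)."""
    if key == "E_9999":
        raise LookupError(key)
    return f"table for {key}"


results, skipped = scrape_many(["E_0003", "E_9999"], fake_scrape)
```

With the real package, `scrape_one` could wrap the `get_ebd_docx_tables` and `DocxTableConverter` calls from the example above, and `skip_on` could be set to `(TableNotFoundError,)`, assuming that this is the exception raised when a key has no table in the file.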

Use as a CLI tool

to be written

How to use this Repository on Your Machine (for development)

Please follow the instructions in our Python Template Repository. For further information, see the Tox Repository.

Contribute

You are very welcome to contribute to this repository by opening a pull request against the main branch.

Related Tools and Context

This repository is part of the Hochfrequenz Libraries and Tools for a truly digitized market communication.
