A library to scrape .docx files with 'Entscheidungsbaumdiagramm' tables into a truly machine-readable structure
ebdamame
This repository contains the source code of the Python package ebdamame (formerly published as ebddocx2table), which extracts (scrapes) machine-readable tables that model a decision tree (Entscheidungsbaumdiagramm, EBD) from .docx files.
These decision trees are part of a regulatory framework for the German energy industry and are used in the initial validation (Eingangsprüfung) of market communication messages.
The machine-readable tables created with this package can be converted into real graphs and diagrams with rebdhuhn (formerly: ebdtable2graph).
Example results of the scraping are available as .json files in the repository machine-readable_entscheidungsbaumdiagramme.
Rationale
Assume that you want to analyse or visualize the Entscheidungsbaumdiagramme (EBD) by EDI@Energy. The website edi-energy.de, as always, only provides you with PDF or Word files instead of really digitized data.
The package ebdamame scrapes the .docx files and returns data in a model defined in the "sister" package rebdhuhn (formerly known as ebdtable2graph).
Once you have scraped the data (using this package), you can plot it with rebdhuhn.
Together, both packages form the ebd_toolchain, which scrapes EBD .docx files from the edi_energy_mirror and pushes the results to machine-readable_entscheidungsbaumdiagramme.
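If you only want to work with already scraped results, you can read the .json files from a local clone of machine-readable_entscheidungsbaumdiagramme. The following is a minimal sketch that uses only the standard library. The helper name `find_ebd_json_files` is hypothetical, and the sketch assumes that each file is named after its EBD key (like the `E_0003.json` written in the library example of this README); the actual folder layout of that repository may differ.

```python
from collections import defaultdict
from pathlib import Path


def find_ebd_json_files(clone_dir: Path) -> dict[str, list[Path]]:
    """Group all 'E_*.json' files found below clone_dir by their EBD key."""
    files_by_key: dict[str, list[Path]] = defaultdict(list)
    # Assumption: files are named '<ebd_key>.json', e.g. 'E_0003.json'.
    # rglob also searches sub-folders, so the same key may occur more than once.
    for path in sorted(clone_dir.rglob("E_*.json")):
        files_by_key[path.stem].append(path)
    return dict(files_by_key)
```

Calling `find_ebd_json_files(Path("machine-readable_entscheidungsbaumdiagramme"))` returns a dictionary that maps each key to the list of matching files, which you can then load with `json.load`.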
How to use the package
In any case, install the package from PyPI:
pip install ebdamame
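To check that the installation worked, you can query the installed version from Python. This is a small sketch based on the standard library module `importlib.metadata`; the helper name `installed_version` is hypothetical and not part of ebdamame.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package_name: str = "ebdamame") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None
```

`installed_version()` returns a version string if ebdamame is installed in the current environment and `None` otherwise.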
Use as a library
import json
from pathlib import Path

from ebdamame import get_ebd_docx_tables
from ebdamame.docxtableconverter import DocxTableConverter

docx_file_path = Path("unittests/test_data/ebd20230629_v34.docx")
# download this .docx file from edi-energy.de or find it in the unittests of this repository.
# https://github.com/Hochfrequenz/ebddocx2table/blob/main/unittests/test_data/ebd20230629_v34.docx
docx_tables = get_ebd_docx_tables(docx_file_path, ebd_key="E_0003")
converter = DocxTableConverter(
    docx_tables,
    ebd_key="E_0003",
    ebd_name="E_0003_Bestellung der Aggregationsebene RZ prüfen",
    chapter="MaBiS",
    section="7.42.1",
)
result = converter.convert_docx_tables_to_ebd_table()
with open(Path("E_0003.json"), "w+", encoding="utf-8") as result_file:
    # the result file can be found here:
    # https://github.com/Hochfrequenz/machine-readable_entscheidungsbaumdiagramme/tree/main/FV2310
    json.dump(result.model_dump(), result_file, ensure_ascii=False, indent=2, sort_keys=True)
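The example above repeats the key E_0003 in several places. If you scrape more than one EBD, a small wrapper can reduce that repetition. The following is a sketch, not part of the package: `split_ebd_name` and `scrape_to_json` are hypothetical helpers, and the assumption that every EBD name starts with its key (`E_` followed by four digits and an underscore) is derived only from the single example above. The ebdamame calls are the same ones used in the example.

```python
import json
import re
from pathlib import Path

# Assumption: EBD names look like '<key>_<title>' with keys such as 'E_0003'.
_EBD_NAME_PATTERN = re.compile(r"^(?P<key>E_\d{4})_(?P<title>.+)$")


def split_ebd_name(ebd_name: str) -> tuple[str, str]:
    """Split 'E_0003_Some title' into ('E_0003', 'Some title')."""
    match = _EBD_NAME_PATTERN.match(ebd_name)
    if match is None:
        raise ValueError(f"'{ebd_name}' does not start with an EBD key like 'E_0003_'")
    return match.group("key"), match.group("title")


def scrape_to_json(docx_file_path: Path, ebd_name: str, chapter: str, section: str, target_dir: Path) -> Path:
    """Scrape one EBD from the .docx file and write it to '<target_dir>/<ebd_key>.json'."""
    # Imported lazily so that split_ebd_name is usable without ebdamame installed.
    from ebdamame import get_ebd_docx_tables
    from ebdamame.docxtableconverter import DocxTableConverter

    ebd_key, _ = split_ebd_name(ebd_name)
    docx_tables = get_ebd_docx_tables(docx_file_path, ebd_key=ebd_key)
    converter = DocxTableConverter(
        docx_tables, ebd_key=ebd_key, ebd_name=ebd_name, chapter=chapter, section=section
    )
    result = converter.convert_docx_tables_to_ebd_table()
    target_path = target_dir / f"{ebd_key}.json"
    with open(target_path, "w", encoding="utf-8") as result_file:
        json.dump(result.model_dump(), result_file, ensure_ascii=False, indent=2, sort_keys=True)
    return target_path
```

With this wrapper, the example above becomes a single call: `scrape_to_json(docx_file_path, "E_0003_Bestellung der Aggregationsebene RZ prüfen", "MaBiS", "7.42.1", Path("."))`.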
Use as a CLI tool
to be written
How to use this Repository on Your Machine (for development)
Please follow the instructions in our Python Template Repository. For further information, see the Tox Repository.
Contribute
You are very welcome to contribute to this repository by opening a pull request against the main branch.
Related Tools and Context
This repository is part of the Hochfrequenz Libraries and Tools for a truly digitized market communication.