The XML-to-OCDS parser for the TEDective project based on lxml
etl
The code in this repo is part of the TEDective project. It defines an ETL pipeline to transform European public procurement data from Tenders Electronic Daily (TED) into a format that's easier to handle and analyse. Primarily, the TED XMLs (and eForms, WIP) are transformed into Open Contracting Data Standard (OCDS) JSON and parquet files to ease importing the data into a:
- Graph database (KuzuDB in our case, but the processed data should be generic enough to support any graph database), and a
- Search engine (Meilisearch in our case)
Organizations are deduplicated using Splink and linked to their GLEIF identifiers (WIP) before they are imported into the graph database.
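To give a rough idea of what deduplicating organizations involves, here is a minimal sketch. It is not Splink and not the code used in this repository: Splink does probabilistic record linkage, whereas this example only normalises names and compares them with the standard library's `difflib`. The suffix list, the `normalise` and `cluster_names` helpers, and the threshold are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher

# Illustrative, non-exhaustive list of legal-form suffixes to ignore (assumption).
LEGAL_SUFFIXES = {"gmbh", "ltd", "sa", "bv", "ag"}


def normalise(name: str) -> str:
    """Lowercase, replace punctuation with spaces and drop legal-form suffixes."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(tok for tok in cleaned.split() if tok not in LEGAL_SUFFIXES)


def cluster_names(names, threshold=0.9):
    """Greedily group names whose normalised forms are near-identical.

    Each name is compared with the first member of every existing cluster;
    it joins the first cluster that reaches the similarity threshold.
    """
    clusters = []
    for name in names:
        key = normalise(name)
        for cluster in clusters:
            if SequenceMatcher(None, key, normalise(cluster[0])).ratio() >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


if __name__ == "__main__":
    print(cluster_names(["Acme GmbH", "ACME gmbh.", "Beta Ltd", "Acme"]))
```

A real record-linkage setup would typically compare several fields (address, country, identifiers) and weigh the evidence statistically rather than rely on a single string-similarity threshold.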
Background
The TEDective project aims to make European public procurement data explorable for non-experts. This transformation is more or less based on the Open Contracting Data Standard (OCDS) EU Profile.
As such, this pipeline can be used standalone or as part of your project that does something interesting with TED data. We use it ourselves for the TEDective API that powers the TEDective UI.
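As a rough illustration of the XML-to-OCDS idea, the sketch below maps a toy notice onto a handful of OCDS release fields (`ocid`, `id`, `tag`, `buyer`, `parties`, `tender`). It is not the parser from this repository: the XML element names, the `ocid` prefix, and the `notice_to_release` helper are made up for the example, and it uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs without extra dependencies. Real TED XML is namespaced and considerably more complex.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified notice; NOT the real TED schema.
SAMPLE_NOTICE = """
<NOTICE id="2024-000001">
  <TITLE>Road maintenance services</TITLE>
  <BUYER>
    <NAME>City of Example</NAME>
  </BUYER>
  <VALUE currency="EUR">125000.50</VALUE>
</NOTICE>
"""


def notice_to_release(xml_text: str) -> dict:
    """Map the toy notice above onto a few OCDS release fields."""
    root = ET.fromstring(xml_text.strip())
    notice_id = root.get("id")
    buyer_name = root.findtext("BUYER/NAME")
    value = root.find("VALUE")
    return {
        "ocid": f"ocds-example-{notice_id}",  # hypothetical ocid prefix
        "id": notice_id,
        "tag": ["tender"],
        "buyer": {"name": buyer_name},
        "parties": [{"name": buyer_name, "roles": ["buyer"]}],
        "tender": {
            "title": root.findtext("TITLE"),
            "value": {
                "amount": float(value.text),
                "currency": value.get("currency"),
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(notice_to_release(SAMPLE_NOTICE), indent=2))
```

Keeping the mapping in a single function that returns a plain dictionary makes it straightforward to serialise the result as JSON or to flatten it into tabular form later.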
Install
This will be available on PyPI soon. Until then you can install it via Nix:
# Install flake into your profile
nix profile install git+https://git.fsfe.org/TEDective/etl
run-pipeline --help
Alternatively, you can clone this repository and build it via Nix yourself:
git clone https://git.fsfe.org/TEDective/etl
cd etl
nix-build
result/bin/run-pipeline --help
Another way is to use poetry directly:
poetry install
poetry run run-pipeline --help
Running the pipeline requires a running Luigi daemon. It is included in the project and you can start it with the following command:
# If using Nix
result/bin/run-server
# If using poetry
poetry run run-server
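Before starting the pipeline it can be handy to check that the Luigi daemon is actually reachable. The following is a minimal sketch that only tests whether something accepts TCP connections on a given host and port. It assumes the daemon listens on Luigi's default scheduler port 8082 on localhost; whether `run-server` uses that default is an assumption, and the `scheduler_reachable` helper is hypothetical, not part of this project.

```python
import socket


def scheduler_reachable(host: str = "localhost", port: int = 8082, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout.

    8082 is Luigi's default scheduler port; adjust it if the daemon
    is configured differently (assumption, see above).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    if scheduler_reachable():
        print("Luigi scheduler appears to be up.")
    else:
        print("No scheduler reachable; start it with run-server first.")
```

Note that an open port only shows that some service is listening there; it does not prove that the service is Luigi.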
Usage
:construction: This is still under heavy development.
Maintainers
Contributing
The easiest way to start developing is to use devenv via the provided flake.nix. So, clone this repository and run:
# If you have Nix installed
nix develop --impure
# This will drop you into a shell with all the dependencies installed
# If you want to bring up a meilisearch instance, simply run:
devenv up
Small note: If editing the README, please conform to the standard-readme specification. Also, please ensure that documentation is kept in sync with the code. Please note that the main documentation repository is added to this repository via git-subrepo. To update the documentation, please use the following commands:
git-subrepo pull docs
cd ./docs
# Make your changes
git commit -am "docs: update documentation for new feature"
# Preview your changes
pnpm install
pnpm run dev
# If you're happy with your changes, push them
git-subrepo push docs
License
EUPL-1.2 © 2024 Free Software Foundation Europe e.V.