ETL (Extract, Transform, and Load) for bibliographic data using an advanced data workflow

Kahi

KAHI is a powerful ETL (Extract, Transform, Load) application designed to construct an academic database by merging databases and files from various sources. It simplifies the database construction process by offering a framework to define a workflow of sequential tasks using a plugin system that KAHI understands.

Plugins

Take a look at the plugin examples in the repository https://github.com/colav/Kahi_plugins

List of available plugins:

  • kahi_doaj_sources
  • kahi_minciencias_opendata_affiliations
  • kahi_minciencias_opendata_person
  • kahi_openalex_person
  • kahi_openalex_sources
  • kahi_openalex_subjects
  • kahi_ror_affiliations
  • kahi_scienti_affiliations
  • kahi_scienti_person
  • kahi_scienti_sources
  • kahi_scimago_sources
  • kahi_staff_udea_affiliations
  • kahi_staff_udea_person
  • kahi_wikipedia_affiliations
  • kahi_works

Installation

To install KAHI, follow these simple steps:

  1. Make sure you have Python installed on your system.
  2. Open a terminal or command prompt.
  3. Run the following command:
pip install kahi

Additionally, if you require specific plugins, you can install them using the following command:

pip install plugin-name

Replace plugin-name with the name of the desired plugin.

To install all available plugins, run:

pip install kahi[all]
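
After installing, it can be handy to confirm which packages are actually present. The following is a minimal sketch using only the Python standard library; it assumes the distributions are registered under the names used in the plugin list above (for example `kahi_scimago_sources`), and the helper name `installed_version` is hypothetical, not part of KAHI.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # ASSUMPTION: these distribution names match the plugin list in this README.
    for name in ("kahi", "kahi_scimago_sources", "kahi_doaj_sources"):
        version = installed_version(name)
        print(f"{name}: {version or 'not installed'}")
```

Any name that prints `not installed` can then be installed with `pip install` as described above.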

Usage

To use KAHI, you need to define a YAML file that contains the workflow and global configuration variables. Here is an example of a YAML file:

config:
  database_url: localhost:27017
  database_name: kahi
  log_database: kahi_log
  log_collection: log
workflow:
  scimago_sources:
    file_path: scimago/scimagojr 2020.csv
  doaj_sources:
    database_url: localhost:27017
    database_name: doaj
    collection_name: stage

In the config section, you can specify the MongoDB URL, database name, log database, and log collection for KAHI to use.

The workflow section contains the sequential tasks of the workflow. Each task is defined with a unique name and specific configuration options based on the data source. In the example above, two tasks are defined: scimago_sources and doaj_sources. Every task must correspond to a plugin.
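
To make the relationship between tasks and plugins concrete, here is a small sketch that lists the tasks of the example workflow in order, together with the plugin package each one would likely need. The dictionary mirrors the YAML example above (in practice it would come from a YAML parser). The rule "package name = `kahi_` + task name" is an assumption inferred from the plugin list, and `plugin_packages` is a hypothetical helper, not a KAHI function.

```python
# Plain dict mirroring the YAML example; a YAML parser would produce the same shape.
workflow_file = {
    "config": {
        "database_url": "localhost:27017",
        "database_name": "kahi",
        "log_database": "kahi_log",
        "log_collection": "log",
    },
    "workflow": {
        "scimago_sources": {"file_path": "scimago/scimagojr 2020.csv"},
        "doaj_sources": {
            "database_url": "localhost:27017",
            "database_name": "doaj",
            "collection_name": "stage",
        },
    },
}


def plugin_packages(workflow_file, prefix="kahi_"):
    """Return (task, package) pairs in workflow order.

    ASSUMPTION: each task maps to a package named prefix + task name.
    """
    return [(task, prefix + task) for task in workflow_file["workflow"]]


if __name__ == "__main__":
    for task, package in plugin_packages(workflow_file):
        print(f"task {task!r} -> pip install {package}")
```

Because Python dictionaries preserve insertion order, the tasks come out in the same sequence in which they appear in the workflow section.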

Finally, to run the workflow, use the following command:

kahi_run --workflow workflow.yaml

Replace workflow.yaml with the path to your YAML file.
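
If the workflow needs to be launched from a script (for example from a scheduler), the same command can be invoked from Python. This is a sketch built only on the `kahi_run --workflow` invocation shown above; the function names `build_command` and `run_workflow` are hypothetical wrappers, and no other command-line options are assumed.

```python
import shutil
import subprocess


def build_command(workflow_path):
    """Build the argument list for the kahi_run command shown in this README."""
    return ["kahi_run", "--workflow", str(workflow_path)]


def run_workflow(workflow_path):
    """Run the workflow and return the process exit code."""
    if shutil.which("kahi_run") is None:
        raise FileNotFoundError("kahi_run not found on PATH; is KAHI installed?")
    # check=False: the caller decides how to react to a non-zero exit code.
    return subprocess.run(build_command(workflow_path), check=False).returncode
```

Calling `run_workflow("workflow.yaml")` would then behave like typing the command in a terminal.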

Logging

KAHI keeps a detailed log of each task's execution in a MongoDB collection, including the task name, execution time, elapsed time, execution status, and error messages. This information is valuable for both users and developers, and it makes it possible to resume the workflow from the last successful task.
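
The idea of resuming from the log can be illustrated with a short sketch. It does not reproduce KAHI's internal logic or its real log schema: the field names `task` and `status`, the value `"ok"`, and the function `next_task_to_run` are all assumptions made for illustration, and the log entries are plain dictionaries rather than documents read from MongoDB.

```python
def next_task_to_run(task_order, log_entries):
    """Return the first task without a successful log entry, or None if all succeeded.

    ASSUMPTION: each log entry is a dict with hypothetical keys "task" and "status",
    where status "ok" marks a successful run.
    """
    succeeded = {e["task"] for e in log_entries if e.get("status") == "ok"}
    for task in task_order:
        if task not in succeeded:
            return task
    return None


if __name__ == "__main__":
    order = ["scimago_sources", "doaj_sources"]
    log = [
        {"task": "scimago_sources", "status": "ok"},
        {"task": "doaj_sources", "status": "error"},
    ]
    print(next_task_to_run(order, log))
```

With the sample log, the first task is skipped because it already succeeded, and the workflow would pick up again at the task that failed.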

Contributing

If you are interested in contributing to KAHI or creating your own plugins, please refer to the Kahi_plugins repository (https://github.com/colav/Kahi_plugins). It contains the resources and documentation needed to implement new plugins. Feel free to submit pull requests or report any issues you encounter.

License

BSD-3-Clause License

Links

http://colav.udea.edu.co/
