Extract structured metadata from git repositories.

Gimie

Gimie (GIt Meta Information Extractor) is a python library and command line tool to extract structured metadata from git repositories.

Warning: Gimie is at an early development stage. It is not yet functional.

Context

Scientific code repositories contain valuable metadata which can be used to enrich existing catalogues, platforms or databases. This tool aims to make it easy to extract structured metadata from any git repository. The following sources of information are used:

  • GitHub API
  • GitLab API
  • Local Git metadata
  • License text
  • Free text in README
  • Renku project metadata

Installation

To install the stable version from PyPI:

pip install gimie

To install the dev version from github:

pip install git+https://github.com/SDSC-ORD/gimie.git@main#egg=gimie

Gimie is also available as a docker container hosted on the Github container registry:

docker pull ghcr.io/sdsc-ord/gimie:latest

# The access token can be provided as an environment variable
docker run -e ACCESS_TOKEN=$ACCESS_TOKEN ghcr.io/sdsc-ord/gimie:latest gimie data <repo>

For development:

Activate a conda or virtual environment with Python 3.8 or higher, then clone the repository and install it:

git clone https://github.com/SDSC-ORD/gimie && cd gimie
make install

Run tests:

make test

Run checks:

make check

Usage

Set your github credentials

To avoid rate limits with the GitHub API, you need to provide your GitHub username and a GitHub token. See GitHub's documentation on how to generate a personal access token.

There are two options for setting up your GitHub token in your local environment:

Option 1:

cp .env.dist .env

Then edit the .env file and add your GitHub token.

Option 2:

Add your github token in your terminal:

export ACCESS_TOKEN=

Once the token is set, you can run the command without hitting the GitHub API rate limit. Without a token the command still works, but you may hit that limit after running it several times.
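Before running gimie, it can help to check that a token is actually visible to your process. The sketch below is a pre-flight check only, not how gimie itself loads the token: `read_env_file` and `resolve_token` are hypothetical helper names, and the precedence (environment variable first, then the .env file) is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def read_env_file(path: Path) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and comments."""
    values = {}
    if not path.is_file():
        return values
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_token(
    environ: Mapping[str, str] = os.environ,
    env_file: Path = Path(".env"),
) -> Optional[str]:
    """Return ACCESS_TOKEN from the environment, else from .env, else None.

    The lookup order is an assumption made for this sketch.
    """
    token = environ.get("ACCESS_TOKEN") or read_env_file(env_file).get("ACCESS_TOKEN")
    return token or None


if __name__ == "__main__":
    if resolve_token() is None:
        print("No ACCESS_TOKEN found; requests may be rate limited.")
    else:
        print("ACCESS_TOKEN found.")
```

An empty value such as `ACCESS_TOKEN=` is treated the same as a missing token.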

Run the command

As a command line tool:

gimie data https://github.com/numpy/numpy

As a python library:

from gimie.project import Project
proj = Project("https://github.com/numpy/numpy")

# To retrieve the rdflib.Graph object
g = proj.to_graph()

# To retrieve the serialized graph
proj.serialize(format='ttl')

Or to extract only from a specific source:

from gimie.sources.remote import GithubExtractor
gh = GithubExtractor('https://github.com/SDSC-ORD/gimie')
gh.extract()

# To retrieve the rdflib.Graph object
g = gh.to_graph()

# To retrieve the serialized graph
gh.serialize(format='ttl')

Outputs

The default output is JSON-LD, a JSON serialization of the RDF data model. We follow the schema recommended by CodeMeta. Supported formats are json-ld, turtle and n-triples.
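Because JSON-LD is plain JSON, the output can be inspected with Python's standard `json` module before reaching for an RDF library. The following is a minimal sketch: the document in `doc_text` is a hypothetical example written for illustration, not actual gimie output, and `iter_nodes` is a helper name introduced here.

```python
import json

# Hypothetical JSON-LD snippet for illustration only; not actual gimie output.
doc_text = """
{
  "@context": {"schema": "http://schema.org/"},
  "@id": "https://github.com/numpy/numpy",
  "@type": "schema:SoftwareSourceCode",
  "schema:name": "numpy"
}
"""


def iter_nodes(doc):
    """Yield node objects from a JSON-LD document.

    Handles the three common top-level shapes: a single node,
    a list of nodes, or an object wrapping nodes in "@graph".
    """
    if isinstance(doc, list):
        for item in doc:
            yield from iter_nodes(item)
    elif isinstance(doc, dict):
        if "@graph" in doc:
            yield from iter_nodes(doc["@graph"])
        else:
            yield doc


nodes = list(iter_nodes(json.loads(doc_text)))
for node in nodes:
    # "@id" identifies the resource, "@type" gives its class.
    print(node.get("@id"), node.get("@type"))
```

This only reads the JSON structure; it does not expand prefixes or resolve the `@context`, which is the job of an RDF library such as rdflib.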

Contributing

All contributions are welcome. New functions and classes should have associated tests and docstrings following the numpy style guide.

The code formatting standard we use is black, with --line-length=79 to follow PEP8 recommendations. We use pytest as our testing framework. This project uses pyproject.toml to define package information, requirements and tooling configuration.

Releases and Publishing on PyPI

Releases are done via GitHub releases:

  • A release triggers a GitHub workflow that publishes the package on PyPI.
  • Make sure to update the version in pyproject.toml before making the release.
  • Publishing can be tested on TestPyPI by running a manual workflow: go to GitHub Actions and run the workflow 'Publish on Pypi Test'.
