
Extract provenance information (W3C PROV) from GitLab projects.


🌱 gitlab2prov: Extract Provenance from GitLab Projects

License: MIT

gitlab2prov is a Python library and command line tool for extracting provenance information from GitLab projects.

The data model employed by gitlab2prov is modelled according to the W3C PROV specification. More information about the provenance model can be found in /docs.

🏗️ Installation

Clone the project and use the provided setup.py to install gitlab2prov.

python setup.py install --user

👩‍💻 Usage

gitlab2prov can be used as a command line script and as a Python library.

To extract provenance from a GitLab project, follow these steps:

Instructions                                                Config Option
1. Obtain an API token for the GitLab API (Token Guide)     --token
2. Set the URL[s] for the GitLab project[s]                 --project-urls (project_urls in the config file)
3. Choose a PROV serialization format                       --format

gitlab2prov can be configured either by command line flags or by using a config file.
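The three steps above map onto three command line flags. The following is a minimal sketch of assembling such a call from Python, using only the flags listed in the usage text of this README. The helper name `build_command`, the project URL, and the token value are hypothetical placeholders, not part of gitlab2prov.

```python
import shlex

# Formats accepted by --format according to the usage text in this README.
SUPPORTED_FORMATS = {"json", "rdf", "xml", "provn", "dot"}


def build_command(project_urls, token, fmt="json"):
    """Assemble an argument list for a single-format gitlab2prov call.

    Hypothetical helper for illustration only.
    """
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    return [
        "gitlab2prov",
        "--project-urls", *project_urls,
        "--token", token,
        "--format", fmt,
    ]


# Placeholder URL and token; substitute real values before running.
cmd = build_command(["https://gitlab.com/example/project"], "TOKEN", "provn")
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once gitlab2prov is installed and a valid token is supplied.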

📋 Config File Example

An example of a configuration file can be found in /config/example.ini.

# This is an example of a configuration file as used by gitlab2prov.
# The configuration options match the command line flags in function.

[GITLAB]
# GitLab project URLs as a comma-separated list.
project_urls = project_a_url, project_b_url

# Gitlab personal access token.
# More about tokens and how to create them:
# https://docs.gitlab.com/ee/user/profile/personal_access_tokens.html#create-a-personal-access-token
token = token

[OUTPUT]
# Provenance serialization format.
# Supported formats: json, rdf, xml, provn, dot
format = json, rdf, xml

# File location to write provenance output to.
# Each specified format will result in a separate file.
# For example:
#     format = json, xml
#     outfile = out/example
# Creates the files:
#     out/example.json
#     out/example.xml
outfile = provout/example

[MISC]
# Enables/Disables profiling using the cProfile library.
# The runtime profile is written to a file called gitlab2prov-run-$TIMESTAMP.profile
# where $TIMESTAMP is the current time in 'YYYY-MM-DD-hh-mm-ss' format.
# The profile can be visualized using tools such as snakeviz.
profile = False

# Enables/Disables verbose output (DEBUG mode logging to stdout)
verbose = False

# Path to double agent mapping to unify duplicated agents.
double_agents = path/to/alias/mapping

# Enables/Disables agent pseudonymization by enumeration.
pseudonymous = False
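To illustrate how a file in this layout can be read, here is a minimal sketch using Python's standard `configparser` module. It is not gitlab2prov's own parsing code; the `split_list` helper and the inlined, shortened copy of the example file are assumptions made for illustration.

```python
import configparser

# Shortened copy of the example configuration shown above.
EXAMPLE = """
[GITLAB]
project_urls = project_a_url, project_b_url
token = token

[OUTPUT]
format = json, rdf, xml
outfile = provout/example

[MISC]
profile = False
verbose = False
pseudonymous = False
"""


def split_list(value):
    """Split a comma-separated option value into stripped items."""
    return [item.strip() for item in value.split(",") if item.strip()]


parser = configparser.ConfigParser()
parser.read_string(EXAMPLE)

urls = split_list(parser["GITLAB"]["project_urls"])
formats = split_list(parser["OUTPUT"]["format"])
verbose = parser["MISC"].getboolean("verbose")

print(urls, formats, verbose)
```

Reading the boolean switches with `getboolean` means values such as `False`, `no`, or `off` are all treated as disabled by `configparser`.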

🖥️ Command Line Usage ☝ Single Format Serialization

  usage: gitlab2prov [-h] -p PROJECT_URLS [PROJECT_URLS ...] -t TOKEN [-c CONFIG_FILE] [-f {json,rdf,xml,provn,dot}] [-v] [--double-agents DOUBLE_AGENTS] [--pseudonymous] [--profile] {multi-format} ...

Extract provenance information from GitLab projects.

positional arguments:
  {multi-format}
    multi-format        serialize output in multiple formats

options:
  -h, --help            show this help message and exit
  -p PROJECT_URLS [PROJECT_URLS ...], --project-urls PROJECT_URLS [PROJECT_URLS ...]
                        gitlab project urls
  -t TOKEN, --token TOKEN
                        gitlab api access token
  -c CONFIG_FILE, --config-file CONFIG_FILE
                        config file path
  -f {json,rdf,xml,provn,dot}, --format {json,rdf,xml,provn,dot}
                        provenance serialization format
  -v, --verbose         write log to stderr, set log level to DEBUG
  --double-agents DOUBLE_AGENTS
                        agent mapping file path
  --pseudonymous        pseudonymize user names by enumeration
  --profile             enable deterministic profiling, write profile to 'gitlab2prov-run-$TIMESTAMP.profile' where $TIMESTAMP is the current timestamp in 'YYYY-MM-DD-hh-mm-ss' format
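The timestamp pattern 'YYYY-MM-DD-hh-mm-ss' used in the profile file name corresponds to the `strftime` format `%Y-%m-%d-%H-%M-%S`. The sketch below reproduces that naming scheme for a fixed example time; the function name `profile_filename` is hypothetical, and the code only mirrors the documented pattern rather than gitlab2prov's internals.

```python
from datetime import datetime

# strftime equivalent of the documented 'YYYY-MM-DD-hh-mm-ss' pattern.
TIMESTAMP_FORMAT = "%Y-%m-%d-%H-%M-%S"


def profile_filename(now):
    """Build a profile file name following the documented pattern."""
    timestamp = now.strftime(TIMESTAMP_FORMAT)
    return f"gitlab2prov-run-{timestamp}.profile"


name = profile_filename(datetime(2021, 7, 1, 9, 5, 30))
print(name)  # gitlab2prov-run-2021-07-01-09-05-30.profile
```

Assuming the file is a regular cProfile dump, it can be inspected without extra tools via the standard library, for example `pstats.Stats(name).sort_stats("cumulative").print_stats(10)`.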

🖥️ Command Line Usage 🖐 Multi Format Serialization

To serialize the extracted provenance information into multiple formats in one go, use the provided multi-format mode.

usage: gitlab2prov multi-format [-h] [-f {json,rdf,xml,provn,dot} [{json,rdf,xml,provn,dot} ...]] -o OUTFILE

options:
  -h, --help            show this help message and exit
  -f {json,rdf,xml,provn,dot} [{json,rdf,xml,provn,dot} ...], --format {json,rdf,xml,provn,dot} [{json,rdf,xml,provn,dot} ...]
                        provenance serialization formats
  -o OUTFILE, --outfile OUTFILE
                        serialize to {outfile}.{format} for each specified format
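The `{outfile}.{format}` naming rule can be sketched in a few lines. The helper `output_paths` below is hypothetical and only illustrates which files to expect for a given outfile and list of formats.

```python
def output_paths(outfile, formats):
    """Return one path per format, following the {outfile}.{format} rule."""
    return [f"{outfile}.{fmt}" for fmt in formats]


paths = output_paths("out/example", ["json", "xml"])
print(paths)  # ['out/example.json', 'out/example.xml']
```

This matches the example in the configuration file, where `format = json, xml` and `outfile = out/example` yield `out/example.json` and `out/example.xml`.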

🎨 Provenance Output Formats

gitlab2prov supports the output formats that the prov library provides. The serialization options accepted by --format are json, rdf, xml, provn, and dot.
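As a rough illustration of working with the json output, the sketch below counts records per PROV type using only the standard library. It assumes the PROV-JSON layout in which top-level keys such as `entity`, `activity`, and `agent` map identifiers to records. The sample document is hand-written for this example and is not actual gitlab2prov output.

```python
import json

# Hand-written PROV-JSON sample; real output will contain many more records.
SAMPLE = """
{
  "prefix": {"ex": "http://example.org/"},
  "entity": {"ex:file": {}},
  "activity": {"ex:commit": {}},
  "agent": {"ex:alice": {}, "ex:bob": {}},
  "wasGeneratedBy": {
    "_:id1": {"prov:entity": "ex:file", "prov:activity": "ex:commit"}
  }
}
"""


def count_records(document):
    """Count records per PROV type, skipping the namespace prefix table."""
    return {
        kind: len(records)
        for kind, records in document.items()
        if kind != "prefix"
    }


counts = count_records(json.loads(SAMPLE))
print(counts)
```

For a file written by gitlab2prov, replace `json.loads(SAMPLE)` with `json.load` on the opened output file.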

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

How to cite

If you use GitLab2PROV in a scientific publication, we would appreciate citations to the following paper, given here as a BibTeX entry:

@InProceedings{SchreiberBoerKurnatowski2021,
  author    = {Andreas Schreiber and Claas de~Boer and Lynn von~Kurnatowski},
  booktitle = {13th International Workshop on Theory and Practice of Provenance (TaPP 2021)},
  title     = {{GitLab2PROV}{\textemdash}Provenance of Software Projects hosted on GitLab},
  year      = {2021},
  month     = jul,
  publisher = {{USENIX} Association},
  url       = {https://www.usenix.org/conference/tapp2021/presentation/schreiber},
}

You can also cite specific releases published on Zenodo.

References

Influential Software for gitlab2prov

  • Martin Stoffers: "Gitlab2Graph", v1.0.0, October 13, 2019, GitHub Link, DOI 10.5281/zenodo.3469385

  • Quentin Pradet: "How do you rate limit calls with aiohttp?", GitHub Gist, MIT License

Influential Papers for gitlab2prov:

Papers that refer to gitlab2prov:
