GitLab Data Export

Export GitLab data to external data stores

gitlab-data-export is a simple utility that exports data from GitLab's REST API into arbitrary data stores.

Basics

gitlab-data-export is designed to be run directly within GitLab itself using scheduled CI pipelines. Each time the tool is invoked, it searches for a checkpoint persisted by a previous invocation to determine the subset of data that should be exported from the API; if no checkpoint is found, a historical backfill is run over all GitLab resources to seed the data store. The checkpoint is stored in the data store, keyed by the CHECKPOINT_JOB_ID variable.
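
A minimal sketch of that flow, using hypothetical helper names and a toy in-memory store in place of the tool's real internals:

import os
from datetime import datetime, timezone

# Toy in-memory "store"; the real tool persists checkpoints in the
# configured data store, keyed by CHECKPOINT_JOB_ID.
checkpoints: dict[str, datetime] = {}

def run_export(job_id: str) -> None:
    checkpoint = checkpoints.get(job_id)
    if checkpoint is None:
        print("no checkpoint found: backfilling all GitLab history")
    else:
        print(f"checkpoint found: exporting data updated after {checkpoint}")
    checkpoints[job_id] = datetime.now(timezone.utc)  # persist for the next run

run_export(os.environ.get("CHECKPOINT_JOB_ID", "gitlab_export"))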

Generally, all data that GitLab provides from its API should be available in the exported data.

Data can be exported from GitLab groups or projects. By default, exporting a group also recursively exports all data visible from that group's projects and from all of its subgroups. This can comprise a large amount of data; recursion can be disabled with the optional GITLAB_NO_RECURSION configuration described below.
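
For intuition, the recursive traversal looks roughly like the sketch below, written against the python-gitlab client; the group path and token are placeholders, and gitlab-data-export's actual internals may differ:

import gitlab

def walk_group(gl: gitlab.Gitlab, group_id) -> None:
    """Visit a group's projects, then recurse into each subgroup."""
    group = gl.groups.get(group_id)
    for project in group.projects.list(iterator=True):
        print("would export:", project.path_with_namespace)
    for subgroup in group.subgroups.list(iterator=True):
        walk_group(gl, subgroup.id)

gl = gitlab.Gitlab("https://gitlab.com", private_token="<GITLAB_TOKEN>")
walk_group(gl, "my-group/my-subgroup")  # hypothetical fully-qualified path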

Quickstart

By default, gitlab-data-export will export data to BigQuery. This requires a properly configured GCP project as described in Google's official documentation. Once the project is configured, exporting data is straightforward:

  1. Create a GitLab access token with read_api permissions to the resources that you wish to export
  2. Create a service account keyfile with BigQuery Data Editor permissions in your BigQuery-enabled GCP project
  3. Create a GitLab project to run the CI pipeline
    • You can also use an existing project - only a single CI job definition is required
  4. Set the required environment variables in the GitLab project's CI/CD settings
  5. Add the following stanza to the GitLab project's CI configuration:
gitlab_export:
  image: python:3.11
  interruptible: true
  needs: []  # no dependencies; runs independently of other jobs
  variables:
    CHECKPOINT_JOB_ID: $CI_JOB_NAME  # checkpoint key; see Configuration below
  before_script:
    - python3.11 -m pip install gitlab_data_export --upgrade
  script:
    - python3.11 -m gitlab_data_export.driver
  rules:
    # allow manual test runs from merge request pipelines
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
      when: manual
      allow_failure: true
    # run automatically on the configured pipeline schedule
    - if: $CI_PIPELINE_SOURCE == "schedule"
  6. Optionally, run the pipeline manually to ensure that everything is working properly
    • Important: the first export invocation will export ALL history visible from the GitLab resource that you've configured. This could comprise a large amount of data and will incur BigQuery costs. If you do not want a historical backfill to be performed, see the optional GITLAB_BACKFILL_TIMESTAMP configuration for details on how to disable this behavior
  7. Merge the CI changes

The pipeline should now run according to your chosen frequency and incrementally export data each time the utility is invoked. Some query examples specific to the BigQuery storage backend can be found in the BigQuery section below.
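
If you prefer to manage the schedule programmatically rather than through the GitLab UI, a pipeline schedule can also be created with the python-gitlab client; the project path and cron expression below are placeholders:

import gitlab

gl = gitlab.Gitlab("https://gitlab.com", private_token="<GITLAB_TOKEN>")
project = gl.projects.get("my-group/export-runner")  # hypothetical project path
project.pipelineschedules.create({
    "ref": "main",                                # branch containing the CI config
    "description": "Daily gitlab-data-export run",
    "cron": "0 0 * * *",                          # once per day at midnight
})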

Configuration

gitlab-data-export supports a number of configuration options. All configuration is managed via environment variables.

Required

  • GitLab authentication: either GITLAB_TOKEN or GITLAB_OAUTH must be set
    • GITLAB_TOKEN, if set, should contain a GitLab access token
    • GITLAB_OAUTH, if set, should contain GitLab OAuth credentials
    • If both are set, GITLAB_TOKEN will take precedence
  • GitLab root resource: either GITLAB_GROUP or GITLAB_PROJECT must be set
    • GITLAB_GROUP, if set, should contain a fully-qualified path to a GitLab group
    • GITLAB_PROJECT, if set, should contain a fully-qualified path to a GitLab project
  • Checkpoint job id: CHECKPOINT_JOB_ID must be set
    • This should be unique per CI job definition
    • It is used as the identifier under which the job's checkpoint is stored
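
To make these rules concrete, a pre-flight check equivalent to the requirements above might look like this (illustrative only; the driver performs its own validation):

import os

# GITLAB_TOKEN wins when both authentication variables are set.
auth = os.environ.get("GITLAB_TOKEN") or os.environ.get("GITLAB_OAUTH")
root = os.environ.get("GITLAB_GROUP") or os.environ.get("GITLAB_PROJECT")
job_id = os.environ.get("CHECKPOINT_JOB_ID")

for name, value in [("GitLab authentication", auth),
                    ("GitLab root resource", root),
                    ("CHECKPOINT_JOB_ID", job_id)]:
    if not value:
        raise SystemExit(f"missing required configuration: {name}")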

Optional

  • GITLAB_URL: override the GitLab instance from which data should be exported
    • Defaults to https://gitlab.com if unset
  • GITLAB_BACKFILL_TIMESTAMP: if set, overrides the default historical backfill behavior by limiting the updated_after time of exported GitLab resources
    • Accepted values are either now or a timestamp in %Y-%m-%d %H:%M:%S format
    • now disables backfill completely, while a timestamp limits backfill to resources updated after that time
  • GITLAB_NO_RECURSION: if set (to any value), completely disables group recursion
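
A sketch of how the backfill values are interpreted, assuming the semantics described above:

import os
from datetime import datetime, timezone

raw = os.environ.get("GITLAB_BACKFILL_TIMESTAMP")
if raw is None:
    updated_after = None  # default: backfill all history on the first run
elif raw == "now":
    updated_after = datetime.now(timezone.utc)  # disables backfill completely
else:
    # e.g. "2023-01-01 00:00:00" limits backfill to resources updated after it
    updated_after = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")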

Data storage design

Data is treated as append-only. Once written, records are never modified or deleted. In addition to the data retrieved from the API, every record is enriched with a data_date field representing the time at which the record was loaded into the data store; users can rely on this field in analytical queries.

The "latest" state for each resource can always be found by grouping by id and selecting the greatest updated_at. Because all written records contain all information returned by the API, this design provides easy access to the latest state of each record while also providing a basic form of historical "snapshots".

For more granular history, users need only run gitlab-data-export more frequently. Naturally, more granular history also implies more data duplication - choose your export frequency accordingly. For most use cases, a daily export should be sufficient.

Supported data stores

Each supported data store may carry its own bespoke requirements and configurations. These details are described for each implemented store.

BigQuery

The BigQuery backend requires the following configurations:

  • Authentication: either GOOGLE_APPLICATION_CREDENTIALS or BIGQUERY_SERVICE_ACCOUNT_JSON must be set
  • BIGQUERY_PROJECT: the globally unique project id for the BigQuery-enabled GCP project to use for data storage

The following optional configurations are also supported:

  • BIGQUERY_DATASET: can be set to override the dataset name that will be populated with GitLab data
    • Defaults to gitlab_data_export if unset
    • Note that the dataset name will always have the GitLab API version appended to it for future compatibility purposes
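
As an example of the queries referenced in the quickstart, the "latest state" pattern from the data storage design section can be run with the google-cloud-bigquery client; the project, dataset, and table names below are illustrative and should be adjusted to your configuration:

from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # your BIGQUERY_PROJECT

# Latest state of each issue: the newest updated_at per id.
query = """
SELECT * EXCEPT (row_num)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS row_num
  FROM `my-gcp-project.gitlab_data_export_v4.issues`  -- hypothetical table
)
WHERE row_num = 1
"""
for row in client.query(query).result():
    print(row["id"], row["updated_at"])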
