Skip to main content

Github mining tool for MSR research

Project description

A library and toolkit for MSR research.

Mining software repository has been a popular research method for quite long time. Although github offers convenient public REST and GraphQL API, collecting large scale dataset with long history of information such as repository, author, bot, issues, pull request, comment is still a non-trivial task. There are three major challenges to be solved in order to retrieve large search results from github:

  • 1000-limit issue: github API discards records beyond 1000 in the result set of a particular query.

  • rate-limit issue: github API prevents authenticated personal accounts from invoking API more than 5000 times per hour.

  • pagination: User has to issue multiple API calls to retrieve the complete query results over 100 records.

When the client exceeds the rate limit, it is disconnected with HTTP status code 503. Without proper recover handling, data collection process is subject to frequent interruptions.

This library and assoicated scripts are intended to help solve the three challenges so that you can focus on the data mining rather than data collection.

Requirements

  • Python 3.7 over

Features

  • Search Github repositories based on stars, fork, language and topic

  • Search a large number of repositories by dividing creation time into small time window

  • Support multiple topics with OR relation

  • Build dataset in .csv and .parquet format

  • Retrieve commit, issue comments

  • Golang miner with go.mod retrieval and parsing

Setup

$ python -m venv /path/to/venv
$ /path/to/venv/bin/python -m pip install ghminer

Usage

To identify repositories for your MSR research, please refer to the script identify-repos.py. To retrieve commits, use the script retrieve-commits.py. To mine golang projects, use the script golang-miner.

>>> from ghminer.retriever import collect_data
>>> collect_data(
        2022, 2023, None, True, 100, 15,
        "repo.d", "java", trace=trace
    )

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ghminer-0.1.8.tar.gz (24.7 kB view hashes)

Uploaded Source

Built Distribution

ghminer-0.1.8-py3-none-any.whl (17.4 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page