GitHub mining tool for MSR research

Project description

A library and toolkit for MSR research.

Mining software repositories (MSR) has been a popular research method for a long time. Although GitHub offers convenient public REST and GraphQL APIs, collecting a large-scale dataset with a long history of information such as repositories, authors, bots, issues, pull requests, and comments is still a non-trivial task. Three major challenges must be solved in order to retrieve large search results from GitHub:

  • 1000-limit issue: the GitHub API discards records beyond the first 1000 in the result set of any single query.

  • rate-limit issue: the GitHub API prevents authenticated personal accounts from making more than 5000 API calls per hour.

  • pagination: the user has to issue multiple API calls to retrieve complete query results of more than 100 records.
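
The pagination challenge can be illustrated with a minimal sketch. It assumes the GitHub REST convention of advertising the next page in a `Link` response header tagged `rel="next"`. The names `next_page_url`, `iter_records`, and the `get_page` callable are hypothetical helpers for illustration and are not part of ghminer.

```python
import re
from typing import Callable, Iterator, List, Optional, Tuple

# Matches one entry of a Link header, e.g. <https://...&page=2>; rel="next"
_LINK_RE = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def next_page_url(link_header: Optional[str]) -> Optional[str]:
    """Return the URL tagged rel="next" in a Link header, or None if absent."""
    if not link_header:
        return None
    for url, rel in _LINK_RE.findall(link_header):
        if rel == "next":
            return url
    return None


def iter_records(
    first_url: str,
    get_page: Callable[[str], Tuple[List[object], Optional[str]]],
) -> Iterator[object]:
    """Yield every record by following rel="next" links until none is left.

    get_page is a hypothetical callable that performs one HTTP GET and
    returns (records_on_this_page, link_header_value).
    """
    url: Optional[str] = first_url
    while url:
        records, link_header = get_page(url)
        yield from records
        url = next_page_url(link_header)
```

Passing the HTTP call in as `get_page` keeps the loop independent of any particular HTTP client, so it can be exercised with canned pages before touching the network.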

When the client exceeds the rate limit, it is disconnected with HTTP status code 503. Without proper recovery handling, the data collection process is subject to frequent interruptions.
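
One possible recovery strategy is sketched below; it is not necessarily how ghminer handles the problem. It assumes the response headers are available as a mapping with lower-case keys and relies on the `retry-after`, `x-ratelimit-remaining`, and `x-ratelimit-reset` headers that GitHub sends. The set of retried status codes is also an assumption: 503 is the code reported above, while GitHub's documentation lists 403 and 429 for rate limiting. `seconds_until_retry`, `call_with_retry`, and the `send` callable are hypothetical names.

```python
import time
from typing import Callable, Mapping, Tuple

# Assumption: statuses worth retrying. 503 is the code mentioned in the text;
# 403 and 429 are the codes GitHub documents for rate limiting.
RETRY_STATUSES = frozenset({403, 429, 503})


def seconds_until_retry(
    headers: Mapping[str, str], now: float, default: float = 60.0
) -> float:
    """Estimate how long to wait before retrying, based on response headers."""
    if "retry-after" in headers:
        return max(0.0, float(headers["retry-after"]))
    if headers.get("x-ratelimit-remaining") == "0" and "x-ratelimit-reset" in headers:
        # x-ratelimit-reset is an epoch timestamp in seconds.
        return max(0.0, float(headers["x-ratelimit-reset"]) - now)
    return default


def call_with_retry(
    send: Callable[[], Tuple[int, Mapping[str, str], object]],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.time,
) -> Tuple[int, object]:
    """Call send() until it returns a non-retryable status or attempts run out.

    send is a hypothetical callable returning (status, headers, body).
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status not in RETRY_STATUSES:
            return status, body
        if attempt < max_attempts - 1:
            sleep(seconds_until_retry(headers, clock()))
    return status, body
```

Because a 403 can also signal a genuine permission error, the attempt cap keeps the loop from waiting forever on a request that will never succeed.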

This library and its associated scripts are intended to help solve these three challenges so that you can focus on data mining rather than data collection.

Requirements

  • Python 3.7 or later

Features

  • Search GitHub repositories based on stars, forks, language, and topic

  • Search a large number of repositories by dividing the creation time range into small time windows

  • Support multiple topics with an OR relation

  • Build datasets in .csv and .parquet formats

  • Retrieve commits and issue comments

  • Golang miner with go.mod retrieval and parsing
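
The time-window feature listed above is what works around the 1000-record limit: each window becomes its own search query, small enough to stay under the cap. The following is a minimal sketch of that idea, not ghminer's actual implementation. It uses GitHub's `created:START..END`, `language:`, and `stars:>=N` search qualifiers; `date_windows` and `search_query` are hypothetical helper names.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple


def date_windows(start: date, end: date, days: int) -> Iterator[Tuple[date, date]]:
    """Yield consecutive inclusive (first_day, last_day) windows covering start..end."""
    if days < 1:
        raise ValueError("days must be at least 1")
    first = start
    while first <= end:
        # The last window is clipped so it never runs past the end date.
        last = min(first + timedelta(days=days - 1), end)
        yield first, last
        first = last + timedelta(days=1)


def search_query(first: date, last: date, language: str, min_stars: int) -> str:
    """Build a repository search query restricted to one creation-date window."""
    return (
        f"language:{language} stars:>={min_stars} "
        f"created:{first.isoformat()}..{last.isoformat()}"
    )


if __name__ == "__main__":
    for first, last in date_windows(date(2022, 1, 1), date(2022, 1, 31), 15):
        print(search_query(first, last, "java", 100))
```

If a single window still matches more than 1000 repositories, the same function can be applied again to that window with a smaller `days` value.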

Setup

$ python -m venv /path/to/venv
$ /path/to/venv/bin/python -m pip install ghminer

Usage

To identify repositories for your MSR research, please refer to the script identify-repos.py. To retrieve commits, use the script retrieve-commits.py. To mine Golang projects, use the script golang-miner.

>>> from ghminer.retriever import collect_data
>>> collect_data(
        2022, 2023, None, True, 100, 15,
        "repo.d", "java", trace=trace
    )

Download files

Source Distribution

ghminer-0.1.8.tar.gz (24.7 kB view details)

Uploaded Source

Built Distribution

ghminer-0.1.8-py3-none-any.whl (17.4 kB view details)

Uploaded Python 3

File details

Details for the file ghminer-0.1.8.tar.gz.

File metadata

  • Download URL: ghminer-0.1.8.tar.gz
  • Size: 24.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for ghminer-0.1.8.tar.gz
Algorithm Hash digest
SHA256 f62fce3b4f3c37cb55bd2b078aceb9f4b4badfa47977e13986863b74de4755ab
MD5 8a6077dd444c88999f4079deb33b7121
BLAKE2b-256 96e0b1b1322d35f63728c86dae9b0e9d7a7093db24793cca035eac1890bcf425

File details

Details for the file ghminer-0.1.8-py3-none-any.whl.

File metadata

  • Download URL: ghminer-0.1.8-py3-none-any.whl
  • Size: 17.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for ghminer-0.1.8-py3-none-any.whl
Algorithm Hash digest
SHA256 33c83ec4da10dbc87a7b9b6146bef1ccadea925b3d80b4bb182bf7cacad83a7f
MD5 bf27bd6daca76606749ba985081f6163
BLAKE2b-256 65e8cd968b36cde319999b5c31d9f2ccadd653966333b78137d37174d6fca43a
