# bigcode-fetcher

A utility to search and fetch code from GitHub. This tool was built to make it easy to create datasets for repository analysis.

The tool works in two phases: `search` finds repositories using the GitHub API and saves the result in a JSON file; `download` fetches all the repositories listed in that JSON file.

## Install

This tool can be installed by running

` pip install bigcode-fetcher `

or by fetching this repository and running

` pip install . `

in this directory.

## Usage

### search command

By default, the utility searches for repositories that satisfy the following conditions:

  • size between 1M and 100M

  • stars count > 10

  • non-viral license (MIT, Apache-2.0, MPL-2.0, BSD-2-Clause, BSD-3-Clause, BSD-4-Clause, MS-PL)

and retrieves the first 100 projects, ordered by number of stars.
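These defaults map naturally onto the qualifiers of GitHub's repository search (`size`, `stars`, `license`) and onto the `sort`, `order` and `per_page` parameters of the search endpoint. The sketch below only illustrates that mapping and is not the tool's actual implementation: `build_query` and `search_params` are hypothetical names, "1M" is assumed to mean 1000 KB (the `size` qualifier is expressed in kilobytes), and one query is built per license rather than guessing how the tool combines them.

```python
# Hypothetical helpers: these are not part of bigcode-fetcher's API.
DEFAULT_LICENSES = [
    "MIT", "Apache-2.0", "MPL-2.0", "BSD-2-Clause",
    "BSD-3-Clause", "BSD-4-Clause", "MS-PL",
]


def build_query(license_id, min_size_kb=1000, max_size_kb=100000, min_stars=10):
    """Return a GitHub search query string for a single license."""
    qualifiers = [
        f"size:{min_size_kb}..{max_size_kb}",  # size in KB (assumed units)
        f"stars:>{min_stars}",
        f"license:{license_id.lower()}",  # license keywords are lowercase
    ]
    return " ".join(qualifiers)


def search_params(query, max_repos=100):
    """Return query-string parameters for GET /search/repositories."""
    return {
        "q": query,
        "sort": "stars",   # most-starred repositories first
        "order": "desc",
        "per_page": min(max_repos, 100),  # the API caps a page at 100 items
    }


queries = [build_query(license_id) for license_id in DEFAULT_LICENSES]
params = search_params(queries[0])
print(params["q"])
```

Fetching more than 100 repositories would require following the API's pagination, which this sketch leaves out.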

To avoid API rate limiting, an access token can be provided either with the `--token` CLI argument or with the `GITHUB_TOKEN` environment variable.
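As a rough illustration of how the two token sources could be combined, here is a minimal sketch. The function names are hypothetical, and the precedence (CLI value first, then the environment) is an assumption, not something this README states.

```python
import os


def resolve_token(cli_token=None, environ=None):
    """Return the token from the CLI value, else from GITHUB_TOKEN, else None."""
    environ = os.environ if environ is None else environ
    # Assumed precedence: an explicit CLI value wins over the environment.
    return cli_token or environ.get("GITHUB_TOKEN") or None


def auth_headers(token):
    """Build HTTP headers; without a token the request stays unauthenticated."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers


headers = auth_headers(resolve_token(environ={"GITHUB_TOKEN": "example-token"}))
print(sorted(headers))
```

Unauthenticated requests still work but are subject to much lower rate limits on the GitHub API.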

See the help for the full list of options:

` bigcode-fetcher search -h `

#### Example

Search for all Apache Commons projects written in Java:

` mkdir -p apache-common-projects `

` bigcode-fetcher search --language Java --user apache --stars '>0' --keyword commons --max-repos 500 -o apache-common-projects/apache-commons.json `

### download command

This command simply runs `git clone` on every repository listed in the JSON file generated by the search command.

To reduce the download size, only the latest revision is fetched by default (i.e. `git clone --depth 1`). This can be disabled by passing the `--full` flag.
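The difference between the default shallow clone and a full clone comes down to one pair of `git clone` arguments. The following sketch shows how such a command line might be assembled; `clone_args` is a hypothetical helper and the example URL is only illustrative.

```python
def clone_args(url, target, full=False):
    """Return the argument list for cloning `url` into `target`."""
    args = ["git", "clone"]
    if not full:
        # Shallow clone: fetch only the most recent revision.
        args += ["--depth", "1"]
    args += [url, target]
    return args


shallow = clone_args("https://github.com/apache/commons-io.git", "out/apache/commons-io")
print(" ".join(shallow))
```

The resulting list can be passed to `subprocess.run(args, check=True)`; building a list instead of a shell string avoids quoting problems with unusual paths.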

`USERNAME/REPO` will be fetched into `OUTPUT_DIR/USERNAME/REPO`, where `OUTPUT_DIR` is set by the `--output` option.

The command skips any project whose directory already exists, so running it several times is safe, and doing so is recommended to make sure that all repositories have been fetched.
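The directory layout and the skip-if-present behaviour can be pictured with a short sketch. It is a simplified model under stated assumptions: `target_dir` and `pending` are hypothetical names, and the existence check is assumed to be a plain directory test.

```python
import os
import tempfile


def target_dir(output_dir, full_name):
    """Map 'USERNAME/REPO' to OUTPUT_DIR/USERNAME/REPO."""
    user, repo = full_name.split("/", 1)
    return os.path.join(output_dir, user, repo)


def pending(output_dir, full_names):
    """Return the repositories whose target directory does not exist yet."""
    return [
        name for name in full_names
        if not os.path.isdir(target_dir(output_dir, name))
    ]


# Demo: one repository is already on disk, so only the other one remains.
with tempfile.TemporaryDirectory() as out:
    os.makedirs(target_dir(out, "apache/commons-io"))
    todo = pending(out, ["apache/commons-io", "apache/commons-lang"])
    print(todo)
```

Note that under this model a clone interrupted halfway leaves its directory behind and would be skipped on the next run, so such a directory would have to be removed by hand before retrying.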

See the help for more information:

` bigcode-fetcher download -h `

#### Example

Download all the Apache Commons projects found by the search above:

` mkdir -p apache-common-projects/repositories `

` bigcode-fetcher download -i apache-common-projects/apache-commons.json -o apache-common-projects/repositories `
