Convenient access to a massive corpus of GitHub repositories

MAGI Dataset

Install

pip install magi_dataset

If you plan to use magi_dataset to crawl data periodically, set the following environment variable:

export GH_TOKEN="Your token"

See GitHub's "Creating a personal access token" guide for details on creating a GitHub personal access token. If you only use the default data and do not crawl new data, you can safely ignore this token.
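Because the token only matters when crawling, a script can check for it up front. The following is a minimal sketch, assuming only that the token is read from the GH_TOKEN environment variable as described above. The helper name has_github_token is hypothetical and is not part of magi_dataset.

```python
import os


def has_github_token(env=None):
    """Return True if a non-empty GH_TOKEN exists in the given mapping.

    Hypothetical helper for illustration; not part of magi_dataset.
    """
    env = os.environ if env is None else env
    return bool(env.get("GH_TOKEN", "").strip())


if __name__ == "__main__":
    if has_github_token():
        print("GH_TOKEN is set.")
    else:
        print("GH_TOKEN is not set: fine for the default data, but set it before crawling.")
```

Passing a plain dict instead of os.environ keeps the check easy to test without touching the real environment.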

Usage

Initialize an empty instance and collect data:

>>> from magi_dataset import GitHubDataset

>>> github_dataset = GitHubDataset(
...     empty = True
... )

>>> github_dataset.init_repos(fully_initialize=True)

Download the default data (not guaranteed to be the newest):

>>> from magi_dataset import GitHubDataset

>>> github_dataset = GitHubDataset(
...     empty = False
... )

The default data is hosted at Enoch2090/github_semantic_search on Hugging Face. We will update the data periodically.

After the dataset is created, access the data either by numeric index:

>>> github_dataset[5]
GitHubRepo(name='ytdl-org/youtube-dl', stars=114798, description='Command-line program to download videos from YouTube.com and other video sites', _fully_initialized=True)

Or by full name:

>>> github_dataset['ytdl-org/youtube-dl']
GitHubRepo(name='ytdl-org/youtube-dl', stars=114798, description='Command-line program to download videos from YouTube.com and other video sites', _fully_initialized=True)
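Both lookups above return the same repository. The sketch below shows one common way a container can accept either key type. It is a toy illustration under stated assumptions, not the actual GitHubDataset implementation: Repo and RepoCollection are hypothetical names, and "example/placeholder" is dummy data.

```python
from dataclasses import dataclass


@dataclass
class Repo:
    # Hypothetical stand-in for GitHubRepo, reduced to two fields.
    name: str
    stars: int


class RepoCollection:
    """Toy container accepting an int position or an 'owner/repo' name."""

    def __init__(self, repos):
        self._repos = list(repos)
        # Map each full name to its position so name lookups stay O(1).
        self._index = {repo.name: i for i, repo in enumerate(self._repos)}

    def __getitem__(self, key):
        if isinstance(key, int):
            return self._repos[key]
        if isinstance(key, str):
            return self._repos[self._index[key]]
        raise TypeError(f"unsupported key type: {type(key).__name__}")


repos = RepoCollection([
    Repo("example/placeholder", 0),       # dummy entry
    Repo("ytdl-org/youtube-dl", 114798),  # values taken from the output above
])

# Position 1 and the full name resolve to the same object.
same = repos[1] is repos["ytdl-org/youtube-dl"]
```

In this sketch an unknown name raises KeyError and an out-of-range position raises IndexError. How GitHubDataset itself behaves in those cases is not stated here.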

The corpus itself is available through the readme and hn_comments attributes of GitHubRepo objects:

>>> github_dataset[5].readme[0:100]
'[![Build Status](https://github.com/ytdl-org/youtube-dl/workflows/CI/badge.svg)](https'
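To gather text from several repositories, the two attributes can be combined per repository. This is a minimal sketch, assuming both readme and hn_comments are strings (only readme is shown as a string above). The function build_corpus is hypothetical, and SimpleNamespace objects stand in for GitHubRepo so the example runs without downloading anything.

```python
from types import SimpleNamespace


def build_corpus(repos):
    """Map each repository name to its readme and hn_comments text joined together.

    Assumes both attributes are strings; missing or empty values are skipped.
    """
    corpus = {}
    for repo in repos:
        parts = [
            getattr(repo, "readme", "") or "",
            getattr(repo, "hn_comments", "") or "",
        ]
        corpus[repo.name] = "\n\n".join(part for part in parts if part)
    return corpus


# Stand-in objects with dummy data; real code would pass GitHubRepo objects.
demo = [
    SimpleNamespace(name="example/a", readme="# A", hn_comments="Nice tool."),
    SimpleNamespace(name="example/b", readme="# B", hn_comments=""),
]
corpus = build_corpus(demo)
```

With a loaded dataset, a list such as [github_dataset[i] for i in range(10)] could be passed instead, since access by numeric index is shown above.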


Source Distributions

No source distribution files available for this release.

Built Distribution

magi_dataset-1.0.0-py3-none-any.whl (9.6 kB), Python 3
