Convenient access to a massive corpus of GitHub repositories
MAGI Dataset
Install
pip install magi_dataset
If you plan to use magi_dataset to crawl data periodically, set the following variable in your environment:
export GH_TOKEN="Your token"
See "Creating a personal access token" in the GitHub documentation for details on creating a GitHub personal access token. If you only use the default data and do not crawl new data, you may safely ignore this token.
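As a minimal sketch of the check described above, the snippet below looks for GH_TOKEN in the environment before any crawling starts. The helper name get_github_token is hypothetical and not part of magi_dataset. This README does not show how the library itself reads the token.

```python
import os
from typing import Mapping, Optional


def get_github_token(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the GH_TOKEN value, or None if it is unset or blank.

    Hypothetical helper for illustration only; not part of magi_dataset.
    """
    # Fall back to the real process environment when no mapping is passed.
    env = os.environ if env is None else env
    token = env.get("GH_TOKEN", "").strip()
    return token or None


if __name__ == "__main__":
    if get_github_token() is None:
        print("GH_TOKEN is not set; fine for the default data, but set it before crawling.")
    else:
        print("GH_TOKEN found.")
```

Passing a mapping explicitly makes the helper easy to exercise without touching the real environment.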
Usage
Initialize an empty instance and collect data:
>>> from magi_dataset import GitHubDataset
>>> github_dataset = GitHubDataset(
... empty = True
... )
>>> github_dataset.init_repos(fully_initialize=True)
Download the default data (not guaranteed to be the newest):
>>> from magi_dataset import GitHubDataset
>>> github_dataset = GitHubDataset(
... empty = False
... )
The default data may be found at Enoch2090/github_semantic_search on HuggingFace. We will update the data periodically.
After the dataset is created, access a repository either by its integer index:
>>> github_dataset[5]
GitHubRepo(name='ytdl-org/youtube-dl', stars=114798, description='Command-line program to download videos from YouTube.com and other video sites', _fully_initialized=True)
Or the full name:
>>> github_dataset['ytdl-org/youtube-dl']
GitHubRepo(name='ytdl-org/youtube-dl', stars=114798, description='Command-line program to download videos from YouTube.com and other video sites', _fully_initialized=True)
You can access the corpus through the readme and hn_comments attributes of GitHubRepo objects:
>>> github_dataset[5].readme[0:100]
'[![Build Status](https://github.com/ytdl-org/youtube-dl/workflows/CI/badge.svg)](https'
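The two lookup styles above (integer index and full repository name) can be pictured with a small stand-in container. This is only a sketch under stated assumptions, not the library's implementation. The classes Repo and RepoCollection are hypothetical, the field types (in particular hn_comments as a string) are assumed, and the readme text is a placeholder.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Repo:
    """Hypothetical stand-in for GitHubRepo with the fields shown in this README."""
    name: str
    stars: int
    description: str
    readme: str = ""
    hn_comments: str = ""  # assumed to be text; the real type is not shown here


class RepoCollection:
    """Hypothetical container supporting lookup by position or by full name."""

    def __init__(self, repos: List[Repo]) -> None:
        self._repos = list(repos)
        # Map "owner/name" to a list position for direct name lookup.
        self._by_name: Dict[str, int] = {
            repo.name: index for index, repo in enumerate(self._repos)
        }

    def __len__(self) -> int:
        return len(self._repos)

    def __getitem__(self, key: Union[int, str]) -> Repo:
        if isinstance(key, str):
            return self._repos[self._by_name[key]]
        return self._repos[key]


repos = RepoCollection([
    Repo(
        name="ytdl-org/youtube-dl",
        stars=114798,
        description="Command-line program to download videos from YouTube.com and other video sites",
        readme="placeholder readme text",
    ),
])

# Both lookups return the same object.
same = repos[0] is repos["ytdl-org/youtube-dl"]
# Slicing the readme works like slicing any string.
preview = repos[0].readme[0:11]
```

Keeping a name-to-position dictionary next to the list is one simple way to support both access patterns without scanning the list on every name lookup.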
Future Work
- The current idle handler design is primitive; we will switch to async pipelines to reduce the time the CPU spends sleeping.
- Elasticsearch database builder
- Pinecone database builder (wrapper only)
- Hash verification of persistence files
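For the last item, hash verification of persistence files is not yet implemented. The following is only a sketch of what such a check might look like, using SHA-256 from the Python standard library. The function names sha256_of_file and verify_file are hypothetical, and the choice of SHA-256 is an assumption.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large files are never loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: Path, expected_hex: str) -> bool:
    """Return True if the file's SHA-256 digest matches the expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A caller would store the expected digest alongside the persistence file and refuse to load the file when verify_file returns False.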