Top GitHub Users Scraper
Scrape top GitHub repositories and users based on keywords.
Usage
Get Top GitHub Repositories' URLs
from top_github_scraper import get_top_urls
get_top_urls(keyword="machine learning", stop_page=20)
After running the script above, a file named
top_repo_urls_<keyword>_<start_page>_<end_page>.json
will be saved to your current directory.
Get Top GitHub Repositories' Information
from top_github_scraper import get_top_repos
get_top_repos("machine learning", stop_page=20)
After running the script above, 2 files named
top_repo_urls_<keyword>_<start_page>_<end_page>.json
top_repo_info_<keyword>_<start_page>_<end_page>.json
will be saved to your current directory.
Get Top GitHub Users' Profiles
from top_github_scraper import get_top_users
get_top_users("machine learning", stop_page=20)
After running the script above, 3 files named
top_repo_urls_<keyword>_<start_page>_<end_page>.json
top_repo_info_<keyword>_<start_page>_<end_page>.json
top_user_info_<keyword>_<start_page>_<end_page>.csv
will be saved to your current directory.
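The output files described above can be read back with the Python standard library. The sketch below is a minimal example and not part of top-github-scraper: `find_outputs` and `load_output` are hypothetical helpers, and the code assumes only that the files keep the documented name prefixes and sit in the current directory. The structure inside the JSON files is not described here, so the loader simply returns whatever `json.load` produces.

```python
import csv
import json
from pathlib import Path


def find_outputs(prefix, suffix, directory="."):
    """Return files named <prefix>_*<suffix> in `directory`, sorted by name."""
    return sorted(Path(directory).glob(f"{prefix}_*{suffix}"))


def load_output(path):
    """Load a .json output with json.load, or a .csv output as a list of dicts."""
    path = Path(path)
    if path.suffix == ".json":
        with path.open(encoding="utf-8") as f:
            return json.load(f)
    # Any other suffix is treated as CSV; every value comes back as a string.
    with path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


if __name__ == "__main__":
    # Hypothetical usage: summarize every saved user-profile file.
    for csv_path in find_outputs("top_user_info", ".csv"):
        users = load_output(csv_path)
        print(csv_path.name, len(users), "rows")
```

Matching on the prefix rather than rebuilding the full file name avoids guessing how the keyword and page numbers are formatted in it.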
Parameters
- get_top_urls
keyword
: str Keyword to search for (e.g., machine learning)
save_path
: str, optional where to save the output file, by default "top_repo_urls"
start_page
: int, optional page number to start scraping from, by default 0
stop_page
: int, optional page number of the last page to scrape, by default 50
- get_top_repos
keyword
: str Keyword to search for (e.g., machine learning)
max_n_top_contributors
: int number of top contributors in each repository to scrape from, by default 10
start_page
: int, optional page number to start scraping from, by default 0
stop_page
: int, optional page number of the last page to scrape, by default 50
url_save_path
: str, optional where to save the output file of URLs, by default "top_repo_urls"
repo_save_path
: str, optional where to save the output file of repositories' information, by default "top_repo_info"
- get_top_users
keyword
: str Keyword to search for (e.g., machine learning)
max_n_top_contributors
: int number of top contributors in each repository to scrape from, by default 10
start_page
: int, optional page number to start scraping from, by default 0
stop_page
: int, optional page number of the last page to scrape, by default 50
url_save_path
: str, optional where to save the output file of URLs, by default "top_repo_urls"
repo_save_path
: str, optional where to save the output file of repositories' information, by default "top_repo_info"
user_save_path
: str, optional where to save the output file of users' profiles, by default "top_user_info"
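The start_page and stop_page parameters make it possible to scrape a large page range in smaller runs. The sketch below is one way to split a range into batches. `page_batches` is a hypothetical helper, not part of top-github-scraper, and it treats stop_page as inclusive, as the parameter description suggests. If the library turns out to treat stop_page as exclusive, the boundaries would need adjusting.

```python
def page_batches(start_page, stop_page, batch_size):
    """Yield (start, stop) page pairs covering start_page..stop_page inclusive."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(start_page, stop_page + 1, batch_size):
        # The last batch is clipped so it never runs past stop_page.
        yield start, min(start + batch_size - 1, stop_page)


if __name__ == "__main__":
    for start, stop in page_batches(0, 50, 10):
        print(f"pages {start}-{stop}")
        # With the package installed, each batch could be passed on, e.g.:
        # get_top_users("machine learning", start_page=start, stop_page=stop)
```

Since the output file names include the start and end pages, each batch would be expected to write its own set of files.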
How the Data is Scraped
top-github-scraper scrapes the owners and the contributors of the top repositories that appear in GitHub search results for a given keyword.
For each user, top-github-scraper scrapes 16 data points:
login
: Username
url
: URL of the user
contributions
: Number of contributions to the repository that the user is scraped from
stargazers_count
: Number of stars of the repository that the user is scraped from
forks_count
: Number of forks of the repository that the user is scraped from
type
: Whether this account is a user or an organization
name
: Name of the user
company
: User's company
location
: User's location
email
: User's email
hireable
: Whether the user is hireable
bio
: Short description of the user
public_repos
: Number of public repositories the user has (including forked repositories)
public_gists
: Number of public gists the user has (including forked gists)
followers
: Number of followers the user has
following
: Number of people the user is following
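As an illustration of how these data points might be used, the sketch below ranks hireable users by follower count. `top_hireable_users` is a hypothetical helper and the sample records are made up. The code assumes rows loaded from the CSV file as dictionaries of strings, with hireable stored as the text "True" when set. The actual encoding in the file may differ.

```python
def top_hireable_users(users, n=5):
    """Return up to n hireable users, most followed first."""
    def followers(user):
        # CSV values are strings; treat a missing or empty count as 0.
        return int(float(user.get("followers") or 0))

    hireable = [
        u for u in users
        if str(u.get("hireable")).strip().lower() == "true"
    ]
    return sorted(hireable, key=followers, reverse=True)[:n]


# Made-up records for illustration only; real rows carry all 16 fields.
sample = [
    {"login": "alice", "hireable": "True", "followers": "120"},
    {"login": "bob", "hireable": "", "followers": "900"},
    {"login": "carol", "hireable": "True", "followers": "450"},
]

print([u["login"] for u in top_hireable_users(sample)])
# prints ['carol', 'alice']
```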