Parallelized web scraper for GitHub

git-pull

git-pull is a web scraper for GitHub. You can use it to scrape (or, if you will, pull) data from a GitHub profile, repo, or file. It's parallelized and designed for anyone who wants to avoid the GitHub API, e.g. because of its rate limit. Using it is simple:

from git_pull import GithubProfile

gh = GithubProfile("shobrook")
gh.scrape_follower_count() # >>> 168

Note that git-pull is not a perfect replacement for the GitHub API. There are some things it can't scrape (yet), like a repo's commit history or release count.

Installation

You can install git-pull with pip:

$ pip install git-pull

Usage

git-pull provides three objects: GithubProfile, Repo, and File. Each has methods for scraping data. Below are descriptions and usage examples for each object.

GithubProfile(username, num_threads=cpu_count(), scrape_everything=False)

This is the top-level object for scraping data from a GitHub profile. All it requires is the user's GitHub username, and from there you can scrape social info for that user and their repos.

Parameters:

  • username (str): GitHub username
  • num_threads (int, optional, default=multiprocessing.cpu_count()): Number of threads to allocate for splitting up the scraping work; defaults to the number of cores in your machine's CPU
  • scrape_everything (bool, optional, default=False): If True, performs a "deep scrape" that collects all social info and repo data for the user (i.e. calls every scraper method listed below and stores the results in properties of the object); if False, you have to call individual scraper methods to get the data you want

Methods:

  • scrape_name() -> str: Returns the name of the Github user
  • scrape_avatar() -> str: Returns a URL for the user's profile picture
  • scrape_follower_count() -> int: Returns the number of followers the user has
  • scrape_contribution_graph() -> dict: Returns the contribution history for the user as a map of dates (as strings) to commit counts
  • scrape_location() -> str: Returns the user's location, if available
  • scrape_personal_site() -> str: Returns the URL of the user's website, if available
  • scrape_workplace() -> str: Returns the name of the user's workplace, if available
  • scrape_repos(scrape_everything=False) -> list: Returns a list of Repo objects, one for each of the user's repos (both source and forked); if scrape_everything=True, a "deep scrape" is performed on each repo
  • scrape_repo(repo_name, scrape_everything=False) -> Repo: Returns a single Repo object for a given repo that the user owns

Example:

from git_pull import GithubProfile

# If scrape_everything=True, then all scraped data is stored in object
# properties
gh = GithubProfile("shobrook", scrape_everything=True)
gh.name # >>> "Jonathan Shobrook"
gh.avatar # >>> "https://avatars1.githubusercontent.com/u/18684735?s=460&u=60f797085eb69d8bba4aba80078ad29bce78551a&v=4"
gh.repos # >>> [Repo("git-pull"), Repo("saplings"), ...]

# If scrape_everything=False, individual scraper methods have to be called, each
# of which both returns the scraped data and stores it in the object properties
gh = GithubProfile("shobrook", scrape_everything=False)
gh.name # >>> ''
gh.scrape_name() # >>> "Jonathan Shobrook"
gh.name # >>> "Jonathan Shobrook"
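
As another illustration, you can call individual scraper methods to pull a single repo or the user's contribution history. A minimal sketch (the return values and the date-key format below are illustrative, not real output):

from git_pull import GithubProfile

gh = GithubProfile("shobrook")

# Scrape one repo without doing a deep scrape of the whole profile
repo = gh.scrape_repo("git-pull") # >>> Repo("git-pull")

# The contribution graph is a dict mapping date strings to commit counts
graph = gh.scrape_contribution_graph()
graph["2020-07-04"] # >>> 3 (illustrative)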

Repo(name, owner, num_threads=cpu_count(), scrape_everything=False)

Use this object for scraping data from a GitHub repo.

Parameters:

  • name (str): Name of the repo to be scraped
  • owner (str): Username of the owner of the repo
  • num_threads (int, optional, default=multiprocessing.cpu_count()): Number of threads to allocate for splitting up the scraping work; defaults to the number of cores in your machine's CPU
  • scrape_everything (bool, optional, default=False): If True, scrapes all metadata for the repo as well as its files; if False, you have to call individual scraper methods to get the data you want

Methods:

  • scrape_topics() -> list: Returns list of topics/tags for the repo
  • scrape_star_count() -> int: Returns number of stars the repo has
  • scrape_fork_count() -> int: Returns number of times the repo has been forked
  • scrape_fork_status() -> bool: Returns whether or not the repo is a fork of another one
  • scrape_files(scrape_everything=False) -> list: Returns a list of File objects, each representing a file in the repo; files that aren't programs or documentation files (e.g. boilerplate) are not scraped
  • scrape_file(file_path, file_type=None, scrape_everything=False) -> File: Returns a File object given a file path

Example:

from git_pull import Repo

repo = Repo("git-pull", "shobrook", scrape_everything=True)
repo.topics # >>> ["web-scraper", "github", "github-api", "parallel", "scraper"]
repo.fork_status # >>> False
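
Per-file scraping works the same way. A minimal sketch (return values are illustrative):

from git_pull import Repo

repo = Repo("git-pull", "shobrook")
repo.scrape_star_count() # >>> 121 (illustrative)

# Scrape every program and documentation file in the repo...
files = repo.scrape_files() # >>> [File("git_pull/git_pull.py"), ...]

# ...or just one file, given its path
file = repo.scrape_file("git_pull/git_pull.py")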

File(path, repo, owner, scrape_everything=False)

Use this object for scraping data from a single file inside a GitHub repo.

Parameters:

  • path (str): Full path of the file from the repo's root directory
  • repo (str): Name of the repo containing the file
  • owner (str): Username of the repo's owner
  • scrape_everything (bool, optional, default=False): If True, scrapes the blame history and type of the file (i.e. calls the methods listed below)

Methods:

  • scrape_blames() -> dict: Returns the blame history for the file as a map of usernames (i.e. contributors) to {"line_nums": [1, 2, ...], "committers": [...]} dictionaries, where "line_nums" lists the line numbers the user wrote and "committers" lists the usernames of any contributors the user pair programmed with

Example:

from git_pull import File

file = File("git_pull/git_pull.py", "git-pull", "shobrook", scrape_everything=True)
file.blames # >>> {"shobrook": {"line_nums": [1, 2, ...], "committers": []}}
file.raw_url # >>> "https://raw.githubusercontent.com/shobrook/git-pull/master/git_pull/git_pull.py"
file.type # >>> "Python"
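
Since scrape_blames() returns a plain dict, post-processing it is ordinary Python. For example, a minimal sketch (assuming the blame structure shown above; printed output is illustrative) that ranks contributors by the number of lines they wrote:

from git_pull import File

file = File("git_pull/git_pull.py", "git-pull", "shobrook")
blames = file.scrape_blames()

# Map each contributor to the number of lines they wrote
line_counts = {user: len(blame["line_nums"]) for user, blame in blames.items()}

# Print contributors in descending order of lines written
for user, count in sorted(line_counts.items(), key=lambda item: -item[1]):
    print(user, count) # e.g. "shobrook 312" (illustrative)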
