Git DB
Load git stats into a local DuckDB database for easy analytics.
Making sense of GitHub projects is harder than it should be.
- Default insights are poor and don't give us much information on how the project is evolving.
- The API model is not straightforward (issues & PRs are handled together, some calls are easier on v3 than v4, ...).
- The typical approach consists of reading the API and computing metrics on the fly. This is too rigid, and each new insight has a high cost (more API calls).
Our goal with Git DB is to focus on simplicity and flexibility:
- We dump all the project's data to a local DuckDB database.
- Your answers are one SQL query away.
Installation
```
pip install PyGitDB
```
What will you find?
stargazers
Column Name | Type |
---|---|
id | INTEGER |
starred_at | TIMESTAMP |
user_id | INTEGER |
user_login | VARCHAR |
issues
Column Name | Type |
---|---|
id | INTEGER |
number | INTEGER |
title | VARCHAR |
body | VARCHAR |
user_id | INTEGER |
user_login | VARCHAR |
state | VARCHAR |
assignees | VARCHAR[] |
labels | VARCHAR[] |
created_at | TIMESTAMP |
updated_at | TIMESTAMP |
closed_at | TIMESTAMP |
author_association | VARCHAR |
html_uri | VARCHAR |
pulls
Column Name | Type |
---|---|
id | INTEGER |
number | INTEGER |
title | VARCHAR |
body | VARCHAR |
user_id | INTEGER |
user_login | VARCHAR |
state | VARCHAR |
assignees | VARCHAR[] |
labels | VARCHAR[] |
created_at | TIMESTAMP |
updated_at | TIMESTAMP |
merged_at | TIMESTAMP |
closed_at | TIMESTAMP |
author_association | VARCHAR |
html_uri | VARCHAR |
contributors
Column Name | Type |
---|---|
id | INTEGER |
contributions | INTEGER |
user_id | INTEGER |
user_login | VARCHAR |
latest_reviews
Column Name | Type |
---|---|
id | INTEGER |
number | INTEGER |
reviewer | VARCHAR |
review_state | VARCHAR |
reviewed_at | TIMESTAMP |
weekly_commits
Column Name | Type |
---|---|
id | INTEGER |
date | TIMESTAMP |
commits | INTEGER |
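For example, the array columns (assignees, labels) can be exploded directly in SQL. A minimal sketch (assuming the default gitdb.db file produced by the CLI below):

```python
import duckdb

# Open the database produced by gitdb in read-only mode.
conn = duckdb.connect(database="gitdb.db", read_only=True)

# Issues by state, straight from the `issues` table documented above.
print(conn.execute("SELECT state, count(*) FROM issues GROUP BY state").fetchall())

# `labels` is a VARCHAR[], so unnest it to rank the most-used labels.
print(
    conn.execute(
        """
        SELECT unnest(labels) AS label, count(*) AS total
        FROM issues
        GROUP BY label
        ORDER BY total DESC
        LIMIT 10
        """
    ).fetchall()
)
```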
How to run
1. CLI
Once installed, you can prepare the database for your GitHub project directly with the CLI:
```
❯ gitdb --help
usage: gitdb [-h] [-r REPO] [-o OWNER] [-f FILE] [--clean]

optional arguments:
  -h, --help            show this help message and exit
  -r REPO, --repo REPO  Repo to analyze
  -o OWNER, --owner OWNER
                        Repo owner
  -f FILE, --file FILE  Db file path
  --clean               Clean the existing db
```
For example:
```
❯ gitdb -r OpenMetadata -o open-metadata
2023-05-01 11:38:01.472 | INFO | gitdb.main:init:30 - Starting GitDB in gitdb.db...
2023-05-01 11:38:01.473 | INFO | gitdb.api.client:__init__:45 - Preparing client with root='api.github.com' owner='open-metadata' repo='OpenMetadata' token=SecretStr('**********') start_date='Aug 1 2021' timeout=300 graphql='graphql'
2023-05-01 12:31:29.240 | INFO | gitdb.dao.core:process:36 - Starting to process ReviewsDAO...
2023-05-01 12:31:29.240 | INFO | gitdb.dao.core:process:36 - Starting to process IssuesDAO...
2023-05-01 12:31:29.241 | INFO | gitdb.dao.core:process:36 - Starting to process StarsDAO...
2023-05-01 12:31:29.241 | INFO | gitdb.dao.core:process:36 - Starting to process WeeklyCommitsDAO...
2023-05-01 12:31:29.241 | INFO | gitdb.dao.core:process:36 - Starting to process ContributorsDAO...
2023-05-01 12:31:44.356 | INFO | gitdb.main:init:54 - Loaded all data in 0.26767790695000004 min.
```
2. Run from Python
If, instead, you want to trigger the database generation from another Python program, you can use the following:

```python
from gitdb.main import init

session = init(
    repo=...,
    owner=...,
    token=...,
    path=...,
)
```
The init method will create the DuckDB database and give you the SQLAlchemy Session to start running your queries.
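As a quick sketch of what you can do with that session (assuming it is a standard SQLAlchemy Session, as described above; the repo, owner, token, and path values are placeholders):

```python
from sqlalchemy import text

from gitdb.main import init

# Placeholder arguments: swap in your own repo, owner, token, and db path.
session = init(
    repo="OpenMetadata",
    owner="open-metadata",
    token="<your GitHub token>",
    path="gitdb.db",
)

# A standard SQLAlchemy Session can run raw SQL through text().
total = session.execute(text("SELECT count(*) FROM stargazers")).scalar()
print(f"Stargazers loaded: {total}")
```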
Examples
Connect to the database using duckdb and start running queries:

```python
import duckdb

conn = duckdb.connect(database="gitdb.db", read_only=True)
conn.execute("show tables").fetchall()
```
A typical question is how the stars evolve week by week. This can be answered with the following query:
```sql
WITH cte AS (
    SELECT
        strftime(starred_at - INTERVAL (DAYOFWEEK(starred_at) - 1) DAY, '%Y/%m/%d') AS starred_week,
        count(id) AS stars
    FROM stargazers
    GROUP BY strftime(starred_at - INTERVAL (DAYOFWEEK(starred_at) - 1) DAY, '%Y/%m/%d')
)
SELECT
    starred_week AS week,
    SUM(stars) OVER (ORDER BY starred_week ASC) AS total_stars
FROM cte
ORDER BY starred_week ASC
```
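If you prefer to stay in Python, the same result converts straight into a pandas DataFrame via DuckDB's df() helper (a sketch reusing the conn from above, assuming pandas is installed):

```python
# Cumulative stars per week, reusing the read-only connection from above.
query = """
WITH cte AS (
    SELECT
        strftime(starred_at - INTERVAL (DAYOFWEEK(starred_at) - 1) DAY, '%Y/%m/%d') AS starred_week,
        count(id) AS stars
    FROM stargazers
    GROUP BY 1
)
SELECT
    starred_week AS week,
    SUM(stars) OVER (ORDER BY starred_week ASC) AS total_stars
FROM cte
ORDER BY starred_week ASC
"""

# DuckDB results convert directly to pandas DataFrames.
stars_per_week = conn.execute(query).df()
print(stars_per_week.tail())
```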
How does this work?
We run a large number of calls against the GitHub API when dumping all the data into the db, and doing this sequentially can take rather long. Our client has a get_all_parallel function that accepts the number of threads as a parameter and runs the calls in parallel by playing with totals and pagination.
The big chunk of work has gone into making the init call as fast as possible by leveraging multithreading on the host.
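The underlying pattern looks roughly like this (a simplified sketch, not the actual client code; the fetch_page helper, endpoint handling, and page size are illustrative): read the total count first, derive the page numbers, and fan the page requests out over a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any

import requests

PER_PAGE = 100  # GitHub's v3 maximum page size


def fetch_page(url: str, token: str, page: int) -> list[dict[str, Any]]:
    """Fetch a single page from a paginated GitHub v3 endpoint."""
    resp = requests.get(
        url,
        headers={"Authorization": f"token {token}"},
        params={"per_page": PER_PAGE, "page": page},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()


def get_all_parallel(url: str, token: str, total: int, threads: int = 8) -> list[dict[str, Any]]:
    """Knowing the total upfront, fire one request per page across a thread pool."""
    pages = range(1, (total // PER_PAGE) + 2)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = pool.map(lambda page: fetch_page(url, token, page), pages)
    # Flatten the per-page lists into a single list of records.
    return [item for page in results for item in page]
```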