Load git stats into a local DuckDB database for easy analytics.

Project description

Git DB

Making sense of GitHub projects is harder than it should be.

  1. Default insights are poor and do not tell us much about how the project is evolving.
  2. The API model is not straightforward (issues & PRs are handled together, some calls are easier on v3 than v4,...).
  3. The typical approach consists of reading the API and computing metrics on the fly. This is too rigid, and new insights come at a high cost (API calls).

Our goal with Git DB is to focus on simplicity and flexibility:

  • We dump all the project's data into a local DuckDB database.
  • Your answers are one SQL query away.

Installation

pip install PyGitDB

What will you find?

stargazers

Column Name   Type
id            INTEGER
starred_at    TIMESTAMP
user_id       INTEGER
user_login    VARCHAR

issues

Column Name          Type
id                   INTEGER
number               INTEGER
title                VARCHAR
body                 VARCHAR
user_id              INTEGER
user_login           VARCHAR
state                VARCHAR
assignees            VARCHAR[]
labels               VARCHAR[]
created_at           TIMESTAMP
updated_at           TIMESTAMP
closed_at            TIMESTAMP
author_association   VARCHAR
html_uri             VARCHAR

pulls

Column Name          Type
id                   INTEGER
number               INTEGER
title                VARCHAR
body                 VARCHAR
user_id              INTEGER
user_login           VARCHAR
state                VARCHAR
assignees            VARCHAR[]
labels               VARCHAR[]
created_at           TIMESTAMP
updated_at           TIMESTAMP
merged_at            TIMESTAMP
closed_at            TIMESTAMP
author_association   VARCHAR
html_uri             VARCHAR

contributors

Column Name     Type
id              INTEGER
contributions   INTEGER
user_id         INTEGER
user_login      VARCHAR

latest_reviews

Column Name    Type
id             INTEGER
number         INTEGER
reviewer       VARCHAR
review_state   VARCHAR
reviewed_at    TIMESTAMP

weekly_commits

Column Name   Type
id            INTEGER
date          TIMESTAMP
commits       INTEGER
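
Once a database has been built (see How to run below), you can confirm these schemas directly from DuckDB. A minimal sketch, assuming the default gitdb.db file name used in the examples below:

import duckdb

# Open the file produced by gitdb; read-only access is enough for inspection.
conn = duckdb.connect(database="gitdb.db", read_only=True)

# List the loaded tables and describe one of them to see the columns above.
print(conn.execute("SHOW TABLES").fetchall())
print(conn.execute("DESCRIBE stargazers").fetchall())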

How to run

1. CLI

Once installed, you can prepare the database for your GitHub project directly with the CLI:

gitdb --help
usage: gitdb [-h] [-r REPO] [-o OWNER] [-f FILE] [--clean]

optional arguments:
  -h, --help            show this help message and exit
  -r REPO, --repo REPO  Repo to analyze
  -o OWNER, --owner OWNER
                        Repo owner
  -f FILE, --file FILE  Db file path
  --clean               Clean the existing db

For example:

gitdb -r OpenMetadata -o open-metadata
2023-05-01 11:38:01.472 | INFO     | gitdb.main:init:30 - Starting GitDB in gitdb.db...
2023-05-01 11:38:01.473 | INFO     | gitdb.api.client:__init__:45 - Preparing client with root='api.github.com' owner='open-metadata' repo='OpenMetadata' token=SecretStr('**********') start_date='Aug 1 2021' timeout=300 graphql='graphql'
2023-05-01 12:31:29.240 | INFO     | gitdb.dao.core:process:36 - Starting to process ReviewsDAO...
2023-05-01 12:31:29.240 | INFO     | gitdb.dao.core:process:36 - Starting to process IssuesDAO...
2023-05-01 12:31:29.241 | INFO     | gitdb.dao.core:process:36 - Starting to process StarsDAO...
2023-05-01 12:31:29.241 | INFO     | gitdb.dao.core:process:36 - Starting to process WeeklyCommitsDAO...
2023-05-01 12:31:29.241 | INFO     | gitdb.dao.core:process:36 - Starting to process ContributorsDAO...
2023-05-01 12:31:44.356 | INFO     | gitdb.main:init:54 - Loaded all data in 0.26767790695000004 min.

2. Run from Python

If you want to call the database generation from another Python program instead, you can use the following:

from gitdb.main import init

session = init(
   repo=...,
   owner=...,
   token=...,
   path=...,
)

The init method will create the DuckDB database and give you back the SQLAlchemy Session to start running your queries.
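
For example, a minimal sketch of querying through that Session, assuming standard SQLAlchemy usage (the repo and owner values come from the CLI example above; the token and path are placeholders):

from sqlalchemy import text

from gitdb.main import init

# Illustrative values only: repository and owner taken from the CLI example above.
session = init(
    repo="OpenMetadata",
    owner="open-metadata",
    token="<your-github-token>",
    path="gitdb.db",
)

# The Session behaves like any SQLAlchemy session: execute SQL and fetch the result.
stars = session.execute(text("SELECT count(*) FROM stargazers")).scalar()
print(f"Stargazers loaded: {stars}")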

Examples

Connect to the database using duckdb and start running queries:

import duckdb

conn = duckdb.connect(database="gitdb.db", read_only=True)
conn.execute("show tables").fetchall()

A typical question is how the number of stars evolves week by week. This can be answered with the following query:

WITH CTE AS (
    SELECT
        strftime(starred_at - INTERVAL (DAYOFWEEK(starred_at) - 1) DAY, '%Y/%m/%d') AS starred_week,
        count(id) AS stars
    FROM stargazers
    GROUP BY strftime(starred_at - INTERVAL (DAYOFWEEK(starred_at) - 1) DAY, '%Y/%m/%d')
)
SELECT
    starred_week AS week,
    SUM(stars) OVER (ORDER BY starred_week ASC) AS stars_by_week
FROM CTE
ORDER BY starred_week ASC
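
Along the same lines, the issues table can tell you how long issues typically stay open. A minimal sketch, assuming the same gitdb.db file and the columns listed above (closed_at is NULL for issues that are still open):

import duckdb

conn = duckdb.connect(database="gitdb.db", read_only=True)

# Average number of days between an issue being opened and closed; open issues are excluded.
query = """
    SELECT avg(date_diff('day', created_at, closed_at)) AS avg_days_to_close
    FROM issues
    WHERE closed_at IS NOT NULL
"""
print(conn.execute(query).fetchone())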

How does this work?

Dumping all the data into the db requires a large number of calls against the GitHub API, and running them sequentially can take quite a while. Our client has a get_all_parallel function that accepts the number of threads as a parameter and runs the calls in parallel by playing with totals and pagination, as sketched below.

Most of the work has gone into making the init call as fast as possible by leveraging multithreading on the host.
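
As a rough illustration of that pattern (not the package's actual code; the endpoint handling, helper names and page math below are assumptions), paginated GitHub v3 calls can be fanned out over a thread pool once the number of pages is known:

from concurrent.futures import ThreadPoolExecutor

import requests

# Illustrative sketch only: fetch every page of a paginated GitHub v3 endpoint in parallel.
def fetch_page(url: str, page: int, token: str) -> list:
    resp = requests.get(
        url,
        params={"per_page": 100, "page": page},
        headers={"Authorization": f"token {token}"},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()

def get_all_pages(url: str, total_pages: int, token: str, threads: int = 8) -> list:
    # One task per page; the pool keeps several requests in flight at once.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        pages = pool.map(lambda page: fetch_page(url, page, token), range(1, total_pages + 1))
    return [item for page in pages for item in page]

The package's get_all_parallel wraps this kind of logic and takes the number of threads as a parameter, as described above.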


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

PyGitDB-0.1.0.tar.gz (17.4 kB)

Uploaded Source

Built Distributions

PyGitDB-0.1.0-py3.9.egg (48.3 kB)

Uploaded Source

PyGitDB-0.1.0-py3-none-any.whl (26.6 kB)

Uploaded Python 3

File details

Details for the file PyGitDB-0.1.0.tar.gz.

File metadata

  • Download URL: PyGitDB-0.1.0.tar.gz
  • Upload date:
  • Size: 17.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.16

File hashes

Hashes for PyGitDB-0.1.0.tar.gz
Algorithm Hash digest
SHA256 f326ab8bcd42ada52604db834b82719988f13083a88f63cabd70dc5947e6faa6
MD5 da643d1d72a5b3737cc45002899f8d6e
BLAKE2b-256 6cd7c40c335622b1c6ea86b5c450bd98c65b7d8e9a29649cf7d8bce6e4e5b2f1


File details

Details for the file PyGitDB-0.1.0-py3.9.egg.

File metadata

  • Download URL: PyGitDB-0.1.0-py3.9.egg
  • Upload date:
  • Size: 48.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.16

File hashes

Hashes for PyGitDB-0.1.0-py3.9.egg
Algorithm Hash digest
SHA256 9a3128a203af8f36fc20eeb326ea8ad10b627d32bf84cd2e336f257fa967fa1f
MD5 ad1e1d82a5ae3d59d45344fd6fb2d6ea
BLAKE2b-256 8cab0a4cc72e5925ba36d8848383133425aeed7be888e3bf3922fa0a455899e7


File details

Details for the file PyGitDB-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: PyGitDB-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 26.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.16

File hashes

Hashes for PyGitDB-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 0ef501c5ae4922642f2f4110e96d29da3e44436785b61159112e76b9bbf99620
MD5 e2d5cc008a718a0482ff513abca7bee9
BLAKE2b-256 b753d53d1a3cb29fef2d19f705797060f2f8c84ae60f87621c9b4ef0c9b82d26
