Git metadata extractor for analysts.

diffhouse

diffhouse is a Git metadata extraction tool for Python, designed to enable large-scale repository analyses. Key features are:

  • Fast access to commit data, file changes, and more
  • Easy integration with pandas and polars
  • Simple-to-use Python interface

Requirements

Requires Git 2.22 or higher and Python 3.10 or higher.

User Guide

This guide aims to cover the basic use cases of diffhouse. For the list of available repository objects and fields, check out the API Reference.

Installation

Install diffhouse from PyPI:

pip install diffhouse

Quickstart

After importing the package in Python, construct a Repo object and define its target repository via the location argument; this can be either a remote URL or a local path. Pass blobs=True to extract file data as well.

Calling Repo.load() will load all requested metadata into memory, which can then be accessed through the object's properties.

blobs=True requires a complete clone of the repository and therefore takes longer to execute. Omit this argument whenever possible.

Example: Basic Querying

from diffhouse import Repo

r = Repo(
    location='https://github.com/octocat/Hello-World',
    blobs=True
).load()

for c in r.commits:
    print(c.commit_hash[:10], c.committer_date, c.author_email)

print(r.branches)

outputs:

7fd1a60b01 2012-03-06T15:06:50-08:00 octocat@nowhere.com
762941318e 2011-09-13T21:42:41-07:00 Johnneylee.rollins@gmail.com
553c2077f0 2011-01-26T11:06:08-08:00 cameron@github.com
['master', 'octocat-patch-1', 'test']

Tabular Data

The commits, changed_files, and diffs iterables can be passed directly to the pandas and polars DataFrame constructors. No pre-processing is needed; table schemas are inferred correctly.

Example: Using Polars

import polars as pl

df = pl.DataFrame(r.changed_files)
df.schema

outputs:

Schema([('commit_hash', String),
        ('path_a', String),
        ('path_b', String),
        ('changed_file_id', String),
        ('change_type', String),
        ('similarity', Int64),
        ('lines_added', Int64),
        ('lines_deleted', Int64)])

diffhouse stores datetime values as ISO 8601 strings to preserve time zone offsets. When converting these to datetime objects in a DataFrame, use the parser's UTC option.
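
As a minimal sketch with pandas (polars offers an analogous option on its string-to-datetime parser), using the committer dates from the output above:

```python
import pandas as pd

# ISO 8601 strings carrying UTC offsets, as stored by diffhouse
# (taken from the commit output shown earlier).
dates = pd.Series([
    '2012-03-06T15:06:50-08:00',
    '2011-09-13T21:42:41-07:00',
    '2011-01-26T11:06:08-08:00',
])

# utc=True folds the mixed offsets into one timezone-aware UTC column;
# without it, pandas cannot combine them into a single datetime dtype.
parsed = pd.to_datetime(dates, utc=True)
print(parsed.dt.tz)  # UTC
```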

Lazy Loading

For large repositories (100k+ commits), passing blobs=True and calling load() can take up gigabytes of memory; in these cases, it's better to use the lazy method:

with Repo(
    location='https://github.com/octocat/Hello-World',
    blobs=True
) as r:
    for d in r.diffs:
        if d.lines_added == 3:
            break

This has two benefits:

  1. Data is only loaded for accessed properties.
  2. Properties act as lazy iterators, only loading one record at a time.
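
As an illustrative sketch of this streaming pattern, assuming changed_files records expose the schema fields shown earlier as attributes; the stand-in records below are hypothetical, and in real use you would iterate r.diffs or r.changed_files inside the with block instead:

```python
from collections import Counter, namedtuple

# Hypothetical stand-in records mimicking the attribute access used above;
# with a real repository, iterate r.changed_files inside `with Repo(...)`.
Rec = namedtuple('Rec', ['commit_hash', 'change_type'])
records = iter([
    Rec('7fd1a60b01', 'M'),
    Rec('762941318e', 'A'),
    Rec('553c2077f0', 'M'),
])

# A single lazy pass: only one record is in memory at a time, so the
# overhead stays constant regardless of repository size.
counts = Counter(rec.change_type for rec in records)
print(counts)  # Counter({'M': 2, 'A': 1})
```

Because the properties are one-shot iterators in this mode, any aggregation that needs a single pass (counts, sums, early exit on a match) runs in constant memory.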
