
Repository mining tool for structuring Git metadata at scale.


diffhouse: Repository Mining at Scale



diffhouse is a Python solution for structuring Git metadata, designed to enable large-scale codebase analysis at practical speeds.

Key features are:

  • Fast access to commit data, file changes and more
  • Easy integration with pandas and polars
  • Simple-to-use Python interface

Requirements

diffhouse requires Git 2.22 or higher to be available on the system PATH.
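The requirement can be checked before running an analysis. The following is a minimal sketch using only the Python standard library; the helper names `parse_git_version` and `git_is_supported` are hypothetical and not part of diffhouse, and the parser assumes the usual `git version X.Y.Z` output format.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple

# Minimum Git version stated in the requirements above
MIN_GIT = (2, 22)


def parse_git_version(output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from `git --version` output, or None if absent."""
    match = re.search(r"git version (\d+)\.(\d+)", output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def git_is_supported() -> bool:
    """Return True if a Git executable on PATH reports at least MIN_GIT."""
    if shutil.which("git") is None:
        return False  # Git is not on the system PATH
    result = subprocess.run(
        ["git", "--version"], capture_output=True, text=True, check=False
    )
    version = parse_git_version(result.stdout)
    return version is not None and version >= MIN_GIT
```

Calling `git_is_supported()` at the start of a script gives an early, readable failure instead of an error from deep inside the extraction step.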

Performance

Repositories benchmarked: tweenjs/tween.js, sqlflow, scrapy
Measured per repository: Commits, Files, Diffs

PyDriller: 1.5s, 6.5s, 6.7s, 8.8s, 22.8s, 23.6s
diffhouse:

Limitations

At its core, diffhouse is a data extraction tool and therefore does not calculate software metrics like code churn or cyclomatic complexity; if you need these, take a look at PyDriller instead.

Also note that revision data is limited to the default branch.

User Guide

This guide aims to cover the basic use cases of diffhouse. For the list of available repository objects and fields, check out the API Reference.

Installation

Install diffhouse from PyPI:

pip install diffhouse

Quickstart

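from diffhouse import Repo

# Target repository: a remote URL or a local path
url = 'https://github.com/user/repo'

# blobs=True also extracts file data (requires a complete clone)
r = Repo(location=url, blobs=True).load()

# Print a short hash, the committer date and the author email of each commit
for c in r.commits:
    print(c.commit_hash[:10], c.committer_date, c.author_email)

print(r.branches)
print(r.diffs[0].to_dict())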

First, construct a Repo object and define its target repository via the location argument; this can be either a remote URL or a local path. Pass blobs = True to extract file data as well.

Calling Repo.load() loads all metadata into memory, where it can be accessed through the object's properties.

blobs = True requires a complete clone of the repository and therefore takes longer to execute. Omit this argument whenever possible.

Lazy Loading

For large repositories, calling load() can be slow and can take up gigabytes of memory. In that case, use Repo as a context manager in a with statement and stream objects lazily:

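with Repo(location=url, blobs=True) as r:
    # Materialize all commits into a list
    c = list(r.stream_commits())

    # Diffs are produced one at a time; stop at the first one that adds exactly 3 lines
    for d in r.stream_diffs():
        if d.lines_added == 3:
            break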

This brings two big benefits:

  1. Object streaming functions are lazy generators, allowing for efficient memory use.
  2. No processing power is spent on objects that are not explicitly requested.
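The effect of lazy generation can be illustrated without diffhouse. In this sketch, `stream_items` is a hypothetical stand-in for a streaming function, not part of the library; it records which items it actually produces so that the early exit is visible.

```python
from typing import Iterator, List

# Records which items the generator actually produced
produced: List[int] = []


def stream_items(total: int) -> Iterator[int]:
    """Hypothetical stand-in for a streaming function: yields items one by one."""
    for i in range(total):
        produced.append(i)
        yield i


# Stop at the first match; the remaining items are never generated
first_match = next(item for item in stream_items(1_000_000) if item == 3)

print(first_match)  # 3
print(produced)     # [0, 1, 2, 3]
```

Although the generator could yield a million items, only the four needed to reach the match are ever created, which is the same reason breaking out of a streaming loop early saves both memory and processing time.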


Tabular Data

Commit, ChangedFile and Diff iterables can be passed directly to pandas and polars DataFrame constructors. No pre-processing is needed; table schemas will be inferred correctly.

import polars as pl

df = pl.DataFrame(r.changed_files)
print(df.schema)

diffhouse stores datetime values as ISO 8601 strings to preserve time zone offsets. When converting these to datetime objects in a DataFrame, use the parser's UTC option.
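To show what the UTC option does, here is a small standard-library sketch that normalizes ISO 8601 strings with different offsets to UTC. The sample timestamps and the `to_utc` helper are made up for illustration. In pandas, the corresponding option is the `utc=True` argument of `pandas.to_datetime`.

```python
from datetime import datetime, timezone


def to_utc(iso_string: str) -> datetime:
    """Parse an ISO 8601 string that carries a UTC offset and convert it to UTC."""
    parsed = datetime.fromisoformat(iso_string)
    if parsed.tzinfo is None:
        raise ValueError(f"no UTC offset in {iso_string!r}")
    return parsed.astimezone(timezone.utc)


# Hypothetical commit dates recorded in two different time zones
samples = ["2024-03-01T12:00:00+02:00", "2024-03-01T07:30:00-05:00"]
utc_values = [to_utc(s) for s in samples]

for original, converted in zip(samples, utc_values):
    print(original, "->", converted.isoformat())
```

Without this normalization, a column holding mixed offsets cannot be represented as a single time-zone-aware datetime type, which is why the conversion to UTC is needed when parsing in a DataFrame.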
