Git metadata extractor for analysts.
Project description
diffhouse
diffhouse is a Git metadata extraction tool for Python, designed to enable large-scale repository analyses. Key features are:
- Fast access to commit data, file changes and more
- Easy integration with pandas and polars
- Simple-to-use Python interface
Requirements
Requires Git 2.22 or higher and Python 3.10 or higher.
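To check an environment against these minimums before installing, something like the following sketch may help. It is not part of diffhouse; the helper names `parse_git_version` and `installed_git_version` are made up for illustration, and it assumes `git --version` prints the usual `git version X.Y.Z` line.

```python
import re
import shutil
import subprocess
import sys

MIN_GIT = (2, 22)
MIN_PYTHON = (3, 10)


def parse_git_version(text):
    """Extract (major, minor) from `git --version` output, or None if absent."""
    match = re.search(r"git version (\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def installed_git_version():
    """Return (major, minor) of the git found on PATH, or None if unavailable."""
    if shutil.which("git") is None:
        return None
    result = subprocess.run(["git", "--version"], capture_output=True, text=True)
    return parse_git_version(result.stdout)


git_version = installed_git_version()
# Tuples compare element by element, so (2, 9) correctly sorts below (2, 22).
print("Python OK:", sys.version_info[:2] >= MIN_PYTHON)
print("Git OK:", git_version is not None and git_version >= MIN_GIT)
```

Comparing versions as integer tuples rather than strings avoids the pitfall where "2.9" sorts above "2.22".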
User Guide
This guide aims to cover the basic use cases of diffhouse. For the list of available repository objects and fields, check out the API Reference.
Installation
Install diffhouse from PyPI:
pip install diffhouse
Quickstart
After importing the package in Python, construct a Repo object and define
its target repository via the location argument; this can be either a
remote URL or a local path. Pass blobs = True to extract file data as well.
Calling Repo.load() will load all requested metadata into memory, which can
then be accessed through the object's
properties.
blobs = True requires a complete clone of the repository and therefore takes longer to execute. Omit this argument whenever possible.
Example: Basic Querying
from diffhouse import Repo
r = Repo(
    location = 'https://github.com/octocat/Hello-World',
    blobs = True
).load()
for c in r.commits:
    print(c.commit_hash[:10], c.committer_date, c.author_email)
print(r.branches)
outputs:
7fd1a60b01 2012-03-06T15:06:50-08:00 octocat@nowhere.com
762941318e 2011-09-13T21:42:41-07:00 Johnneylee.rollins@gmail.com
553c2077f0 2011-01-26T11:06:08-08:00 cameron@github.com
['master', 'octocat-patch-1', 'test']
Tabular Data
The commits, changed_files and diffs iterables can be passed directly to
pandas and polars DataFrame constructors. No pre-processing is needed;
table schemas will be inferred correctly.
Example: Using Polars
import polars as pl
df = pl.DataFrame(r.changed_files)
df.schema
outputs:
Schema([('commit_hash', String),
        ('path_a', String),
        ('path_b', String),
        ('changed_file_id', String),
        ('change_type', String),
        ('similarity', Int64),
        ('lines_added', Int64),
        ('lines_deleted', Int64)])
diffhouse stores datetime values as ISO 8601 strings to preserve time zone offsets. When converting these to datetime objects in a
DataFrame, use the parser's UTC option.
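As an illustration of the UTC option, here is a minimal sketch that assumes pandas and reads "the parser's UTC option" as the `utc=True` argument of `pandas.to_datetime`. The rows are stand-ins copied from the Quickstart output; in practice the frame would be built from `r.commits`.

```python
import pandas as pd

# Stand-in rows taken from the Quickstart output (not a live diffhouse call).
df = pd.DataFrame(
    {
        "commit_hash": ["7fd1a60b01", "762941318e", "553c2077f0"],
        "committer_date": [
            "2012-03-06T15:06:50-08:00",
            "2011-09-13T21:42:41-07:00",
            "2011-01-26T11:06:08-08:00",
        ],
    }
)

# utc=True converts every offset to UTC, so the column gets a single
# timezone-aware dtype even though the strings mix -08:00 and -07:00.
df["committer_date_utc"] = pd.to_datetime(df["committer_date"], utc=True)
print(df[["commit_hash", "committer_date_utc"]])
```

The original string column is kept alongside the converted one, so the local offset of each commit remains available after normalization.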
Lazy Loading
For large repositories (100k+ commits), passing blobs = True and calling
load() can take up gigabytes of memory; in these cases, it's better to use
the lazy method:
with Repo(
    location='https://github.com/octocat/Hello-World',
    blobs = True
) as r:
    for d in r.diffs:
        if d.lines_added == 3:
            break
This has two benefits:
- Data is only loaded for accessed properties.
- Properties act as lazy iterators, only loading one record at a time.
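The effect of record-at-a-time iteration can be sketched without a repository. In the example below, `fake_diffs` is a hypothetical generator standing in for a lazy property such as `r.diffs`, and its records carry only a made-up `lines_added` value; the point is that stopping early leaves the remaining records ungenerated.

```python
from itertools import islice
from types import SimpleNamespace


def fake_diffs(n, produced):
    """Hypothetical stand-in for a lazy record iterator such as r.diffs."""
    for i in range(n):
        produced.append(i)  # record that this item was actually generated
        yield SimpleNamespace(lines_added=i % 5)


# Early exit: stop at the first record with exactly 3 added lines.
produced_search = []
hit = next(
    (d for d in fake_diffs(1_000_000, produced_search) if d.lines_added == 3),
    None,
)
print(hit.lines_added, len(produced_search))  # prints "3 4"

# Bounded scan: aggregate over the first 10 records only.
produced_scan = []
total = sum(d.lines_added for d in islice(fake_diffs(1_000_000, produced_scan), 10))
print(total, len(produced_scan))  # prints "20 10"
```

Both patterns touch only a handful of the million potential records, which is the behavior that keeps memory use flat on large repositories.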
File details
Details for the file diffhouse-0.3.0.tar.gz.
File metadata
- Download URL: diffhouse-0.3.0.tar.gz
- Upload date:
- Size: 10.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.22
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c8d641a19f29090b72190ef8396828493d59b099020779121985355f5065a512 |
| MD5 | c9a3fd455ff8f41de0f5d72e86a418d8 |
| BLAKE2b-256 | 762db8e4f43e037361f4791219621565702b8f779976edca960841eb0e939ba8 |
File details
Details for the file diffhouse-0.3.0-py3-none-any.whl.
File metadata
- Download URL: diffhouse-0.3.0-py3-none-any.whl
- Upload date:
- Size: 15.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.22
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b861eca86cdd9c98f9ecfe5a9c16354cc732d5a9fc42b4080aa01aa59694135b |
| MD5 | 59cdae152cbeeaa01984e405dee27194 |
| BLAKE2b-256 | b0ebcb5deee45c8d2a1a67dfceae25cca8f1a4ca9567b9d15ec35d74f95429c5 |