Repository mining tool for structuring Git metadata at scale.
diffhouse: Repository Mining at Scale
diffhouse is a Python solution for structuring Git metadata, designed to enable large-scale codebase analysis at practical speeds.
Key features are:
- Fast access to commit data, file changes and more
- Easy integration with pandas and polars
- Simple-to-use Python interface
Requirements
| Requirement | Version |
|---|---|
| Python | 3.10 or higher |
| Git | 2.22 or higher |
Git also needs to be added to the system PATH.
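The requirements above can be checked from Python before installing. The following is a minimal sketch using only the standard library; the helper names `parse_git_version` and `git_is_usable` are illustrative and not part of diffhouse.

```python
import re
import shutil
import subprocess
import sys

MIN_PYTHON = (3, 10)
MIN_GIT = (2, 22)


def parse_git_version(output: str) -> tuple[int, int]:
    """Extract (major, minor) from the text printed by `git --version`."""
    match = re.search(r"(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"no version number found in {output!r}")
    return int(match.group(1)), int(match.group(2))


def git_is_usable() -> bool:
    """Return True if git is on the PATH and meets the minimum version."""
    if shutil.which("git") is None:  # git is not on the system PATH
        return False
    result = subprocess.run(
        ["git", "--version"], capture_output=True, text=True, check=True
    )
    return parse_git_version(result.stdout) >= MIN_GIT


if __name__ == "__main__":
    print("Python OK:", sys.version_info[:2] >= MIN_PYTHON)
    print("Git OK:", git_is_usable())
```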
Limitations
At its core, diffhouse is a data extraction tool, so it does not calculate software metrics such as code churn or cyclomatic complexity. If you need these, take a look at PyDriller instead.
Also note that revision data is limited to default branches only.
User Guide
This guide aims to cover the basic use cases of diffhouse. For the list of available repository objects and fields, check out the API Reference.
Installation
Install diffhouse from PyPI:
pip install diffhouse
Quickstart
from diffhouse import Repo
url = 'https://github.com/user/repo'
r = Repo(location = url, blobs = True).load()
for c in r.commits:
    print(c.commit_hash[:10], c.committer_date, c.author_email)
print(r.branches)
print(r.diffs[0].to_dict())
First, construct a Repo object and define its target repository via the location argument; this can be either a remote URL or a local path. Pass blobs = True to extract file data as well. Calling Repo.load() will load all metadata into memory, which can then be accessed through the object's properties.
blobs = True requires a complete clone of the repository and therefore takes longer to execute. Omit this argument whenever possible.
Lazy Loading
For large repositories, calling load() can be slow and/or take up gigabytes of memory. In that case, it is recommended to use Repo as a context manager (a with statement) and stream objects lazily instead:
with Repo(location = url, blobs = True) as r:
    c = list(r.stream_commits())
    for d in r.stream_diffs():
        if d.lines_added == 3:
            break
This brings two big benefits:
- Object streaming functions are lazy generators, allowing for efficient memory use.
- No processing power is spent on objects that are not explicitly requested.
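To illustrate why lazy generators give these benefits, here is a small standard-library sketch. It does not use diffhouse; `stream_items` is a hypothetical stand-in for a streaming method, and the dictionaries it yields are made-up data. The point is only that a generator builds items on demand and stops as soon as the consumer stops asking.

```python
produced = []  # records which items the generator actually built


def stream_items(n):
    """Hypothetical stand-in for a streaming method: yields one item at a time."""
    for i in range(n):
        produced.append(i)  # work happens only when an item is requested
        yield {"index": i, "lines_added": i % 5}


# Stop at the first item with three added lines, as in the loop above.
first_match = next(d for d in stream_items(1_000_000) if d["lines_added"] == 3)

print(first_match["index"])  # index of the first matching item
print(len(produced))         # number of items built before stopping
```

Although the generator could produce a million items, only the ones up to the first match are ever created, so memory use and processing time stay proportional to what was consumed.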
Tabular Data
Commit, ChangedFile and Diff iterables can be passed directly to
pandas and polars DataFrame constructors. No pre-processing is needed;
table schemas will be inferred correctly.
import polars as pl
df = pl.DataFrame(r.changed_files)
print(df.schema)
diffhouse stores datetime values as ISO 8601 strings to preserve time zone offsets. When converting these to datetime objects in a
DataFrame, use the parser's UTC option.
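The reason for the UTC option is that a single column can contain several different offsets, which most datetime column types cannot represent. The sketch below shows the normalization using only the standard library, with made-up timestamp values; in pandas, the equivalent option is `pandas.to_datetime(..., utc=True)`.

```python
from datetime import datetime, timezone

# Hypothetical ISO 8601 strings with different time zone offsets.
iso_values = [
    "2024-03-01T09:15:00+02:00",
    "2024-03-01T08:15:00+01:00",
    "2024-03-01T07:15:00+00:00",
]


def to_utc(value: str) -> datetime:
    """Parse an ISO 8601 string and convert it to a UTC-aware datetime."""
    return datetime.fromisoformat(value).astimezone(timezone.utc)


utc_values = [to_utc(v) for v in iso_values]

for original, converted in zip(iso_values, utc_values):
    print(original, "->", converted.isoformat())
```

After conversion, all values share one time zone and can be compared or sorted directly. The original offsets remain available in the string column if they are needed later.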