gharc: GitHub Archive Stream-Processor
Mine the GitHub Archive on a standard laptop.
gharc is a command-line tool and Python library that filters the GitHub Archive dataset on consumer hardware. Each hourly archive is streamed through memory, filtered against your criteria, and written out as Parquet or JSONL. Peak local storage stays bounded by a single in-flight download (about 150 MB) regardless of how long a window you process.
Why gharc?
The full GitHub Archive dataset runs to many terabytes of compressed JSON and keeps growing. Traditional analysis requires either massive local storage or a cloud-warehouse account (BigQuery, Snowflake).
gharc solves this by implementing a Stream-and-Filter architecture:
- Streaming: Downloads each hourly archive (~60 to 150 MB compressed in 2024) to a temporary file.
- Filtering: Extracts only events matching your criteria (e.g., specific repos or event types).
- Writing: Streams matching events into a single Parquet or JSONL file via pyarrow.parquet.ParquetWriter for true append.
- Cleanup: Deletes the temporary download immediately afterward, so disk usage never accumulates.
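In outline, one hour of work looks roughly like the sketch below. This is only an illustration of the stream-and-filter loop described above, not the gharc internals: the function name, the event-matching logic, and the output handle are assumptions for the example.

import gzip
import json
import os
import tempfile
import urllib.request

def process_hour(url: str, repos: set[str], out) -> None:
    # 1. Stream the hourly archive down to a temporary file.
    fd, tmp_path = tempfile.mkstemp(suffix=".json.gz")
    os.close(fd)
    try:
        urllib.request.urlretrieve(url, tmp_path)
        # 2. Decompress line by line and keep only matching events.
        with gzip.open(tmp_path, "rt", encoding="utf-8") as fh:
            for line in fh:
                event = json.loads(line)
                if event.get("repo", {}).get("name") in repos:
                    out.write(line)  # 3. Append to the single output file.
    finally:
        # 4. Cleanup: the per-hour download never accumulates on disk.
        os.remove(tmp_path)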
Ideal for:
- Academic research on Open Source Software (OSS).
- Large scale data mining on consumer hardware.
- Creating custom datasets for specific organizations or ecosystems.
Key Features
- Zero-Storage Overhead: Processes terabytes of source data with a local disk footprint bounded by a single in-flight download (about 150 MB).
- Resumable Downloads: Smart handling of network interruptions (common with residential internet) using HTTP Range requests.
- High Performance:
- Parallel processing with thread pools.
- Optimized "Fast String Check" (zero-copy filtering) to skip irrelevant data.
- Optional orjson support for 3-5x faster parsing.
- Parquet Native: Outputs columnar data ready for Pandas, Spark, or Polars, often reducing file size by 90% compared to JSON.
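To make the filtering bullets concrete, the sketch below shows one way the fast string check and the optional orjson path could fit together. It is a hedged illustration, not the gharc implementation; keep_event and _loads are hypothetical names.

try:
    import orjson  # used automatically when installed (the "fast" extra)

    def _loads(raw: bytes) -> dict:
        return orjson.loads(raw)
except ImportError:
    import json

    def _loads(raw: bytes) -> dict:
        return json.loads(raw)

def keep_event(raw_line: bytes, repos: list[str]) -> dict | None:
    # Cheap substring pre-check: skip the expensive JSON parse for the
    # vast majority of lines that cannot possibly match.
    if not any(repo.encode() in raw_line for repo in repos):
        return None
    event = _loads(raw_line)
    return event if event.get("repo", {}).get("name") in repos else None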
Performance
Measured on a Windows 11 laptop (12 logical cores, 15 GB RAM) over a typical residential connection. Reproducible scripts in benchmarks/.
A six-hour window of GHArchive (2024-01-01 00:00 to 06:00 UTC), filtered to apache/spark:
| Workers | Wall-clock | Hours/sec | Spark events | Peak RSS |
|---|---|---|---|---|
| 1 | 76.0 s | 0.079 | 14 | 94.2 MB |
| 4 | 58.1 s | 0.103 | 14 | 106.7 MB |
Both runs recovered the same events, so concurrency does not affect output. Peak RSS stays below 110 MB. The bottleneck on residential links is HTTPS download throughput rather than CPU; additional workers help up to a point and then saturate the connection.
The same six-hour window comprises about 1.2 GB of compressed source on the GHArchive side, while the filtered Parquet output is 53 KB. That is a storage saving of roughly 22,000 to 1, and at no point does peak local disk exceed the size of a single in-flight temporary file (about 150 MB).
Installation
Prerequisites
- Python 3.10 or higher
- pip
Install from PyPI
pip install gharc
Install from Source
git clone https://github.com/aravpanwar/gharc.git
cd gharc
python -m venv venv
# macOS / Linux:
source venv/bin/activate
# Windows PowerShell:
# .\venv\Scripts\Activate.ps1
pip install -e .
Optional Performance Boost
For maximum speed, install with the fast extra. gharc detects and uses orjson automatically when available.
pip install "gharc[fast]"
Usage
Basic Command
Download all activity for a specific repository over a one-day window.
Note that --end is exclusive, so this covers all 24 hours of 2024-01-01.
gharc download \
--start 2024-01-01 \
--end 2024-01-02 \
--repos "apache/spark" \
--output spark_data.parquet
For multi-hour or multi-day runs, prefer --output run.jsonl so the run can resume from where it left off if it crashes; convert to Parquet at the end with gharc convert run.jsonl run.parquet. See Resumable runs below for details.
Advanced Filtering
Filter for multiple repositories and specific event types (e.g., only Pull Requests and Pushes). This covers all of June 2023 (June 1 inclusive through July 1 exclusive).
gharc download \
--start 2023-06-01 \
--end 2023-07-01 \
--repos "apache/spark, pandas-dev/pandas, pytorch/pytorch" \
--event-types "PullRequestEvent, PushEvent" \
--output oss_summer_2023.parquet \
--workers 4
Arguments
| Argument | Description | Example |
|---|---|---|
| --start | Start date, inclusive (YYYY-MM-DD or YYYY-MM-DD-HH) | 2024-01-01 |
| --end | End date, exclusive (YYYY-MM-DD or YYYY-MM-DD-HH) | 2024-02-01 |
| --repos | Comma-separated list of repositories to keep | apache/spark,tensorflow/tensorflow |
| --event-types | Comma-separated list of GHArchive event types | WatchEvent,ForkEvent |
| --output | Output filename (.parquet or .jsonl) | data.parquet |
| --workers | Number of parallel download threads (default: 4) | 8 |
Resumable runs
For long jobs, gharc keeps a small <output>.state.json next to the output file listing which hours it has already processed. If the run crashes, restarting the same command picks up where it left off rather than redoing completed hours. The state file is removed automatically when the run finishes cleanly.
Resume support requires JSONL output. Parquet writers cannot append to a closed file, so for multi-hour runs use --output run.jsonl and convert to Parquet at the end:
gharc convert run.jsonl run.parquet
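For intuition, an hour-level resume check can be as simple as the sketch below. The exact schema of <output>.state.json is a gharc internal; the "done" key and hour-string layout assumed here are illustrative only.

import json
from pathlib import Path

def completed_hours(output: str) -> set[str]:
    # Assumed layout: {"done": ["2024-01-01-00", "2024-01-01-01", ...]}
    state_path = Path(f"{output}.state.json")
    if not state_path.exists():
        return set()
    return set(json.loads(state_path.read_text()).get("done", []))

def mark_done(output: str, hour_key: str) -> None:
    # Record an hour as finished so a restart can skip it.
    state_path = Path(f"{output}.state.json")
    done = completed_hours(output)
    done.add(hour_key)
    state_path.write_text(json.dumps({"done": sorted(done)}))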
Python API
The CLI is a thin wrapper around gharc.process_range, which you can call directly:
from datetime import datetime
import gharc
gharc.setup_logging()
gharc.process_range(
start=datetime(2024, 1, 1),
end=datetime(2024, 1, 2),
repos=["apache/spark"],
event_types=None,
output="spark_one_day.jsonl",
workers=4,
)
gharc.jsonl_to_parquet("spark_one_day.jsonl", "spark_one_day.parquet")
__all__ in gharc/__init__.py lists the public surface (process_range, jsonl_to_parquet, DataWriter, parse_date, date_range, get_url_for_time, setup_logging, plus the filter helpers).
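The filtered output is ordinary Parquet, so downstream analysis needs nothing gharc-specific. For example, assuming the GHArchive "type" field is preserved as a column (as it appears in the raw events):

import pandas as pd

df = pd.read_parquet("spark_one_day.parquet")
print(df["type"].value_counts())  # e.g. PushEvent, PullRequestEvent, ...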
Automating Bulk Downloads
For long date ranges, the included examples/orchestrator.py script runs gharc month by month so each year produces one Parquet file per month rather than one giant output:
python examples/orchestrator.py \
--start 2023-01-01 \
--end 2024-01-01 \
--repos "apache/spark,pandas-dev/pandas" \
--output-dir ./gharc_out \
--workers 4
Repository Layout
gharc/
├── src/gharc/ # Library + CLI entry point
├── tests/ # pytest test suite
├── benchmarks/ # Reproducible runs that back the performance claims
├── examples/ # Driver scripts (e.g. month-by-month orchestrator)
├── paper/ # paper.md, paper.bib, figures (the JOSS submission)
└── CITATION.cff # GitHub-detectable citation metadata
Contributing
Contributions are welcome. Please read CONTRIBUTING.md for details on the process for submitting pull requests.
Running Tests:
pip install -e ".[test]"
pytest tests/
Citation
The accompanying paper is at paper/paper.pdf and is rebuilt automatically on every push by the Paper CI workflow.
If you use gharc in your research, please cite it using the metadata in CITATION.cff or as follows:
@software{gharc2026,
author = {Panwar, Arav},
title = {gharc: A stream-and-filter tool for the GitHub Archive on consumer hardware},
year = {2026},
url = {https://github.com/aravpanwar/gharc}
}
License
This project is licensed under the MIT License - see the LICENSE file for details.
Created by Arav Panwar (aravpanwar.com).