A tool for mining GitHub repository metadata and activity data
Git Miner
Search GitHub repositories and export their data to CSV, JSON, or Parquet.
Installation
pip install git-miner
Quick Start
Search and export repositories:
git-miner search "python web framework" --language python --min-stars 1000
Add your GitHub token for higher rate limits:
export GITHUB_TOKEN=your_token
git-miner search "data science" --format json
Use Cases
- Research: Study open-source trends, language adoption, repository patterns
- Data Science: Build datasets for ML model training on code repositories
- Academic: Analyze collaboration patterns, project lifecycles
- Analytics: Track repository growth, contributor engagement
- Tooling: Find repositories matching specific criteria for automation
- Market Research: Identify popular libraries and frameworks
Commands
Search
git-miner search QUERY [OPTIONS]
Filters:
- --language, -l: Programming language
- --min-stars / --max-stars: Star count range
- --min-forks / --max-forks: Fork count range
- --license: License type (e.g., mit, apache-2.0)
- --topics: Topics to include (comma-separated)
- --fork / --no-fork: Include or exclude forks
- --archived / --no-archived: Include or exclude archived repos
- --sort: Sort by stars, forks, or updated
- --max-results: Limit number of results
Output:
- --format, -f: csv, json, or parquet
- --output-dir, -o: Output directory
Examples:
# Python repos with 1000+ stars
git-miner search "web framework" --language python --min-stars 1000
# MIT-licensed repos
git-miner search "data science" --license mit
# Exclude forks and archived
git-miner search "api" --no-fork --no-archived
# Top 50 by stars, export to Parquet
git-miner search "machine learning" --sort stars --max-results 50 --format parquet
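For readers curious how filters like these could translate into a GitHub search query, here is a minimal sketch. It assumes the flags map onto GitHub's documented search qualifiers (`language:`, `stars:`, `license:`, `topic:`, `archived:`). `build_query` and `range_qualifier` are hypothetical helpers written for illustration, not part of git-miner, and the tool's actual mapping may differ.

```python
from typing import Optional, Sequence


def range_qualifier(name: str, lo: Optional[int], hi: Optional[int]) -> Optional[str]:
    """Render a numeric range in GitHub search syntax, e.g. stars:>=1000."""
    if lo is not None and hi is not None:
        return f"{name}:{lo}..{hi}"
    if lo is not None:
        return f"{name}:>={lo}"
    if hi is not None:
        return f"{name}:<={hi}"
    return None


def build_query(
    text: str,
    language: Optional[str] = None,
    min_stars: Optional[int] = None,
    max_stars: Optional[int] = None,
    license: Optional[str] = None,
    topics: Sequence[str] = (),
    archived: Optional[bool] = None,
) -> str:
    """Hypothetical helper: join free text and qualifiers into one query string."""
    parts = [text]
    if language:
        parts.append(f"language:{language}")
    stars = range_qualifier("stars", min_stars, max_stars)
    if stars:
        parts.append(stars)
    if license:
        parts.append(f"license:{license}")
    parts.extend(f"topic:{topic}" for topic in topics)
    if archived is not None:
        # GitHub expects lowercase true/false here.
        parts.append(f"archived:{str(archived).lower()}")
    return " ".join(parts)


print(build_query("web framework", language="python", min_stars=1000))
# prints: web framework language:python stars:>=1000
```

Keeping the range logic in its own function means the same code could serve a forks range (`forks:10..50`) without duplication.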
Extract
Get detailed stats for repositories:
git-miner extract owner/repo
Options:
- --activity / --no-activity: Include commit/issue/PR stats
- --contributors / --no-contributors: Include contributor stats
Output Formats
| Format | Best For |
|---|---|
| CSV | Excel, traditional tools |
| JSON | Web apps, APIs |
| Parquet | Big data, analytics |
Configuration
Create gitminer.toml:
[github]
token = "your_token"
[output]
dir = "./datasets"
format = "parquet"
[api]
max_retries = 3
timeout = 30.0
Use it:
git-miner --config gitminer.toml search "web framework"
Rate Limits
- No token: 60 requests/hour
- With token: 5,000 requests/hour
The tool respects rate limits automatically.
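GitHub reports the remaining quota on REST responses through the `X-RateLimit-Remaining` and `X-RateLimit-Reset` headers. One common way for a client to respect the limit is to pause until the reset time once the quota is used up. The sketch below illustrates that idea with a hypothetical `seconds_until_reset` helper. It is not taken from git-miner's source, which may handle limits differently.

```python
import time
from typing import Mapping, Optional


def seconds_until_reset(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return how long to wait before the next request; 0.0 if quota remains.

    Hypothetical helper. Expects the header names exactly as written below;
    HTTP libraries often provide a case-insensitive mapping instead.
    """
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    now = time.time() if now is None else now
    # X-RateLimit-Reset is a UTC epoch timestamp in seconds.
    reset_at = float(headers.get("X-RateLimit-Reset", "0"))
    return max(0.0, reset_at - now)


exhausted = {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1700000060"}
print(seconds_until_reset(exhausted, now=1700000000.0))  # prints: 60.0

healthy = {"X-RateLimit-Remaining": "42", "X-RateLimit-Reset": "1700000060"}
print(seconds_until_reset(healthy, now=1700000000.0))  # prints: 0.0
```

Passing `now` explicitly keeps the function deterministic and easy to test; a caller would normally omit it and `time.sleep()` for the returned duration.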
Dataset Fields
Repository Metadata:
- repository_id, name, owner, description
- primary_language, stars, forks, open_issues
- license, created_at, updated_at, pushed_at
- size_kb, url, is_fork, is_archived, topics
Activity Statistics:
- commit_total, commit_additions, commit_deletions
- issues_open, issues_closed, prs_open, prs_closed, prs_merged
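As an illustration of working with these fields, the sketch below computes a pull-request merge rate per repository from CSV data using only the standard library. The sample rows are made up. It also rests on two assumptions that should be checked against a real export: that the CSV header uses the field names listed above, and that `prs_closed` counts pull requests closed without merging.

```python
import csv
import io
from typing import Mapping, Optional

# Made-up sample rows; a real export would be read with open(path, newline="").
SAMPLE_CSV = """name,owner,stars,prs_open,prs_closed,prs_merged
alpha,example-org,1200,5,20,60
beta,example-org,300,0,0,0
"""


def merge_rate(row: Mapping[str, str]) -> Optional[float]:
    """Fraction of resolved PRs that were merged, or None if none were resolved.

    Assumes prs_closed excludes merged PRs (an assumption, see above).
    """
    merged = int(row["prs_merged"])
    closed = int(row["prs_closed"])
    resolved = merged + closed
    return merged / resolved if resolved else None


for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    rate = merge_rate(row)
    label = "n/a" if rate is None else f"{rate:.0%}"
    print(f'{row["owner"]}/{row["name"]}: {label}')
# prints:
# example-org/alpha: 75%
# example-org/beta: n/a
```

Returning `None` for repositories with no resolved pull requests avoids a division by zero and keeps "no data" distinct from a 0% merge rate.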
Coming Soon
- Direct code extraction from repositories into datasets
- GraphQL API support
- Incremental dataset updates
- Pre-built public datasets
- Plugin-based extractor system
- Cloud storage outputs (S3, GCS)
Development
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Lint and format
make lint
make format
License
Apache License 2.0