A tool for mining GitHub repository metadata and activity data

Git Miner

Search GitHub repositories and export their data to CSV, JSON, or Parquet.

Installation

pip install git-miner

Quick Start

Search and export repositories:

git-miner search "python web framework" --language python --min-stars 1000

Add your GitHub token for higher rate limits:

Option 1: Using the auth command (recommended)

git-miner auth add --token your_token

Your token is stored locally in ~/.cache/git-miner/tokens.db and is used automatically on subsequent runs.

Option 2: Environment variable

export GITHUB_TOKEN=your_token
git-miner search "data science" --format json

Option 3: Command line flag

git-miner search "data science" --token your_token

Manage cached tokens

# List stored tokens
git-miner auth list

# Remove a token
git-miner auth remove

# Show a specific token
git-miner auth show --name default

Use Cases

  • Research: Study open-source trends, language adoption, repository patterns
  • Data Science: Build datasets for ML model training on code repositories
  • Academic: Analyze collaboration patterns, project lifecycles
  • Analytics: Track repository growth, contributor engagement
  • Tooling: Find repositories matching specific criteria for automation
  • Market Research: Identify popular libraries and frameworks

Commands

Auth

Manage GitHub tokens locally:

git-miner auth list|add|remove|show [OPTIONS]

Actions:

  • list: Show all cached tokens
  • add: Store a new token
  • remove: Delete a token
  • show: Display a token value

Options:

  • --name, -n: Token name (default: "default")
  • --token, -t: Token value (required for add)

Search

git-miner search QUERY [OPTIONS]

Filters:

  • --language, -l: Programming language
  • --min-stars / --max-stars: Star count range
  • --min-forks / --max-forks: Fork count range
  • --license: License type (e.g., mit, apache-2.0)
  • --topics: Topics to include (comma-separated)
  • --fork / --no-fork: Include or exclude forks
  • --archived / --no-archived: Include or exclude archived repos
  • --sort: Sort by stars, forks, or updated
  • --max-results: Limit number of results

Saved Searches:

  • --save, -s <name>: Save search with given name
  • --force, -f: Overwrite existing saved search

Output:

  • --format, -f: csv, json, or parquet
  • --output-dir, -o: Output directory

Examples:

# Python repos with 1000+ stars
git-miner search "web framework" --language python --min-stars 1000

# MIT-licensed repos
git-miner search "data science" --license mit

# Exclude forks and archived
git-miner search "api" --no-fork --no-archived

# Top 50 by stars, export to Parquet
git-miner search "machine learning" --sort stars --max-results 50 --format parquet
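For readers curious how filters like these relate to GitHub's own search syntax, here is a minimal sketch that turns similar options into GitHub search qualifiers (language:, stars:, forks:, license:, topic:, archived:). The function names and parameters are hypothetical, and this is not taken from git-miner's source; it only illustrates one plausible mapping.

```python
def range_qualifier(name, low=None, high=None):
    """Render a numeric range in GitHub search syntax, or None if unbounded."""
    if low is not None and high is not None:
        return f"{name}:{low}..{high}"
    if low is not None:
        return f"{name}:>={low}"
    if high is not None:
        return f"{name}:<={high}"
    return None


def build_query(text, language=None, min_stars=None, max_stars=None,
                min_forks=None, max_forks=None, license_key=None,
                topics=(), archived=None):
    """Combine free text and filters into one GitHub search string.

    Hypothetical helper: parameter names mirror the CLI flags above,
    but the mapping itself is an assumption.
    """
    parts = [text]
    if language:
        parts.append(f"language:{language}")
    for qualifier in (range_qualifier("stars", min_stars, max_stars),
                      range_qualifier("forks", min_forks, max_forks)):
        if qualifier:
            parts.append(qualifier)
    if license_key:
        parts.append(f"license:{license_key}")
    parts.extend(f"topic:{topic}" for topic in topics)
    if archived is not None:
        parts.append(f"archived:{'true' if archived else 'false'}")
    return " ".join(parts)


print(build_query("web framework", language="python", min_stars=1000))
# prints: web framework language:python stars:>=1000
```

A string built this way can be pasted into GitHub's search box to sanity-check a filter combination before running a large export.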

Extract

Get detailed stats for repositories:

git-miner extract owner/repo

Options:

  • --activity / --no-activity: Include commit/issue/PR stats
  • --contributors / --no-contributors: Include contributor stats

Searches

Manage saved search queries:

git-miner searches list|run|delete|show [OPTIONS]

Actions:

  • list: Show all saved searches
  • run <name>: Execute a saved search
  • delete <name>: Delete a saved search
  • show <name>: Show details of a saved search

Examples:

# Save a search with a name
git-miner search "machine learning" --language python --min-stars 1000 --save ml-repos

# List all saved searches
git-miner searches list

# Run a saved search
git-miner searches run ml-repos

# Show search details
git-miner searches show ml-repos

# Delete a saved search
git-miner searches delete ml-repos

Saved searches are stored in ~/.cache/git-miner/state.db and include the query string along with all filter options (language, stars, forks, license, topics, etc.).
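The schema of state.db is not documented here. Assuming it is an ordinary SQLite file (an assumption based on the SQLite cache mentioned for tokens), the sketch below lists its tables without guessing their structure. The helper name list_tables is hypothetical, and the file is opened read-only so the inspection cannot alter saved searches.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def list_tables(db_path):
    """Return the table names found in a SQLite database file."""
    # Build a file: URI and request read-only mode so nothing is modified.
    uri = Path(db_path).expanduser().resolve().as_uri() + "?mode=ro"
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


# Usage (path taken from the text above; the file must already exist):
# print(list_tables("~/.cache/git-miner/state.db"))
```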

Output Formats

Format    Best For
CSV       Excel, traditional tools
JSON      Web apps, APIs
Parquet   Big data, analytics

Configuration

Create gitminer.toml:

[output]
dir = "./datasets"
format = "parquet"

[api]
max_retries = 3
timeout = 30.0

Use it:

git-miner --config gitminer.toml search "web framework"

Note: GitHub tokens are now managed via the auth command and stored in a local SQLite cache.

Rate Limits

  • No token: 60 requests/hour
  • With token: 5,000 requests/hour

The tool respects rate limits automatically.
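How git-miner handles this internally is not described here. As a rough illustration of what respecting the limit usually involves, GitHub's REST API reports the remaining quota and the reset time (UTC epoch seconds) in the x-ratelimit-remaining and x-ratelimit-reset response headers. The sketch below computes how long a client would need to wait; the function name is hypothetical.

```python
import time


def seconds_until_allowed(headers, now=None):
    """Return how long to wait before the next request (0.0 if quota remains)."""
    now = time.time() if now is None else now
    # Header names are case-insensitive, so normalise them first.
    normalized = {key.lower(): value for key, value in headers.items()}
    remaining = int(normalized.get("x-ratelimit-remaining", 1))
    reset_at = int(normalized.get("x-ratelimit-reset", 0))
    if remaining > 0:
        return 0.0
    return max(0.0, float(reset_at - now))


# Quota exhausted, reset 60 seconds after the given "now":
exhausted = {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1700000060"}
print(seconds_until_allowed(exhausted, now=1700000000))
# prints: 60.0
```

At 60 requests/hour an unauthenticated client averages one request per minute, while 5,000 requests/hour works out to roughly one request every 0.72 seconds.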

Dataset Fields

Repository Metadata:

  • repository_id, name, owner, description
  • primary_language, stars, forks, open_issues
  • license, created_at, updated_at, pushed_at
  • size_kb, url, is_fork, is_archived, topics

Activity Statistics:

  • commit_total, commit_additions, commit_deletions
  • issues_open, issues_closed, prs_open, prs_closed, prs_merged

Coming Soon

  • Direct code extraction from repositories into datasets
  • GraphQL API support
  • Incremental dataset updates
  • Pre-built public datasets
  • Plugin-based extractor system
  • Cloud storage outputs (S3, GCS)

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Lint and format
make lint
make format

License

Apache License 2.0
