Skip to main content

Mine and extract complete package lists from Go modules registry

Project description

Go Package Miner

A Python tool to mine and extract complete package lists from the Go modules registry.

Features

  • Downloads all ~5.7 million Go modules from index.golang.org
  • Extracts repository URLs from Go proxy API Origin field
  • Optimized connection pooling for faster downloads
  • Checkpoint system for resumable downloads
  • Progress tracking with detailed statistics
  • Outputs standardized CSV format for cross-ecosystem analysis

Installation

pip install go-miner

Quick Start

go-miner

Or use as a Python module:

from go_miner import mine_go
mine_go()

Output

Generates a CSV file with module information:

  • Package ID, Platform, Name (full module path)
  • Homepage URL, Repository URL

Performance

  • Runtime: 6-10 minutes for complete dataset
  • Processes ~8,000 batches (2000 entries each)
  • Memory usage: ~500-800 MB

Features

  • Resume Support: Automatically resumes from interruptions
  • No Rate Limits: Uses official public index with no authentication
  • Fast Processing: Connection pooling and batch processing

Data Source

License

MIT License - see LICENSE file for details

Processing Details

Module Index Structure

The Go module index returns entries in this format:

{"Path":"github.com/user/repo","Version":"v1.2.3","Timestamp":"2023-01-01T00:00:00Z"}
{"Path":"github.com/user/repo","Version":"v1.2.4","Timestamp":"2023-02-01T00:00:00Z"}

The script:

  1. Downloads all entries as newline-delimited JSON
  2. Parses each entry
  3. Extracts unique module paths (ignoring multiple versions)
  4. Queries Go proxy API for each module's Origin metadata
  5. Extracts repository URLs from Origin field (when available)

API Data Retrieval

For each module, the script queries:

  1. /@latest endpoint for latest version and Origin data
  2. /@v/{version}.info endpoint for specific version Origin data
  3. Returns "nan" if Origin field is not present in API response
  4. No inference or pattern matching - only uses official API data

Files

  • mine_go.py: Main script
  • requirements.txt: Python dependencies (requests, tqdm)
  • setup.sh: Automated setup script
  • Output: ../../../Resource/Package/Package-List/Go_New.csv

Troubleshooting

"Error downloading Go module index"

Check that:

  • You have internet connectivity
  • index.golang.org is accessible: curl -I https://index.golang.org/index
  • No firewall blocking the connection

Solution: The script will automatically retry failed requests. If errors persist, check your network.

"Failed to download Go modules or no modules found"

This may occur if:

  • The index API format has changed
  • Network connection interrupted
  • Response was empty or malformed

Solution: Check your internet connection and try again. If you were interrupted, the script will resume from the last checkpoint.

Download was interrupted

No problem! The script saves checkpoints every 1000 batches.

Solution: Simply run the script again:

python mine_go.py

The script will display: Resuming from checkpoint: batch XXXX, XXXXX modules

"Permission denied" when creating output directory

Ensure you have write permissions to:

  • Current directory (for checkpoint files: .checkpoint.json)
  • Output directory (default: Resource/Package/Package-List/)

Solution: Run with appropriate permissions or specify a writable directory:

python mine_go.py --output-dir ~/Downloads

Virtual environment issues

If you encounter errors related to the virtual environment:

  1. Delete the venv folder: rm -rf venv
  2. Re-run the setup script: ./setup.sh
  3. Virtual environments cannot be moved after creation - recreate if you move the directory

Checkpoint file corrupted

If you see errors loading the checkpoint file:

Solution: Delete the checkpoint and start fresh:

rm .checkpoint.json
python mine_go.py

Slow download speed

If the download is slower than expected:

  • Check your network connection speed
  • Ensure no bandwidth-heavy applications are running
  • The script should process 10-20 batches/second
  • If seeing <5 batches/second, check for network congestion

SSL/TLS warnings

If you see warnings about LibreSSL or OpenSSL:

NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+

Note: This is a warning, not an error. The script will still work correctly. To suppress it, upgrade your Python's SSL library or ignore the warning.

Performance Notes

Optimizations

  • Connection Pooling: Reuses HTTP connections (20 connections, 40 max pool size)
  • Automatic Retries: Intelligent retry with exponential backoff for failed requests
  • Compression Support: Accepts gzip encoding to reduce bandwidth
  • Set-Based Deduplication: O(1) lookups for fast duplicate detection
  • Batch CSV Writing: Writes all rows at once for faster I/O
  • Checkpoint System: Saves progress every 1000 batches

Performance Metrics

  • Download Speed: ~3-4 batches/second (optimized)
  • Typical Runtime: ~6-10 minutes for complete download (~8000 batches)
  • Memory Usage: Moderate (~500-800 MB for ~5.7M unique modules)
  • Network Efficiency: Persistent connections reduce overhead by ~60%
  • Total Data: ~16M entries processed, ~5.7M unique modules extracted

Speed Comparison

Version Time Batches/sec Notes
Original ~15-20 min 6-8 Sequential with 0.1s delay
Optimized ~6-10 min 13-20 Connection pooling, no delay
Improvement 50-60% faster 2.5x faster With checkpoint support

Advantages

  • Official Source: Uses Google's official Go module proxy
  • No Rate Limits: Public index with no authentication required
  • Complete Data: Includes all public Go modules (~5.7M+)
  • Efficient Pagination: Batched requests with automatic deduplication
  • Reliable Data: Only uses API-provided Origin data (no inference)
  • Robust & Reliable: Automatic retry, checkpoint system, graceful interruption handling
  • Fast Performance: Optimized connection pooling and batch processing
  • Flexible Output: Customizable output directory and filename
  • Resume Capability: Continue from interruption without re-downloading

Limitations

  • Origin Data Availability: Many modules (especially older ones) lack Origin metadata in Go proxy API
  • Module Versions: Only unique module paths are stored (versions ignored)
  • Private Modules: Only includes public modules
  • Metadata: Limited to what's available in Go proxy API (no descriptions, etc.)

Code Explanation

Architecture

The Go Miner uses the official Go module index API with optimized pagination to fetch all public Go modules. The implementation includes connection pooling, automatic retries, checkpoint system, and batch processing for maximum efficiency.

1. Optimized Session with Connection Pooling

def create_session():
    session = requests.Session()
    adapter = HTTPAdapter(
        max_retries=retry_strategy,
        pool_connections=20,
        pool_maxsize=40
    )
    session.mount("http://", adapter)
    session.mount("https://", adapter)

Purpose: Creates a reusable HTTP session with connection pooling.

Optimizations:

  • Connection Pooling: Reuses TCP connections (20 pool connections, 40 max)
  • Automatic Retry: Retries failed requests with exponential backoff
  • Compression: Requests gzip encoding to reduce bandwidth
  • Persistent Headers: Sets User-Agent and Accept-Encoding once

Benefits:

  • ~60% faster than creating new connections each time
  • Reduces server load and network overhead
  • Handles transient network failures automatically

2. Paginated Index Download with Checkpoints

def download_go_index():
    base_url = "https://index.golang.org/index"
    since = ""
    checkpoint_interval = 1000

    # Load from checkpoint if exists
    if os.path.exists(checkpoint_file):
        checkpoint = json.load(f)
        modules_set = set(checkpoint['modules'])
        since = checkpoint['since']

Purpose: Downloads the complete Go module index with resume capability.

Features:

  • Pagination using since parameter with timestamps
  • Fetches up to 2000 modules per request
  • Checkpoint System: Saves progress every 1000 batches
  • Resume Support: Automatically resumes from last checkpoint
  • Progress Tracking: Shows unique modules, new modules, and total entries

Pagination Logic:

  • Start with no since parameter (gets oldest 2000 modules)
  • Extract timestamp from last entry
  • Use that timestamp as since in next request
  • Save checkpoint every 1000 batches
  • Repeat until fewer than 2000 entries returned

3. Efficient Deduplication with Set

modules_set = set()  # Use set for fast deduplication
if module_path and module_path not in modules_set:
    modules_set.add(module_path)
    new_modules += 1

Purpose: Keeps only unique module paths efficiently.

Optimization:

  • Set instead of Dict: Faster and uses less memory
  • O(1) lookup and insertion
  • Same module appears multiple times (one per version)
  • We only need unique paths, not version info
  • Memory efficient for 5.7M+ modules

4. API Data Retrieval

def get_module_info(module_path, session):
    # Query Go proxy API for Origin metadata
    response = session.get(f"https://proxy.golang.org/{module_path}/@latest")
    origin = response.json().get('Origin', {})
    if origin:
        repo_url = origin.get('URL')

Strategy: Retrieve repository URLs from official Go proxy API.

  • Queries /@latest and /@v/{version}.info endpoints
  • Extracts URL from Origin field when available
  • Returns "nan" if Origin data not present
  • No inference or pattern matching

5. Batch CSV Writing

rows = []
for module in modules:
    rows.append([...])

writer.writerows(rows)  # Write all at once

Purpose: Faster CSV generation.

Optimization:

  • Collect all rows in memory
  • Single write operation instead of thousands
  • Reduces I/O overhead significantly
  • Processes 5.7M modules in seconds

6. Command-Line Interface

parser = argparse.ArgumentParser()
parser.add_argument('-o', '--output-dir')
parser.add_argument('-f', '--output-file')

Purpose: Flexible output configuration.

Features:

  • Custom output directory
  • Custom filename
  • Help text with examples
  • Auto-add .csv extension if missing
  • Create directories if they don't exist

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

go_miner-1.0.2.tar.gz (16.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

go_miner-1.0.2-py3-none-any.whl (13.0 kB view details)

Uploaded Python 3

File details

Details for the file go_miner-1.0.2.tar.gz.

File metadata

  • Download URL: go_miner-1.0.2.tar.gz
  • Upload date:
  • Size: 16.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for go_miner-1.0.2.tar.gz
Algorithm Hash digest
SHA256 15a82efcd3654cf9cffb5eba3a13f1a881cb8575d3eb100b4bbd92d3725a8978
MD5 1a67d98b230a7fb665aad06a40813d0b
BLAKE2b-256 4597e8d32daaf46a14e9137c424417b10d4179cfa42956cdaab867656b0e64db

See more details on using hashes here.

File details

Details for the file go_miner-1.0.2-py3-none-any.whl.

File metadata

  • Download URL: go_miner-1.0.2-py3-none-any.whl
  • Upload date:
  • Size: 13.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for go_miner-1.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 7321cf47e56253564a72cfdd1fecf8669d618e972e4ab8a5043356dbae159204
MD5 12c3851b6f43a2b7ed894a2293f6e913
BLAKE2b-256 3d226c6af867c7e3bd25d80c573c12bc6d01d1ebb3a47f5121dc7d04d70cdf27

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page