Mine and extract complete package lists from RubyGems registry

Ruby Gems Miner

A Python tool to mine and extract complete package lists from the RubyGems registry.

Features

  • Lists all ~180,000 gems published on RubyGems.org
  • Fetches package metadata including homepage and repository URLs
  • Rate-limited API calls to respect server resources (10 req/sec)
  • Progress tracking with visual feedback
  • Outputs standardized CSV format for cross-ecosystem analysis

Installation

pip install ruby-miner

Quick Start

ruby-miner

Or use as a Python module:

from ruby_miner import mine_ruby
mine_ruby()

Output

Generates a CSV file with gem information:

  • Package ID, Platform, Name
  • Homepage URL, Repository URL

Performance

  • Runtime: ~5 hours for complete dataset
  • Rate limit: 10 requests per second
  • Processes ~180,000 gems
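
The runtime figure follows directly from the rate limit. A small worked calculation (the function name is illustrative) shows the lower bound, assuming each request costs only the rate-limit delay:

```python
def estimated_runtime_hours(gem_count, requests_per_second):
    """Lower bound on runtime: total requests divided by the request rate."""
    seconds = gem_count / requests_per_second
    return seconds / 3600


# 180,000 gems at 10 requests per second
print(estimated_runtime_hours(180_000, 10))  # 5.0
```

Network latency for each request adds to this, so actual runs are likely to take longer.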

Data Source

License

MIT License - see LICENSE file for details

Gem Metadata Sources

For each gem, the script fetches:

{
  "name": "rails",
  "homepage_uri": "https://rubyonrails.org",
  "source_code_uri": "https://github.com/rails/rails",
  "project_uri": "https://rubygems.org/gems/rails"
}

The script prioritizes:

  1. Homepage: homepage_uri → project_uri → "nan"
  2. Repository: source_code_uri → homepage_uri → "nan"

Error Handling

If an API call fails (timeout, 404, etc.):

  • Continues processing with "nan" values
  • Logs no error (fails silently)
  • Ensures complete dataset even with some missing data
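
The behavior above can be sketched as follows. This is not the tool's actual code: `fetch_urls` is a hypothetical helper, and `fetch_json` stands in for whatever function performs the API call, so the example runs without network access.

```python
NAN = "nan"


def fetch_urls(gem_name, fetch_json):
    """Return (homepage, repository) for a gem, or "nan" values on any failure."""
    try:
        info = fetch_json(gem_name)
    except Exception:
        # No logging here, mirroring the silent failure described above.
        return NAN, NAN
    homepage = info.get("homepage_uri") or info.get("project_uri") or NAN
    repo = info.get("source_code_uri") or info.get("homepage_uri") or NAN
    return homepage, repo


def failing_fetch(gem_name):
    """Simulates a timed-out API call."""
    raise TimeoutError("simulated timeout")


print(fetch_urls("rails", failing_fetch))  # ('nan', 'nan')
```

Because every gem yields a row either way, the output keeps one line per gem name even when some lookups fail.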

Files

  • mine_ruby.py: Main script
  • requirements.txt: Python dependencies (requests, tqdm)
  • setup.sh: Automated setup script
  • specs.4.8.gz: Temporary download file (deleted after use)
  • specs.4.8: Temporary decompressed file (deleted after use)
  • Output: ../../../Resource/Package/Package-List/Ruby_New.csv

Troubleshooting

"Error downloading gem names"

Check that:

  • You have internet connectivity
  • rubygems.org is accessible: curl -I http://rubygems.org/names
  • No firewall blocking the connection

Script is very slow

This is expected behavior:

  • Rate limiting (10 requests/second) is intentional
  • With 180K+ gems, expect 5+ hours runtime
  • Consider running overnight or in background

To run in background:

nohup python mine_ruby.py > output.log 2>&1 &

"Permission denied" when creating output directory

Ensure you have write permissions to:

  • Current directory (for temporary files)
  • Resource/Package/Package-List/ (for output)

Incomplete data (many "nan" values)

This can occur if:

  • API is temporarily unavailable
  • Network issues during processing
  • Some gems have incomplete metadata

Note: This is normal - not all gems have complete metadata on RubyGems.org.

Virtual environment issues

If you encounter errors related to the virtual environment:

  1. Delete the venv folder: rm -rf venv
  2. Re-run the setup script: ./setup.sh
  3. Virtual environments cannot be moved after creation - recreate if you move the directory

Performance Notes

  • Download Time: Fast (gem names list is small)
  • Processing Time: SLOW (~5 hours for 180K+ gems)
  • Memory Usage: Low (processes one gem at a time)
  • Network Usage: Moderate (many small API requests)

Optimization Tips

To speed up processing (advanced users):

  1. Reduce delay in time.sleep(0.1) (risks being rate-limited or blocked)
  2. Use parallel requests (requires code modification)
  3. Use RubyGems database dump instead of API (requires parsing Marshal format)
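
As an illustration of tip 2, parallel requests could be sketched with the standard library's thread pool. This is an assumption about how such a modification might look, not part of the tool; `fetch_all` and `fetch_one` are hypothetical names, and the sketch applies no rate limiting of its own.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(gem_names, fetch_one, max_workers=4):
    """Call fetch_one for every gem name using a pool of worker threads.

    Results come back in the same order as gem_names.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, gem_names))


# Stand-in for a real metadata lookup, so the example runs offline.
print(fetch_all(["rails", "devise", "rake"], str.upper, max_workers=2))
```

With real API calls, `fetch_one` would still need to respect the server's limits, for example by keeping `max_workers` small.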

Advantages

  • Complete Data: Includes all public gems
  • Official API: Uses RubyGems.org official endpoints
  • Detailed Metadata: Gets homepage and repository URLs
  • Reliable: Gracefully handles API failures

Limitations

  • Slow Processing: Rate limiting means long runtime
  • API Dependent: Requires RubyGems.org to be available
  • Incomplete Metadata: Not all gems have homepage/repository info
  • No Version Info: Only captures latest gem information

Alternative Approaches

For faster processing, consider:

  1. Database Dump: RubyGems provides database dumps (requires PostgreSQL)
  2. Bulk API: Some bulk endpoints may exist (check RubyGems API docs)
  3. Cached Data: Use previously downloaded data and update incrementally
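
For the cached-data approach, one possible sketch is to skip gems that already appear in a previous output file. The `names_to_fetch` helper and the "Name" column header are assumptions made for illustration.

```python
import csv
import io


def names_to_fetch(all_names, cached_csv, name_column="Name"):
    """Return the gem names that are not yet present in a previously written CSV."""
    seen = {row[name_column] for row in csv.DictReader(cached_csv)}
    return [name for name in all_names if name not in seen]


# In-memory stand-in for a CSV produced by an earlier run.
cached = io.StringIO(
    "Package ID,Platform,Name,Homepage URL,Repository URL\n"
    "1,Ruby,rails,https://rubyonrails.org,https://github.com/rails/rails\n"
)
print(names_to_fetch(["rails", "devise", "rake"], cached))  # ['devise', 'rake']
```

This only finds new gems; metadata changes for gems already in the cache would go unnoticed.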

Code Explanation

Architecture

The Ruby Miner uses a two-phase approach:

  1. Phase 1: Download complete list of gem names
  2. Phase 2: Fetch detailed metadata for each gem

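This is necessary because, unlike crates.io, RubyGems doesn't offer a single complete dump that is simple to parse: its database dumps require PostgreSQL, and the specs file uses Ruby's Marshal format.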

1. Specs File Download

specs_url = "https://rubygems.org/specs.4.8.gz"
download_file(specs_url, specs_gz_path)

Purpose: Downloads compact specs (currently not parsed, but available for future use).

Note: The specs file is in Ruby Marshal format (binary), which is complex to parse in Python. Currently, we use the simpler names endpoint instead.
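
Handling the temporary specs files could look roughly like the sketch below, which decompresses the download and then deletes both files. The helper names are hypothetical, and the Marshal payload itself is left unparsed.

```python
import gzip
import os
import shutil


def decompress_specs(gz_path, out_path):
    """Decompress gz_path into out_path in chunks, without loading it all into memory."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def cleanup(*paths):
    """Delete the temporary files that exist; missing paths are ignored."""
    for path in paths:
        if os.path.exists(path):
            os.remove(path)
```

A typical call sequence would be `decompress_specs("specs.4.8.gz", "specs.4.8")` followed by `cleanup("specs.4.8.gz", "specs.4.8")` once the data is no longer needed.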

2. Gem Names List

names_url = "http://rubygems.org/names"
response = requests.get(names_url)  # assumed: fetched with requests, as listed in requirements.txt
gem_names = response.text.strip().split('\n')

Format: Simple newline-delimited text file.

rails
devise
rake
...

Advantages:

  • Simple to parse
  • Complete list of all gems
  • Fast download
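
Parsing this format takes only a few lines. The sketch below uses a hypothetical `parse_names` helper that also drops blank lines and surrounding whitespace, which the one-line split above does not do.

```python
def parse_names(text):
    """Split newline-delimited text into gem names, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


sample = "rails\ndevise\nrake\n"
print(parse_names(sample))  # ['rails', 'devise', 'rake']
```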

3. Detailed Metadata Fetching

for gem_name in gem_names:
    time.sleep(0.1)  # Rate limiting
    response = requests.get(f"https://rubygems.org/api/v1/gems/{gem_name}.json")

Process:

  1. Iterate through each gem name
  2. Wait 0.1 seconds (rate limiting)
  3. Fetch JSON metadata
  4. Extract homepage and repository URLs
  5. Write to CSV immediately (streaming write)
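
The five steps can be sketched as one loop. This is an illustration under stated assumptions, not the script itself: `mine`, `fetch_json`, and the injectable `sleep` parameter are hypothetical names, the "Ruby" platform value is assumed, and the URL extraction is simplified to a single field each.

```python
import csv
import io
import time


def mine(gem_names, fetch_json, out, delay=0.1, sleep=time.sleep):
    """Fetch metadata for each gem and write one CSV row per gem as it arrives."""
    writer = csv.writer(out)
    for package_id, name in enumerate(gem_names, start=1):  # step 1: iterate
        sleep(delay)                                         # step 2: rate limiting
        info = fetch_json(name)                              # step 3: fetch metadata
        homepage = info.get("homepage_uri") or "nan"         # step 4: extract URLs
        repo = info.get("source_code_uri") or "nan"
        writer.writerow([package_id, "Ruby", name, homepage, repo])  # step 5: write
        out.flush()  # push each row out right away instead of buffering


# Offline demonstration: fake metadata and no real waiting.
fake = {"rails": {"homepage_uri": "https://rubyonrails.org"}}
buffer = io.StringIO()
mine(["rails", "rake"], lambda n: fake.get(n, {}), buffer, sleep=lambda s: None)
print(buffer.getvalue())
```

Writing and flushing each row immediately keeps memory use low and preserves partial results if the run is interrupted.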

4. URL Extraction Priority

homepage_url = gem_info.get('homepage_uri', '') or gem_info.get('project_uri', '') or "nan"
repo_url = gem_info.get('source_code_uri', '') or gem_info.get('homepage_uri', '') or "nan"

Fallback Chain:

  • Homepage: Try homepage_uri first, fall back to project_uri
  • Repository: Try source_code_uri first, fall back to homepage_uri

Validation:

if homepage_url and not homepage_url.startswith('http'):
    homepage_url = "nan"

Keeps only values that start with "http"; anything else is replaced with "nan".
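
To show what this check accepts and rejects, the sketch below wraps it in a function and compares it with a stricter variant. Both function names are hypothetical, and the stricter variant is a suggestion rather than something the tool does.

```python
def keep_http_url(value):
    """Same check as above: a non-empty value not starting with 'http' becomes "nan"."""
    if value and not value.startswith("http"):
        return "nan"
    return value


def keep_http_url_strict(value):
    """Hypothetical stricter variant: require an explicit http:// or https:// scheme."""
    if value.startswith(("http://", "https://")):
        return value
    return "nan"


for candidate in ["https://rubyonrails.org", "ftp://example.org", "httpfoo"]:
    print(candidate, keep_http_url(candidate), keep_http_url_strict(candidate))
```

The prefix test is loose: a value such as "httpfoo" passes the original check but is rejected by the stricter one.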

