Mine and extract complete package lists from RubyGems registry
Ruby Gems Miner
A Python tool to mine and extract complete package lists from the RubyGems registry.
Features
- Lists all ~180,000 gems published on RubyGems.org (names and metadata, not the gem files themselves)
- Fetches package metadata including homepage and repository URLs
- Rate-limited API calls to respect server resources (10 req/sec)
- Progress tracking with visual feedback
- Outputs standardized CSV format for cross-ecosystem analysis
Installation
pip install ruby-miner
Quick Start
ruby-miner
Or use as a Python module:
from ruby_miner import mine_ruby
mine_ruby()
Output
Generates a CSV file with gem information:
- Package ID, Platform, Name
- Homepage URL, Repository URL
Performance
- Runtime: ~5 hours for complete dataset
- Rate limit: 10 requests per second
- Processes ~180,000 gems
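The runtime figure follows from the rate limit alone. As a rough sketch (the function name is illustrative and not part of ruby-miner), the lower bound is the gem count divided by the request rate. Real runs take longer because each request also adds network latency.

```python
def estimate_runtime_hours(gem_count: int, requests_per_second: float) -> float:
    """Lower bound on runtime: counts only the enforced delay between requests."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    seconds = gem_count / requests_per_second
    return seconds / 3600


print(estimate_runtime_hours(180_000, 10))  # 5.0
```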
Data Source
- Gem Names: http://rubygems.org/names
- Gem Details: https://rubygems.org/api/v1/gems/{name}.json
License
MIT License - see LICENSE file for details
Gem Metadata Sources
For each gem, the script fetches:
{
"name": "rails",
"homepage_uri": "https://rubyonrails.org",
"source_code_uri": "https://github.com/rails/rails",
"project_uri": "https://rubygems.org/gems/rails"
}
The script prioritizes:
- Homepage: homepage_uri → project_uri → "nan"
- Repository: source_code_uri → homepage_uri → "nan"
Error Handling
If an API call fails (timeout, 404, etc.):
- Continues processing with "nan" values
- Logs no error (fails silently)
- Keeps one row per gem in the output, even when that gem's metadata could not be fetched
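A minimal sketch of this behavior, assuming the fetch step can be wrapped in a helper. Both `fetch_gem_info` and the `fetch` callable are hypothetical names used for illustration; in the real script the request would presumably go through requests. Returning an empty dict on failure lets later lookups fall through to "nan".

```python
import json
from typing import Callable


def fetch_gem_info(name: str, fetch: Callable[[str], str]) -> dict:
    """Return parsed gem metadata, or an empty dict if anything goes wrong.

    `fetch` is a hypothetical callable that takes a URL and returns the
    response body as text (for example a thin wrapper around requests.get).
    """
    url = f"https://rubygems.org/api/v1/gems/{name}.json"
    try:
        data = json.loads(fetch(url))
    except Exception:
        # Fail silently, as described above: nothing is logged.
        return {}
    return data if isinstance(data, dict) else {}


def failing_fetch(url: str) -> str:
    raise TimeoutError("simulated timeout")


info = fetch_gem_info("rails", failing_fetch)
print(info.get("homepage_uri") or "nan")  # nan
```

Logging the exception inside the `except` branch would make failed gems easier to retry later, at the cost of departing from the silent behavior.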
Files
- mine_ruby.py: Main script
- requirements.txt: Python dependencies (requests, tqdm)
- setup.sh: Automated setup script
- specs.4.8.gz: Temporary download file (deleted after use)
- specs.4.8: Temporary decompressed file (deleted after use)
- Output: ../../../Resource/Package/Package-List/Ruby_New.csv
Troubleshooting
"Error downloading gem names"
Check that:
- You have internet connectivity
- rubygems.org is accessible: curl -I http://rubygems.org/names
- No firewall is blocking the connection
Script is very slow
This is expected behavior:
- Rate limiting (10 requests/second) is intentional
- With 180K+ gems, expect 5+ hours runtime
- Consider running overnight or in background
To run in background:
nohup python mine_ruby.py > output.log 2>&1 &
"Permission denied" when creating output directory
Ensure you have write permissions to:
- Current directory (for temporary files)
- Resource/Package/Package-List/ (for output)
Incomplete data (many "nan" values)
This can occur if:
- API is temporarily unavailable
- Network issues during processing
- Some gems have incomplete metadata
Note: This is normal - not all gems have complete metadata on RubyGems.org.
Virtual environment issues
If you encounter errors related to the virtual environment:
- Delete the venv folder: rm -rf venv
- Re-run the setup script: ./setup.sh
- Virtual environments cannot be moved after creation; recreate the environment if you move the directory
Performance Notes
- Download Time: Fast (gem names list is small)
- Processing Time: SLOW (~5 hours for 180K+ gems)
- Memory Usage: Low (processes one gem at a time)
- Network Usage: Moderate (many small API requests)
Optimization Tips
To speed up processing (advanced users):
- Reduce the delay in time.sleep(0.1) (risks being rate-limited or blocked)
- Use parallel requests (requires code modification)
- Use RubyGems database dump instead of API (requires parsing Marshal format)
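The parallel-requests tip could look roughly like the sketch below. It is not part of ruby-miner: `fetch_all` and `fetch_one` are hypothetical names, and the sketch does not enforce any rate limit by itself, so the worker count should stay small to avoid being blocked.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def fetch_all(names: Iterable[str], fetch_one: Callable[[str], dict], workers: int = 4) -> list:
    """Fetch metadata for many gems concurrently, preserving input order.

    `fetch_one` is a hypothetical function that returns the metadata dict for
    one gem name; `workers` caps how many requests are in flight at once.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the same order as `names`.
        return list(pool.map(fetch_one, names))


# Stand-in fetcher so the example runs without network access.
print(fetch_all(["rails", "rake"], lambda name: {"name": name}, workers=2))
```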
Advantages
- Complete Data: Includes all public gems
- Official API: Uses RubyGems.org official endpoints
- Detailed Metadata: Gets homepage and repository URLs
- Reliable: Gracefully handles API failures
Limitations
- Slow Processing: Rate limiting means long runtime
- API Dependent: Requires RubyGems.org to be available
- Incomplete Metadata: Not all gems have homepage/repository info
- No Version Info: Only captures latest gem information
Alternative Approaches
For faster processing, consider:
- Database Dump: RubyGems provides database dumps (requires PostgreSQL)
- Bulk API: Some bulk endpoints may exist (check RubyGems API docs)
- Cached Data: Use previously downloaded data and update incrementally
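For the cached-data approach, one possible sketch is to read the names already present in a previous output file and fetch only the rest. The helper name and the "Name" column header are assumptions; adjust the header to match the real CSV.

```python
import csv
import io
from typing import Iterable, TextIO


def names_still_missing(all_names: Iterable[str], existing_csv: TextIO, name_column: str = "Name") -> list:
    """Return gem names that are not yet present in a previous output file.

    `name_column` is an assumed header, not confirmed by the script.
    """
    done = {row[name_column] for row in csv.DictReader(existing_csv)}
    return [name for name in all_names if name not in done]


previous = io.StringIO("Name,Homepage URL\nrails,https://rubyonrails.org\n")
print(names_still_missing(["rails", "rake"], previous))  # ['rake']
```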
Code Explanation
Architecture
The Ruby Miner uses a two-phase approach:
- Phase 1: Download complete list of gem names
- Phase 2: Fetch detailed metadata for each gem
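This is necessary because the names endpoint returns only gem names, so metadata has to be requested separately for each gem. RubyGems does publish database dumps, but they require PostgreSQL (see Alternative Approaches), whereas crates.io offers a single dump that is simpler to consume.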
1. Specs File Download
specs_url = "https://rubygems.org/specs.4.8.gz"
download_file(specs_url, specs_gz_path)
Purpose: Downloads compact specs (currently not parsed, but available for future use).
Note: The specs file is in Ruby Marshal format (binary), which is complex to parse in Python. Currently, we use the simpler names endpoint instead.
2. Gem Names List
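import requests
names_url = "http://rubygems.org/names"
response = requests.get(names_url)  # request call assumed; the original excerpt omits it
gem_names = response.text.strip().split('\n')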
Format: Simple newline-delimited text file.
rails
devise
rake
...
Advantages:
- Simple to parse
- Complete list of all gems
- Fast download
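A slightly more defensive way to parse the names response is sketched below; `parse_gem_names` is an illustrative name, not a function from the script. Unlike `strip().split('\n')`, it returns an empty list for an empty response and tolerates blank lines and Windows line endings.

```python
def parse_gem_names(text: str) -> list:
    """Split the names response into a list, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


print(parse_gem_names("rails\ndevise\nrake\n"))  # ['rails', 'devise', 'rake']
```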
3. Detailed Metadata Fetching
import time
import requests
for gem_name in gem_names:
    time.sleep(0.1)  # Rate limiting: at most ~10 requests per second
    response = requests.get(f"https://rubygems.org/api/v1/gems/{gem_name}.json")
Process:
- Iterate through each gem name
- Wait 0.1 seconds (rate limiting)
- Fetch JSON metadata
- Extract homepage and repository URLs
- Write to CSV immediately (streaming write)
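The steps above can be sketched as a single loop. This is an illustration under stated assumptions rather than the script's actual code: `mine` and `fetch_info` are hypothetical names, the column headers are assumed, and the URL fallback chain is left out for brevity.

```python
import csv
import io
import time
from typing import Callable, Iterable, TextIO


def mine(names: Iterable[str], fetch_info: Callable[[str], dict], out: TextIO, delay: float = 0.1) -> int:
    """Write one CSV row per gem as soon as its metadata arrives.

    `fetch_info` is a hypothetical function returning the metadata dict for a
    gem name. The column headers are assumed, not taken from the script.
    """
    writer = csv.writer(out)
    writer.writerow(["Name", "Homepage URL", "Repository URL"])
    count = 0
    for name in names:
        time.sleep(delay)  # rate limiting between requests
        info = fetch_info(name)
        homepage = info.get("homepage_uri") or "nan"
        repository = info.get("source_code_uri") or "nan"
        writer.writerow([name, homepage, repository])
        out.flush()  # streaming write: rows already written survive a crash
        count += 1
    return count


# Stand-in fetcher and in-memory file so the example runs offline.
buffer = io.StringIO()
mine(["example-gem"], lambda name: {"homepage_uri": "https://example.org"}, buffer, delay=0)
print(buffer.getvalue())
```

Writing and flushing each row immediately keeps memory use low and means an interrupted run still leaves a usable partial file.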
4. URL Extraction Priority
homepage_url = gem_info.get('homepage_uri', '') or gem_info.get('project_uri', '') or "nan"
repo_url = gem_info.get('source_code_uri', '') or gem_info.get('homepage_uri', '') or "nan"
Fallback Chain:
- Homepage: Try homepage_uri first, fall back to project_uri
- Repository: Try source_code_uri first, fall back to homepage_uri
Validation:
if homepage_url and not homepage_url.startswith('http'):
    homepage_url = "nan"
Keeps only values that start with "http", which covers both http:// and https:// URLs. This is a prefix check, not full URL validation.
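The fallback and the prefix check can also be combined in one helper. Note that this is a variant, not the script's exact logic: it checks each candidate before falling back, so a malformed homepage_uri no longer hides a usable project_uri. The name `first_http_url` is hypothetical.

```python
def first_http_url(info: dict, keys: list) -> str:
    """Return the first value among `keys` that starts with "http", else "nan"."""
    for key in keys:
        value = info.get(key)
        if isinstance(value, str) and value.startswith("http"):
            return value
    return "nan"


gem_info = {
    "name": "rails",
    "homepage_uri": "https://rubyonrails.org",
    "source_code_uri": "https://github.com/rails/rails",
    "project_uri": "https://rubygems.org/gems/rails",
}
print(first_http_url(gem_info, ["homepage_uri", "project_uri"]))      # https://rubyonrails.org
print(first_http_url(gem_info, ["source_code_uri", "homepage_uri"]))  # https://github.com/rails/rails
```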
File details
Details for the file ruby_miner-1.0.2.tar.gz.
File metadata
- Download URL: ruby_miner-1.0.2.tar.gz
- Upload date:
- Size: 11.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 988b7cee4dc105cb1487355d6e3b727766b3e4917edcede99ae84eda6d69b28f |
| MD5 | 14607e803e1fd80010846eeb368d47bb |
| BLAKE2b-256 | f7a0864da0d60e54d89b3b758a8b7740b4cf102e68b90765419cd5fd15b0b688 |
File details
Details for the file ruby_miner-1.0.2-py3-none-any.whl.
File metadata
- Download URL: ruby_miner-1.0.2-py3-none-any.whl
- Upload date:
- Size: 9.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1436d7ad9d098b21daee48aa88819014072fd7af23b94f23afef0dd673c4d58a |
| MD5 | 1cebafac0bce17ed8376c1a5ac236a37 |
| BLAKE2b-256 | d58c4aab98ab8e51280a11534216269e95c5dc3670de97ba3a4b48efc3e01e17 |