Mine and extract complete package lists from RubyGems registry
Ruby Gems Miner
This tool downloads and processes the RubyGems.org package list to extract Ruby gem information for cross-ecosystem analysis.
Features
- Downloads gem list from RubyGems.org
- Fetches detailed metadata via RubyGems API
- Extracts package metadata (ID, name, homepage, repository)
- Formats data for cross-ecosystem package analysis
- Progress tracking with visual feedback
- Rate-limited API calls to respect server resources
- Generates standardized CSV output compatible with Package-Filter
Setup
Run the setup script
chmod +x setup.sh
./setup.sh
This will:
- Create a virtual environment
- Install required dependencies (requests, tqdm)
- Prepare the environment for mining
Manual Setup (Alternative)
If you prefer to set up manually:
# Create virtual environment
python3 -m venv venv
# Activate virtual environment
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
Important: Virtual environments contain hardcoded paths and cannot be moved after creation. If you need to relocate this script:
- Delete the venv folder
- Recreate it in the new location
- Reinstall the packages
Usage
Mine all Ruby gems from RubyGems.org:
source venv/bin/activate
python mine_ruby.py
Or run directly without activating:
venv/bin/python mine_ruby.py
The script will:
- Download the compact specs file (specs.4.8.gz)
- Download the list of all gem names from RubyGems.org
- Fetch detailed information for each gem via API
- Generate CSV output in Resource/Package/Package-List/Ruby_New.csv
- Clean up temporary files
Data Sources
- Gem Names: http://rubygems.org/names
- Gem Details: https://rubygems.org/api/v1/gems/{name}.json
- Format: Plain text list (names) and JSON (details)
Output Format
The script generates Ruby_New.csv in the Resource/Package/Package-List/ directory with the following structure:
ID,Platform,Name,Homepage URL,Repository URL
1,RubyGems,rails,https://rubyonrails.org,https://github.com/rails/rails
2,RubyGems,devise,https://github.com/heartcombo/devise,https://github.com/heartcombo/devise
3,RubyGems,rake,https://github.com/ruby/rake,https://github.com/ruby/rake
Column Descriptions
- ID: Sequential identifier (1, 2, 3, ...)
- Platform: Always "RubyGems" for Ruby packages
- Name: Gem name as registered on RubyGems.org
- Homepage URL: Project homepage (from gem metadata)
- Repository URL: Source code repository URL
Note: This format is compatible with the Package-Filter tool for cross-ecosystem analysis.
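As an illustration of the layout above, here is a minimal sketch of writing rows in this format with Python's csv module. The function name write_rows and the (name, homepage, repository) tuple layout are assumptions made for this example, not taken from mine_ruby.py.

```python
import csv
import io

# Column order as documented in the Output Format section.
HEADER = ["ID", "Platform", "Name", "Homepage URL", "Repository URL"]

def write_rows(out, gems):
    """Write the header, then one row per (name, homepage, repository) tuple.

    Hypothetical helper: IDs are assigned sequentially starting at 1.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(HEADER)
    for gem_id, (name, homepage, repo) in enumerate(gems, start=1):
        writer.writerow([gem_id, "RubyGems", name, homepage, repo])

# Usage: write to an in-memory buffer instead of a real file.
buf = io.StringIO()
write_rows(buf, [("rails", "https://rubyonrails.org", "https://github.com/rails/rails")])
```

Using csv.writer rather than manual string joining means a gem name or URL containing a comma is quoted correctly.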
Processing Details
API Rate Limiting
The script implements rate limiting to avoid overwhelming the RubyGems API:
- Rate: 10 requests per second (0.1 second delay between requests)
- Purpose: Respectful API usage, avoiding server load
- Impact: Processing time increases with number of gems
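Estimated Time: With ~180,000 gems and a 0.1 second delay per request, the delays alone add up to ~5 hours (180,000 × 0.1 s = 18,000 s). The time each request takes to complete adds to that.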
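The arithmetic behind this estimate can be sketched as follows. The function estimated_hours and its request_seconds parameter are hypothetical, added here only to show how per-request latency would lengthen the run.

```python
def estimated_hours(gem_count, delay_seconds=0.1, request_seconds=0.0):
    """Rough runtime in hours for a sequential crawl.

    Each gem costs the fixed delay plus however long the request itself takes.
    With request_seconds=0 this is a lower bound.
    """
    return gem_count * (delay_seconds + request_seconds) / 3600

# Delay only: 180,000 gems * 0.1 s = 18,000 s, i.e. about 5 hours.
lower_bound = estimated_hours(180_000)

# Assuming (hypothetically) 0.2 s of latency per request: about 15 hours.
with_latency = estimated_hours(180_000, request_seconds=0.2)
```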
Gem Metadata Sources
For each gem, the script fetches:
{
  "name": "rails",
  "homepage_uri": "https://rubyonrails.org",
  "source_code_uri": "https://github.com/rails/rails",
  "project_uri": "https://rubygems.org/gems/rails"
}
The script prioritizes:
- Homepage: homepage_uri → project_uri → "nan"
- Repository: source_code_uri → homepage_uri → "nan"
Error Handling
If an API call fails (timeout, 404, etc.):
- Continues processing with "nan" values
- Logs no error (fails silently)
- Ensures complete dataset even with some missing data
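The behavior described above could look roughly like the sketch below. The name fetch_urls and the injected fetch callable are assumptions for illustration; the actual script may structure this differently, and the fallback chain between fields is left out for brevity.

```python
def fetch_urls(gem_name, fetch):
    """Return (homepage, repository) for a gem, or ("nan", "nan") on any failure.

    `fetch` is a hypothetical callable that takes a gem name and returns the
    decoded JSON dict, raising on timeouts, 404s, and similar errors.
    """
    try:
        info = fetch(gem_name)
    except Exception:
        # Matches the documented behavior: no logging, just placeholder values.
        return ("nan", "nan")
    # Missing, empty, or null fields also become "nan".
    return (info.get("homepage_uri") or "nan",
            info.get("source_code_uri") or "nan")
```

Passing the fetch function in as a parameter keeps the error handling testable without network access.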
Files
- mine_ruby.py: Main script
- requirements.txt: Python dependencies (requests, tqdm)
- setup.sh: Automated setup script
- specs.4.8.gz: Temporary download file (deleted after use)
- specs.4.8: Temporary decompressed file (deleted after use)
- Output: ../../../Resource/Package/Package-List/Ruby_New.csv
Troubleshooting
"Error downloading gem names"
Check that:
- You have internet connectivity
- rubygems.org is accessible: curl -I http://rubygems.org/names
- No firewall blocking the connection
Script is very slow
This is expected behavior:
- Rate limiting (10 requests/second) is intentional
- With 180K+ gems, expect 5+ hours runtime
- Consider running overnight or in background
To run in background:
nohup python mine_ruby.py > output.log 2>&1 &
"Permission denied" when creating output directory
Ensure you have write permissions to:
- Current directory (for temporary files)
- Resource/Package/Package-List/ (for output)
Incomplete data (many "nan" values)
This can occur if:
- API is temporarily unavailable
- Network issues during processing
- Some gems have incomplete metadata
Note: This is normal - not all gems have complete metadata on RubyGems.org.
Virtual environment issues
If you encounter errors related to the virtual environment:
- Delete the venv folder: rm -rf venv
- Re-run the setup script: ./setup.sh
- Virtual environments cannot be moved after creation; recreate the environment if you move the directory
Performance Notes
- Download Time: Fast (gem names list is small)
- Processing Time: SLOW (~5 hours for 180K+ gems)
- Memory Usage: Low (processes one gem at a time)
- Network Usage: Moderate (many small API requests)
Optimization Tips
To speed up processing (advanced users):
- Reduce delay in time.sleep(0.1) (risks being rate-limited or blocked)
- Use parallel requests (requires code modification)
- Use RubyGems database dump instead of API (requires parsing Marshal format)
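For the parallel-requests option, one possible shape is a small thread pool, sketched below. This is not part of mine_ruby.py; fetch_all, the injected fetch callable, and the max_workers default are all assumptions. Running requests in parallel raises the effective request rate, so the same risk of being rate-limited or blocked applies.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(gem_names, fetch, max_workers=4):
    """Fetch metadata for many gems concurrently.

    `fetch` is a hypothetical callable (gem name -> dict) that should handle
    its own errors, because an exception raised inside it propagates here.
    Results come back in the same order as gem_names.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, gem_names))
```

Threads suit this workload because the time is spent waiting on the network rather than on computation.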
Advantages
- Complete Data: Includes all public gems
- Official API: Uses RubyGems.org official endpoints
- Detailed Metadata: Gets homepage and repository URLs
- Reliable: Gracefully handles API failures
Limitations
- Slow Processing: Rate limiting means long runtime
- API Dependent: Requires RubyGems.org to be available
- Incomplete Metadata: Not all gems have homepage/repository info
- No Version Info: Only captures latest gem information
Alternative Approaches
For faster processing, consider:
- Database Dump: RubyGems provides database dumps (requires PostgreSQL)
- Bulk API: Some bulk endpoints may exist (check RubyGems API docs)
- Cached Data: Use previously downloaded data and update incrementally
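The cached-data approach could be sketched as follows: read the names already present in an earlier Ruby_New.csv and fetch only the rest. The helper names are hypothetical, and the sketch assumes the CSV header shown in the Output Format section.

```python
import csv
import io

def names_already_mined(csv_file):
    """Return the set of gem names found in an existing output CSV."""
    reader = csv.DictReader(csv_file)
    return {row["Name"] for row in reader}

def names_to_fetch(all_names, csv_file):
    """Keep only names missing from the cached CSV, preserving input order."""
    done = names_already_mined(csv_file)
    return [name for name in all_names if name not in done]

# Usage with an in-memory stand-in for a previously generated file.
cached = io.StringIO(
    "ID,Platform,Name,Homepage URL,Repository URL\n"
    "1,RubyGems,rails,https://rubyonrails.org,https://github.com/rails/rails\n"
)
remaining = names_to_fetch(["rails", "rake"], cached)
```

Note that this only finds gems that are new since the last run; it does not refresh metadata for gems already in the cache.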
Code Explanation
Architecture
The Ruby Miner uses a two-phase approach:
- Phase 1: Download complete list of gem names
- Phase 2: Fetch detailed metadata for each gem
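This is necessary because the names endpoint returns only gem names, so homepage and repository URLs must be fetched per gem. Unlike crates.io, RubyGems does not offer a single ready-to-parse metadata file; its database dumps require PostgreSQL.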
1. Specs File Download
specs_url = "https://rubygems.org/specs.4.8.gz"
download_file(specs_url, specs_gz_path)
Purpose: Downloads compact specs (currently not parsed, but available for future use).
Note: The specs file is in Ruby Marshal format (binary), which is complex to parse in Python. Currently, we use the simpler names endpoint instead.
2. Gem Names List
names_url = "http://rubygems.org/names"
gem_names = response.text.strip().split('\n')
Format: Simple newline-delimited text file.
rails
devise
rake
...
Advantages:
- Simple to parse
- Complete list of all gems
- Fast download
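A slightly more defensive variant of the parsing step is sketched below. The function name parse_gem_names is an assumption. Unlike strip().split('\n'), which returns [''] for an empty response, this version skips blank lines and tolerates Windows-style line endings.

```python
def parse_gem_names(text):
    """Split a newline-delimited names response into a list of gem names.

    Blank lines and surrounding whitespace are dropped.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]

names = parse_gem_names("rails\ndevise\nrake\n")
```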
3. Detailed Metadata Fetching
for gem_name in gem_names:
    time.sleep(0.1)  # Rate limiting
    response = requests.get(f"https://rubygems.org/api/v1/gems/{gem_name}.json")
Process:
- Iterate through each gem name
- Wait 0.1 seconds (rate limiting)
- Fetch JSON metadata
- Extract homepage and repository URLs
- Write to CSV immediately (streaming write)
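The steps above can be sketched as a generator that produces one row at a time, so the caller can write each row as soon as it arrives. The names iter_rows, fetch, and sleep are hypothetical; sleep is injectable only so the pacing can be checked without waiting, and the field fallbacks are simplified.

```python
import time

def iter_rows(gem_names, fetch, delay=0.1, sleep=time.sleep):
    """Yield (id, platform, name, homepage, repository) for each gem in turn.

    Pauses for `delay` seconds before every request. `fetch` is a
    hypothetical callable mapping a gem name to its decoded JSON dict.
    """
    for gem_id, name in enumerate(gem_names, start=1):
        sleep(delay)  # rate limiting before each request
        info = fetch(name)
        yield (gem_id, "RubyGems", name,
               info.get("homepage_uri") or "nan",
               info.get("source_code_uri") or "nan")
```

Because rows are produced lazily, memory use stays low no matter how many gems are processed.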
4. URL Extraction Priority
homepage_url = gem_info.get('homepage_uri', '') or gem_info.get('project_uri', '') or "nan"
repo_url = gem_info.get('source_code_uri', '') or gem_info.get('homepage_uri', '') or "nan"
Fallback Chain:
- Homepage: Try homepage_uri first, fall back to project_uri
- Repository: Try source_code_uri first, fall back to homepage_uri
Validation:
if homepage_url and not homepage_url.startswith('http'):
    homepage_url = "nan"
Keeps only values that begin with "http", which covers both HTTP and HTTPS URLs. This is a prefix check, not full URL validation.
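A variation on this fallback logic is sketched below. It differs from the script's behavior as described: each candidate field is checked for an http:// or https:// prefix before being accepted, so an invalid first choice falls through to the next field instead of becoming "nan". The names pick_url and extract_urls are assumptions.

```python
def pick_url(info, *keys):
    """Return the first field among `keys` holding an http(s) URL, else "nan"."""
    for key in keys:
        value = info.get(key) or ""  # treats missing and null fields alike
        if value.startswith(("http://", "https://")):
            return value
    return "nan"

def extract_urls(gem_info):
    """Apply the documented priority order for homepage and repository."""
    homepage = pick_url(gem_info, "homepage_uri", "project_uri")
    repo = pick_url(gem_info, "source_code_uri", "homepage_uri")
    return homepage, repo
```

Matching on the full "http://" and "https://" prefixes is stricter than startswith('http'), which would also accept a value such as "httpfoo".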