Mine and extract complete package lists from crates.io registry
Project description
Crates.io Package Miner
A Python tool to mine and extract complete package lists from the crates.io (Rust) registry.
Features
- Downloads official crates.io database dump (~1000 MB)
- Extracts package metadata including homepage and repository URLs
- Progress tracking with visual feedback
- Automatic cleanup of temporary files
- Outputs standardized CSV format for cross-ecosystem analysis
Installation
pip install crates-miner
Quick Start
crates-miner
Or use as a Python module:
from crates_miner import mine_crates
mine_crates()
Output
Generates a CSV file with crate information:
- Package ID, Platform, Name
- Homepage URL, Repository URL
Performance
- Runtime: Fast (depends on download speed)
- No API rate limits (uses database dump)
- Processes ~100,000+ crates
- Download size: ~1000 MB compressed
Data Source
- Crates.io Database Dump: https://static.crates.io/db-dump.tar.gz
License
MIT License - see LICENSE file for details
Processing Details
Database Dump Structure
The downloaded archive contains a dated directory (e.g., 2025-11-03-020107) with:
2025-11-03-020107/
├── data/
│ ├── crates.csv ← Used by this script
│ ├── versions.csv
│ ├── dependencies.csv
│ ├── teams.csv
│ └── ... (other files)
└── metadata/
└── ... (metadata files)
Extraction Process
- Download: Fetches
db-dump.tar.gzfrom static.crates.io - Extract: Decompresses to
Code/Script/Crates-Miner/crates-db/ - Process: Reads
data/crates.csvfrom the dated directory - Transform: Converts to standardized format
- Output: Writes to
Resource/Package/Package-List/Crates_New.csv - Cleanup: Deletes the
.tar.gzfile (keeps extracted data)
Data Transformation
The script transforms crates.io data to match the cross-ecosystem format:
Input (crates.csv):
id,name,homepage,repository,description,downloads,...
12345,serde,https://serde.rs,https://github.com/serde-rs/serde,"A serialization framework",50000000,...
Output (Crates_New.csv):
ID,Platform,Name,Homepage URL,Repository URL
1,Crates.io,serde,https://serde.rs,https://github.com/serde-rs/serde
Files
mine_crates.py: Main scriptrequirements.txt: Python dependencies (requests, pandas, tqdm)setup.sh: Automated setup scriptcrates-db/: Temporary directory for extracted database (created during execution)- Output:
../../../Resource/Package/Package-List/Crates_New.csv
Troubleshooting
"Could not download database dump"
Check that:
- You have internet connectivity
- crates.io is accessible:
curl -I https://static.crates.io/db-dump.tar.gz - You have sufficient disk space (~500 MB for extraction)
"crates.csv not found"
This may occur if:
- The database dump structure has changed
- Extraction failed partway through
- The archive is corrupted
Solution: Delete crates-db/ directory and run again to re-download.
"Permission denied" when creating output directory
Ensure you have write permissions to:
- Current directory (for temporary files)
Resource/Package/Package-List/(for output)
"Memory error" during processing
The crates.io database is large (100K+ crates). If you encounter memory issues:
- Close other applications
- Increase available system memory
- Consider processing in chunks (requires script modification)
Virtual environment issues
If you encounter errors related to the virtual environment:
- Delete the
venvfolder:rm -rf venv - Re-run the setup script:
./setup.sh - Virtual environments cannot be moved after creation - recreate if you move the directory
Code Explanation
Architecture Overview
The Crates.io Miner is a simpler tool compared to the Directory Structure Miners, focused on a single task: downloading and processing the official crates.io database dump.
Key characteristics:
- Batch Processing: Downloads entire database at once
- Official Source: Uses crates.io's public database dump
- No API Calls: Works with static data dump (no rate limiting)
- Large Scale: Processes 100K+ packages
1. Download Function
def download_file(url, filename):
"""Downloads a file from a URL with a progress bar."""
response = requests.get(url, stream=True)
response.raise_for_status()
total_size = int(response.headers.get("content-length", 0))
block_size = 1024 # 1 Kilobyte
with open(filename, "wb") as f, tqdm(
desc=filename,
total=total_size,
unit="iB",
unit_scale=True,
unit_divisor=1024,
) as bar:
for data in response.iter_content(block_size):
bar.update(len(data))
f.write(data)
Purpose: Downloads large files with progress tracking.
Features:
- Streaming: Uses
stream=Trueto avoid loading entire file in memory - Progress Bar: Shows download progress with
tqdm - Chunk Processing: Downloads in 1KB chunks
- Size Display: Shows human-readable units (MB, GB)
Process:
- Make HTTP GET request with streaming
- Get total file size from headers
- Open file in binary write mode
- Download and write in chunks
- Update progress bar after each chunk
2. Main Mining Function
def mine_crates():
"""Mines crates.io to get the whole list of Rust packages from the database dump."""
dump_url = "https://static.crates.io/db-dump.tar.gz"
dump_path = "db-dump.tar.gz"
extract_path = "Code/Script/Crates-Miner/crates-db"
Purpose: Orchestrates the entire mining process.
Configuration:
- dump_url: Official crates.io database dump URL
- dump_path: Local filename for downloaded archive
- extract_path: Directory for extracted files
3. Download Phase
# Download the database dump
if not os.path.exists(dump_path):
print("Downloading crates.io database dump...")
download_file(dump_url, dump_path)
else:
print("Database dump already downloaded.")
Logic:
- Checks if archive already exists
- Skips download if file present (saves time on reruns)
- Downloads ~100-200 MB compressed file
Why This Matters:
- Avoids re-downloading large file unnecessarily
- Useful during development/testing
- Saves bandwidth and time
4. Extraction Phase
# Extract the database dump
print("Extracting database dump...")
with tarfile.open(dump_path, "r:gz") as tar:
tar.extractall(path=extract_path)
# Delete the tar.gz file
if os.path.exists(dump_path):
print("Deleting database dump archive...")
os.remove(dump_path)
Process:
- Open Archive: Opens .tar.gz file in read mode
- Extract All: Extracts to
crates-db/directory - Cleanup: Removes archive to save disk space
Tar Format: r:gz means read mode with gzip compression
5. Directory Discovery
# Find the actual data directory (it has a date in the name)
data_dir = ""
for item in os.listdir(extract_path):
if os.path.isdir(os.path.join(extract_path, item)):
data_dir = os.path.join(extract_path, item)
break
if not data_dir:
print("Could not find data directory in the extracted archive.")
return
Purpose: Finds the dated directory containing the actual data.
Why Dynamic: The database dump directory name changes with each snapshot:
2025-11-03-020107/2025-11-04-020107/- etc.
Logic:
- List all items in extraction path
- Find first directory (not file)
- Use that as data directory
- Error if no directory found
6. CSV Path Construction
crates_csv_path = os.path.join(data_dir, "data", "crates.csv")
if not os.path.exists(crates_csv_path):
print(f"crates.csv not found in {data_dir}")
return
Path Structure: crates-db/{date}/data/crates.csv
Validation: Checks file exists before attempting to read
7. Data Processing
print("Processing crate data...")
# Read the crates data
df = pd.read_csv(crates_csv_path)
# Create the path to the output file
output_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', '..', '..', 'Resource', 'Package', 'Package-List'))
if not os.path.exists(output_dir):
os.makedirs(output_dir)
output_file = os.path.join(output_dir, "Crates_New.csv")
Reading:
- Uses pandas
read_csv()for efficient processing - Automatically handles CSV parsing and data types
Output Path:
- Navigates up 3 directories from script location
- Ensures output directory exists (
makedirs) - Constructs full path to output file
Path Navigation:
Current: Code/Script/Crates-Miner/mine_crates.py
Up 1: Code/Script/Crates-Miner/
Up 2: Code/Script/
Up 3: Code/
Result: Code/../../../Resource/Package/Package-List/
8. CSV Writing
with open(output_file, "w", newline="", encoding="utf-8") as f:
writer = csv.writer(f)
writer.writerow(["ID", "Platform", "Name", "Homepage URL", "Repository URL"])
for index, row in tqdm(df.iterrows(), total=df.shape[0], desc="Writing to CSV"):
writer.writerow([
index + 1,
"Crates.io",
row["name"],
row["homepage"],
row["repository"],
])
print(f"Successfully saved {df.shape[0]} crates to {output_file}")
Process:
- Open File: In write mode with UTF-8 encoding
- Write Header: Column names for standardized format
- Iterate Rows: Loop through pandas DataFrame with progress bar
- Transform Data: Convert each row to standard format
- Write Row: Add to output CSV
Data Transformation:
- ID: Uses index + 1 (1-based instead of 0-based)
- Platform: Hardcoded as "Crates.io"
- Name: Direct mapping from
row["name"] - Homepage URL: Direct mapping from
row["homepage"] - Repository URL: Direct mapping from
row["repository"]
Progress Tracking:
tqdm()withtotal=df.shape[0]shows percentage complete- Useful for large datasets (100K+ rows)
- Provides time estimates
Workflow Summary
┌─────────────────────────────────────────────────────────────┐
│ Crates.io Miner Workflow │
└─────────────────────────────────────────────────────────────┘
1. Check if db-dump.tar.gz exists
├─ Yes → Skip download
└─ No → Download from static.crates.io (~200 MB)
2. Extract db-dump.tar.gz
├─ Decompress with gzip
├─ Extract tar to crates-db/
└─ Delete archive file
3. Find dated directory
├─ Scan crates-db/
└─ Locate {date}/data/crates.csv
4. Load data with pandas
└─ Read entire crates.csv into DataFrame
5. Create output directory structure
└─ Resource/Package/Package-List/
6. Transform and write CSV
├─ Header: ID, Platform, Name, Homepage URL, Repository URL
├─ For each crate:
│ ├─ Generate sequential ID
│ ├─ Set Platform = "Crates.io"
│ ├─ Copy name, homepage, repository
│ └─ Write row
└─ Show progress bar
7. Complete
└─ Print success message with count
Error Handling
The script includes basic error handling for common scenarios:
Download Errors:
response.raise_for_status()
- Raises exception for HTTP errors (404, 500, etc.)
- Stops execution if download fails
File Not Found:
if not os.path.exists(crates_csv_path):
print(f"crates.csv not found in {data_dir}")
return
- Checks for expected files
- Gracefully exits with error message
Directory Creation:
if not os.path.exists(output_dir):
os.makedirs(output_dir)
- Creates output directory if missing
- Prevents write errors
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file crates_miner-1.0.2.tar.gz.
File metadata
- Download URL: crates_miner-1.0.2.tar.gz
- Upload date:
- Size: 7.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
190da68ddb6cb210be53df64ce1435c91df189bbce9b80bc644e01928e094561
|
|
| MD5 |
5f16d1bde43ab1aabf4aa71a0040c0e4
|
|
| BLAKE2b-256 |
ee3078423acdae662182e35fc37b3a2239e913c62646deb73d1943e24f88adf5
|
File details
Details for the file crates_miner-1.0.2-py3-none-any.whl.
File metadata
- Download URL: crates_miner-1.0.2-py3-none-any.whl
- Upload date:
- Size: 8.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b884d3b04f6b29128b965417d734aba00aabe35be4db577593a362414a1b0a5d
|
|
| MD5 |
bdbe73342cc29cb789ca30427ee6bdb5
|
|
| BLAKE2b-256 |
0dba27e2aaa1635a2b9ce9fda41b378b1517a0712f33a11b935590827624806a
|