Mine and extract complete package lists from PyPI registry
PyPI Package Miner
This tool mines the PyPI (Python Package Index) to collect information about all Python packages.
Features
- Fetches complete list of PyPI packages from the official simple API
- Retrieves package metadata including homepage and repository URLs via PyPI JSON API
- Parallel processing with 50 workers for efficient data collection
- Extracts repository URLs from multiple metadata fields (project_urls and home_page)
- Progress tracking with visual feedback
- Outputs to CSV format compatible with cross-ecosystem analysis
Setup
Run the setup script
chmod +x setup.sh
./setup.sh
This will:
- Create a virtual environment
- Install required dependencies (requests, tqdm)
Manual Setup (Alternative)
# Create virtual environment
python3 -m venv .venv
# Activate virtual environment
source .venv/bin/activate
# Install dependencies
pip install -r requirements.txt
Usage
source .venv/bin/activate
python mine_pypi.py
The script will:
- Download the complete list of package names from PyPI simple index (~500k packages)
- Fetch detailed metadata for each package in parallel
- Save results to ../../../Resource/Package/Package-List/PyPI.csv
Output Format
CSV file with columns:
- ID: Sequential package identifier
- Platform: "PyPI"
- Name: Package name
- Homepage URL: Package homepage URL (from package metadata)
- Repository URL: Source code repository URL (extracted from project_urls or home_page)
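As a rough illustration of the layout above, the following sketch writes two rows with Python's standard csv module. The exact header spelling, the IDs starting at 1, and the helper name write_rows are assumptions made for this example, not details confirmed by the tool.

```python
import csv
import io

# Column names taken from the list above; the exact header spelling is assumed.
COLUMNS = ["ID", "Platform", "Name", "Homepage URL", "Repository URL"]


def write_rows(packages, out):
    """Write (name, homepage, repository) tuples as CSV rows with sequential IDs.

    Hypothetical helper: IDs are assumed to start at 1, and missing values
    are written as "nan".
    """
    writer = csv.writer(out)
    writer.writerow(COLUMNS)
    for idx, (name, homepage, repo) in enumerate(packages, start=1):
        writer.writerow([idx, "PyPI", name, homepage or "nan", repo or "nan"])


# Sample rows for illustration only.
buf = io.StringIO()
write_rows(
    [
        ("requests", "https://requests.readthedocs.io", "https://github.com/psf/requests"),
        ("example-pkg", None, None),
    ],
    buf,
)
print(buf.getvalue())
```

Writing to any file-like object keeps the sketch easy to try in memory; passing an open file handle instead of io.StringIO would produce a file on disk.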
Data Source
- Simple Index: https://pypi.org/simple/
- Package metadata: https://pypi.org/pypi/{package-name}/json
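A minimal sketch of how these two endpoints might be queried. It assumes the JSON form of the simple index (requested with the application/vnd.pypi.simple.v1+json Accept header); whether the tool parses JSON or HTML is not stated here. The sketch uses only the standard library to stay dependency-free, and all function names are hypothetical.

```python
import json
import urllib.request

SIMPLE_INDEX_URL = "https://pypi.org/simple/"
METADATA_URL = "https://pypi.org/pypi/{name}/json"


def parse_simple_index(payload):
    """Return project names from a JSON simple-index response."""
    return [project["name"] for project in payload.get("projects", [])]


def metadata_url(name):
    """Build the JSON metadata URL for one package."""
    return METADATA_URL.format(name=name)


def fetch_json(url, accept="application/json", timeout=30):
    """Fetch a URL and decode the body as JSON (performs a network request)."""
    request = urllib.request.Request(url, headers={"Accept": accept})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def list_package_names():
    """Download the full package list (network access; not called below)."""
    payload = fetch_json(
        SIMPLE_INDEX_URL, accept="application/vnd.pypi.simple.v1+json"
    )
    return parse_simple_index(payload)


# Offline demonstration with a hand-written payload of the same shape.
sample = {
    "meta": {"api-version": "1.0"},
    "projects": [{"name": "requests"}, {"name": "tqdm"}],
}
print(parse_simple_index(sample))
print(metadata_url("requests"))
```

Keeping the parsing separate from the network call makes the parsing step testable without contacting PyPI.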
Repository URL Detection
The script intelligently searches for repository URLs in the following order:
1. project_urls field with keys: Source, Source Code, Repository, Code, GitHub, GitLab
2. home_page field if it contains github.com, gitlab.com, or bitbucket.org
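The lookup order above can be sketched as follows. This is an approximation under stated assumptions: keys are matched exactly (case-sensitive), the input is the info object from the PyPI JSON response, and the function name extract_repository_url is hypothetical.

```python
# Keys and hosts as listed above; exact, case-sensitive key matching is assumed.
REPO_KEYS = ("Source", "Source Code", "Repository", "Code", "GitHub", "GitLab")
REPO_HOSTS = ("github.com", "gitlab.com", "bitbucket.org")


def extract_repository_url(info):
    """Return a repository URL from package metadata, or None if none is found."""
    # Step 1: look for a known key in project_urls, in priority order.
    project_urls = info.get("project_urls") or {}
    for key in REPO_KEYS:
        url = project_urls.get(key)
        if url:
            return url
    # Step 2: fall back to home_page when it points at a known code host.
    home_page = info.get("home_page") or ""
    if any(host in home_page for host in REPO_HOSTS):
        return home_page
    return None


# Sample metadata for illustration only.
print(extract_repository_url(
    {"project_urls": {"Documentation": "https://example.org/docs",
                      "Source": "https://github.com/example/project"}}
))
print(extract_repository_url({"home_page": "https://gitlab.com/example/project"}))
print(extract_repository_url({"home_page": "https://example.org"}))
```

The `or {}` and `or ""` guards cover packages whose metadata sets these fields to null rather than omitting them.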
Performance
- Expected runtime: 3-8 hours for ~500k packages
- 50 parallel workers for API requests
- Network-dependent (typically limited by API rate and network speed)
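One way to run requests across 50 workers is a thread pool. The sketch below assumes concurrent.futures.ThreadPoolExecutor, which may differ from what the tool uses, and substitutes a fake fetch function for the real API call so it runs offline. The names fetch_all and fake_fetch are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = 50  # worker count stated above


def fetch_all(names, fetch_one, max_workers=MAX_WORKERS):
    """Call fetch_one(name) for every name in parallel.

    A failed call is recorded as None so one bad package does not stop the run.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_one, name): name for name in names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception:
                results[name] = None
    return results


def fake_fetch(name):
    """Stand-in for a metadata request; fails for one name to show error handling."""
    if name == "broken":
        raise RuntimeError("simulated API error")
    return {"name": name}


print(fetch_all(["requests", "broken", "tqdm"], fake_fetch))
```

Threads suit this workload because each worker spends most of its time waiting on the network. Results arrive in completion order, so the dictionary's key order can vary between runs.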
Notes
- PyPI is continuously updated, so package counts may vary
- Repository URLs are validated to start with http:// or https://
- Missing or invalid URLs are marked as "nan"
- The script handles API errors gracefully and continues processing
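The URL check and the "nan" marker described above might look like the sketch below. Trimming whitespace and ignoring letter case in the scheme are assumptions, and normalize_url is a hypothetical name.

```python
def normalize_url(value):
    """Return the URL if it starts with http:// or https://, otherwise "nan"."""
    if isinstance(value, str):
        candidate = value.strip()  # assumed: surrounding whitespace is ignored
        # assumed: the scheme check is case-insensitive
        if candidate.lower().startswith(("http://", "https://")):
            return candidate
    return "nan"


# Sample values for illustration only.
print(normalize_url("https://github.com/example/project"))
print(normalize_url("git@github.com:example/project.git"))
print(normalize_url(None))
```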
File details
Details for the file pypi_miner-1.0.1.tar.gz.
File metadata
- Download URL: pypi_miner-1.0.1.tar.gz
- Upload date:
- Size: 7.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | acb29759a00b6379ce4910040b47536c9b160cddc1d003ca7f29b59f57668ce1 |
| MD5 | f37ff6ae6e95cb8e85e7dce1a87ecac4 |
| BLAKE2b-256 | d335a3a477d72dcffecc44ba51c79bf3143ca072739fb3ece2a49b3aa250d07a |
File details
Details for the file pypi_miner-1.0.1-py3-none-any.whl.
File metadata
- Download URL: pypi_miner-1.0.1-py3-none-any.whl
- Upload date:
- Size: 7.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 45d984cfb22c8bef8550cbaeeb9862228ed7ebe09b843ed501b4116c4a498518 |
| MD5 | 9ce6225c8e43c919e72fd59283dceced |
| BLAKE2b-256 | 95eec1791dcb33afd261d8f1ca22d881d54abf1a8369717e8efa1b86a82359d6 |