Mine and extract the complete package list from the PyPI registry

PyPI Package Miner

This tool mines the PyPI (Python Package Index) to collect information about all Python packages.

Features

  • Fetches the complete list of PyPI packages from the official simple API
  • Retrieves package metadata, including homepage and repository URLs, via the PyPI JSON API
  • Parallel processing with 50 workers for efficient data collection
  • Extracts repository URLs by checking several metadata fields in a fixed order
  • Progress tracking with visual feedback
  • Outputs to CSV format compatible with cross-ecosystem analysis

Setup

Run the setup script

chmod +x setup.sh
./setup.sh

This will:

  • Create a virtual environment
  • Install required dependencies (requests, tqdm)

Manual Setup (Alternative)

# Create virtual environment
python3 -m venv .venv

# Activate virtual environment
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Usage

source .venv/bin/activate
python mine_pypi.py

The script will:

  1. Download the complete list of package names from the PyPI simple index (~500k packages)
  2. Fetch detailed metadata for each package in parallel
  3. Save results to ../../../Resource/Package/Package-List/PyPI.csv
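
The first two steps above can be sketched as follows. This is a minimal illustration, not the code in mine_pypi.py: it uses the standard library instead of requests so it has no dependencies, it assumes the JSON form of the simple index (requested through the Accept header), and the function names (http_get_json, list_package_names, fetch_metadata, fetch_all) are hypothetical.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

SIMPLE_INDEX_URL = "https://pypi.org/simple/"
JSON_API_URL = "https://pypi.org/pypi/{name}/json"


def http_get_json(url, accept=None):
    """Fetch a URL and decode its JSON body (standard library only)."""
    headers = {"Accept": accept} if accept else {}
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def list_package_names(get_json=http_get_json):
    """Step 1: read every project name from the simple index.

    Assumption: the index is requested in its JSON representation,
    which returns {"projects": [{"name": ...}, ...]}.
    """
    index = get_json(SIMPLE_INDEX_URL, "application/vnd.pypi.simple.v1+json")
    return [project["name"] for project in index["projects"]]


def fetch_metadata(name, get_json=http_get_json):
    """Step 2: fetch the "info" block for one package; None on any failure."""
    try:
        return get_json(JSON_API_URL.format(name=name))["info"]
    except Exception:
        # A failed request should not stop the whole run.
        return None


def fetch_all(names, workers=50, get_json=http_get_json):
    """Fetch metadata for all names in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda n: fetch_metadata(n, get_json), names))
```

Calling fetch_all(list_package_names()) would run both steps against the live registry. The get_json parameter exists only so the network layer can be swapped out, for example in tests.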

Output Format

CSV file with columns:

  • ID: Sequential package identifier
  • Platform: "PyPI"
  • Name: Package name
  • Homepage URL: Package homepage URL (from package metadata)
  • Repository URL: Source code repository URL (extracted from project_urls or home_page)
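
As an illustration of the column layout above, the following sketch writes two rows with Python's csv module. The helper name write_rows, the sample packages, and the choice to start IDs at 1 are assumptions made for the example; the actual script may differ.

```python
import csv
import io

COLUMNS = ["ID", "Platform", "Name", "Homepage URL", "Repository URL"]


def write_rows(packages, out):
    """Write (name, homepage, repository) tuples as CSV rows with a header."""
    writer = csv.writer(out)
    writer.writerow(COLUMNS)
    # Assumption: sequential IDs start at 1.
    for package_id, (name, homepage, repo) in enumerate(packages, start=1):
        # Missing URLs are written as the string "nan".
        writer.writerow([package_id, "PyPI", name, homepage or "nan", repo or "nan"])


buffer = io.StringIO()
write_rows(
    [
        ("requests", "https://requests.readthedocs.io", "https://github.com/psf/requests"),
        ("example-pkg", None, None),  # hypothetical package with no URLs
    ],
    buffer,
)
print(buffer.getvalue())
```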

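Data Source

  • Package list: PyPI simple index (https://pypi.org/simple/)
  • Package metadata: PyPI JSON API (https://pypi.org/pypi/<package>/json)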

Repository URL Detection

The script checks the following sources in order and uses the first repository URL it finds:

  1. project_urls field with keys: Source, Source Code, Repository, Code, GitHub, GitLab
  2. home_page field if it contains github.com, gitlab.com, or bitbucket.org
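
The lookup order above could be implemented roughly as below. This is a sketch under stated assumptions: the function name find_repository_url is hypothetical, keys are matched exactly as listed (the real script may compare case-insensitively), and info stands for the "info" object returned by the PyPI JSON API.

```python
REPO_KEYS = ("Source", "Source Code", "Repository", "Code", "GitHub", "GitLab")
REPO_HOSTS = ("github.com", "gitlab.com", "bitbucket.org")


def find_repository_url(info):
    """Return the first repository URL found in package metadata, or None."""
    # 1. project_urls, checking the known keys in priority order.
    project_urls = info.get("project_urls") or {}
    for key in REPO_KEYS:
        url = project_urls.get(key)
        if url:
            return url
    # 2. home_page, only when it points at a known code host.
    home_page = info.get("home_page") or ""
    if any(host in home_page for host in REPO_HOSTS):
        return home_page
    return None


print(find_repository_url({
    "project_urls": {"Documentation": "https://example.org/docs",
                     "Repository": "https://gitlab.com/example/project"},
    "home_page": "https://example.org",
}))
```

A package whose project_urls is null and whose home_page is a plain website yields None, which the output stage can then record as missing.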

Performance

  • Expected runtime: 3-8 hours for ~500k packages
  • 50 parallel workers for API requests
  • Network-dependent (throughput is typically limited by API rate limits and network speed)
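
To put these figures in perspective, the short calculation below converts the quoted runtime range into a request rate and the average per-request latency it implies with 50 workers. The helper names are hypothetical and the numbers are back-of-the-envelope estimates, not measurements.

```python
def required_rate(packages, hours):
    """Average requests per second needed to finish in the given time."""
    return packages / (hours * 3600)


def implied_latency(workers, rate):
    """Average seconds per request if all workers stay busy."""
    return workers / rate


for hours in (3, 8):
    rate = required_rate(500_000, hours)
    latency = implied_latency(50, rate)
    print(f"{hours} h: {rate:.1f} requests/s, {latency:.2f} s per request")
```

Roughly 46 requests per second are needed for the 3-hour case and about 17 for the 8-hour case, which corresponds to an average of about 1 to 3 seconds per request across 50 workers.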

Notes

  • PyPI is continuously updated, so package counts may vary
  • Repository URLs are accepted only if they start with http:// or https://
  • Missing or invalid URLs are written as "nan"
  • The script handles API errors gracefully and continues processing
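
The URL validation and "nan" marking described in these notes might look like the sketch below. The function name normalize_url is hypothetical, and stripping whitespace and ignoring case in the scheme are assumptions that go beyond what the notes state.

```python
def normalize_url(value):
    """Return the URL if it starts with http:// or https://, else "nan"."""
    if isinstance(value, str):
        url = value.strip()
        # str.startswith accepts a tuple of prefixes.
        if url.lower().startswith(("http://", "https://")):
            return url
    return "nan"


for candidate in ("https://github.com/psf/requests", "git@github.com:a/b.git", None, ""):
    print(repr(candidate), "->", normalize_url(candidate))
```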
