
Mine and extract the complete package list from the PyPI registry


PyPI Package Miner

This tool mines the PyPI (Python Package Index) to collect information about all Python packages.

Features

  • Fetches the complete list of PyPI packages from the official simple API
  • Retrieves package metadata, including homepage and repository URLs, via the PyPI JSON API
  • Parallel processing with 50 workers for efficient data collection
  • Extracts repository URLs by checking multiple metadata fields in a fixed priority order
  • Progress tracking with visual feedback
  • Outputs to CSV format compatible with cross-ecosystem analysis

Setup

Run the setup script

chmod +x setup.sh
./setup.sh

This will:

  • Create a virtual environment
  • Install required dependencies (requests, tqdm)

Manual Setup (Alternative)

# Create virtual environment
python3 -m venv .venv

# Activate virtual environment
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Usage

source .venv/bin/activate
python mine_pypi.py

The script will:

  1. Download the complete list of package names from the PyPI simple index (~500k packages)
  2. Fetch detailed metadata for each package in parallel
  3. Save results to ../../../Resource/Package/Package-List/PyPI.csv
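
The first two steps above can be sketched roughly as follows. This is a minimal illustration, not the code in mine_pypi.py: it uses only the standard library (the tool itself depends on requests and tqdm), and the function names parse_simple_index, fetch_json, fetch_package_names, fetch_metadata, and fetch_all are hypothetical. It assumes the simple index is requested in its JSON form and that metadata comes from the per-package JSON endpoint.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

SIMPLE_INDEX_URL = "https://pypi.org/simple/"
JSON_API_URL = "https://pypi.org/pypi/{name}/json"


def parse_simple_index(payload):
    """Return project names from a JSON simple-index response."""
    return [project["name"] for project in payload.get("projects", [])]


def fetch_json(url, accept="application/json", timeout=30):
    """GET a URL and decode its JSON body (performs a network request)."""
    request = urllib.request.Request(url, headers={"Accept": accept})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def fetch_package_names():
    """Step 1: download the list of package names from the simple index."""
    payload = fetch_json(
        SIMPLE_INDEX_URL, accept="application/vnd.pypi.simple.v1+json"
    )
    return parse_simple_index(payload)


def fetch_metadata(name):
    """Step 2, one package: return the 'info' dict, or None on any error."""
    try:
        return fetch_json(JSON_API_URL.format(name=name)).get("info")
    except Exception:
        # Assumption: a failed request is skipped rather than retried.
        return None


def fetch_all(names, fetch_one=fetch_metadata, workers=50):
    """Step 2, all packages: fetch in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_one, names))
```

Passing fetch_one as a parameter keeps the parallel loop testable without network access, for example fetch_all(["a", "b"], fetch_one=str.upper).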

Output Format

CSV file with columns:

  • ID: Sequential package identifier
  • Platform: "PyPI"
  • Name: Package name
  • Homepage URL: Package homepage URL (from package metadata)
  • Repository URL: Source code repository URL (extracted from project_urls or home_page)
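
As an illustration of this layout, the sketch below writes two rows with the columns listed above. The write_rows helper and the sample packages are hypothetical, and numbering IDs from 1 is an assumption; the "nan" placeholder follows the convention described in the Notes section.

```python
import csv
import io

COLUMNS = ["ID", "Platform", "Name", "Homepage URL", "Repository URL"]


def write_rows(packages, stream):
    """Write (name, homepage, repository) tuples as CSV rows."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(COLUMNS)
    # Assumption: IDs are assigned sequentially starting at 1.
    for package_id, (name, homepage, repository) in enumerate(packages, start=1):
        writer.writerow(
            [package_id, "PyPI", name, homepage or "nan", repository or "nan"]
        )


buffer = io.StringIO()
write_rows(
    [
        ("requests", "https://requests.readthedocs.io", "https://github.com/psf/requests"),
        ("example-pkg", None, None),  # hypothetical package with no URLs
    ],
    buffer,
)
print(buffer.getvalue())
```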

Data Source

Repository URL Detection

The script searches for a repository URL in the following order and uses the first match:

  1. project_urls field with keys: Source, Source Code, Repository, Code, GitHub, GitLab
  2. home_page field if it contains github.com, gitlab.com, or bitbucket.org
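
The lookup order above could be implemented along these lines. This is a sketch under stated assumptions, not the script's actual code: find_repository_url is a hypothetical name, the input is assumed to be the info dictionary from the PyPI JSON API, and key matching is assumed to be exact and case-sensitive.

```python
REPO_KEYS = ("Source", "Source Code", "Repository", "Code", "GitHub", "GitLab")
REPO_HOSTS = ("github.com", "gitlab.com", "bitbucket.org")


def find_repository_url(info):
    """Return the first repository URL found in package metadata, or None."""
    # 1. Check project_urls keys in priority order (the field may be null).
    project_urls = info.get("project_urls") or {}
    for key in REPO_KEYS:
        url = project_urls.get(key)
        if url:
            return url
    # 2. Fall back to home_page if it points at a known code host.
    home_page = info.get("home_page") or ""
    if any(host in home_page for host in REPO_HOSTS):
        return home_page
    return None


print(find_repository_url({
    "project_urls": {"Homepage": "https://example.org", "Source": "https://github.com/a/b"},
    "home_page": "https://example.org",
}))
```

Iterating over REPO_KEYS rather than over the dictionary makes the priority explicit: a Source entry wins over a GitHub entry regardless of the order in which the package author listed them.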

Performance

  • Expected runtime: 3-8 hours for ~500k packages
  • 50 parallel workers for API requests
  • Network-dependent (typically limited by API rate and network speed)
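
To put these figures in perspective, the short calculation below derives the throughput implied by the stated runtime range. It assumes exactly 500,000 packages, one request per package, and 50 workers that stay busy for the whole run; implied_rates is a hypothetical helper, and real runs will vary with network conditions.

```python
PACKAGES = 500_000
WORKERS = 50


def implied_rates(hours, packages=PACKAGES, workers=WORKERS):
    """Return (overall requests/s, seconds per request per worker)."""
    seconds = hours * 3600
    overall_rate = packages / seconds
    seconds_per_request = workers * seconds / packages
    return overall_rate, seconds_per_request


for hours in (3, 8):
    rate, latency = implied_rates(hours)
    print(f"{hours} h: {rate:.1f} requests/s overall, {latency:.2f} s per request per worker")
```

Under these assumptions, the 3-hour end of the range corresponds to about 46 requests per second overall, and the 8-hour end to about 17.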

Notes

  • PyPI is continuously updated, so package counts may vary
  • Repository URLs are validated to start with http/https
  • Missing or invalid URLs are marked as "nan"
  • The script handles API errors gracefully and continues processing
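
The URL validation and "nan" convention described above might look like the following. This is an illustrative sketch: normalize_url is a hypothetical name, and both the whitespace trimming and the requirement of a full http:// or https:// scheme prefix are assumptions about how the check is applied.

```python
def normalize_url(value):
    """Return the URL if it starts with http:// or https://, else "nan"."""
    if isinstance(value, str):
        url = value.strip()  # assumption: surrounding whitespace is ignored
        if url.startswith(("http://", "https://")):
            return url
    # Missing values and non-HTTP(S) strings share the same placeholder.
    return "nan"


print(normalize_url("https://github.com/a/b"))
print(normalize_url("git@github.com:a/b.git"))
print(normalize_url(None))
```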
