Mine and extract complete package lists from Packagist/Composer registry

PHP/Packagist Miner

A Python tool to mine and extract complete package lists from the Packagist (Composer) registry.

Features

  • Downloads the list of all ~400,000 PHP packages on Packagist.org
  • Fetches package metadata including homepage and repository URLs
  • Rate-limited API calls to respect server resources (20 req/sec)
  • Progress tracking with visual feedback
  • Outputs standardized CSV format for cross-ecosystem analysis

Installation

pip install php-miner

Quick Start

php-miner

Or use as a Python module:

from php_miner import mine_php
mine_php()

Output

Generates a CSV file with package information:

  • Package ID, Platform, Name (vendor/package format)
  • Homepage URL, Repository URL

Performance

  • Runtime: 5-6 hours for complete dataset
  • Rate limit: 20 requests per second
  • Processes ~400,000 packages
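
The runtime figure follows from the rate limit. As a rough check, assuming exactly 400,000 packages and counting only the 0.05 second delay per request (network latency adds to this), the arithmetic can be sketched as follows. The function name is illustrative and not part of the package.

```python
def estimated_runtime_hours(package_count: int, delay_seconds: float) -> float:
    """Lower bound on runtime: one fixed delay per package, latency ignored."""
    return package_count * delay_seconds / 3600


# 400,000 packages * 0.05 s = 20,000 s, which is a little over five and a half hours
hours = estimated_runtime_hours(400_000, 0.05)
print(f"{hours:.2f} hours")
```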

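Data Source

  • Package list: https://packagist.org/packages/list.json
  • Per-package metadata: https://packagist.org/packages/{vendor}/{package}.json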

License

MIT License - see LICENSE file for details

Package Naming Convention

PHP packages follow the vendor/package naming pattern:

  • symfony/console
  • laravel/framework
  • doctrine/orm

This two-part naming helps prevent conflicts and organize packages by maintainer.
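
A small sketch of how a vendor/package name could be split for grouping by vendor. The helper name is hypothetical and the script itself may not do this.

```python
def split_package_name(full_name: str) -> tuple[str, str]:
    """Split 'vendor/package' into (vendor, package); reject other shapes."""
    vendor, sep, package = full_name.partition("/")
    if not sep or not vendor or not package or "/" in package:
        raise ValueError(f"expected vendor/package, got {full_name!r}")
    return vendor, package


print(split_package_name("symfony/console"))
```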

Package Metadata Sources

For each package, the script fetches:

{
  "package": {
    "name": "symfony/console",
    "homepage": "https://symfony.com",
    "repository": "https://github.com/symfony/symfony",
    "versions": {
      "dev-master": {
        "source": {
          "url": "https://github.com/symfony/symfony.git",
          "type": "git"
        }
      }
    }
  }
}

The script prioritizes:

  1. Homepage: homepage field → "nan"
  2. Repository: repository field → version source URL → "nan"

Repository URL Extraction

The script tries multiple strategies to find repository URLs:

  1. Direct Repository Field: Uses package.repository if available
  2. Version Source: Checks dev-master, dev-main, master, main branches
  3. First Version: Falls back to first available version's source URL
  4. Validation: Ensures URLs start with http or https
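
The four strategies above can be sketched as one pure function. This is a minimal illustration, not the script's actual code: the function and constant names are assumptions, and it assumes the first non-empty candidate is taken and then validated, which may differ from how mine_php.py orders these steps.

```python
MISSING = "nan"  # the marker this document uses for missing or invalid values
PREFERRED_VERSIONS = ("dev-master", "dev-main", "master", "main")


def extract_repo_url(package: dict) -> str:
    """Return a repository URL for one "package" object, or "nan"."""
    candidates = [package.get("repository")]

    versions = package.get("versions")
    if not isinstance(versions, dict):
        versions = {}

    # Preferred branches first, then whichever version is listed first.
    keys = [key for key in PREFERRED_VERSIONS if key in versions]
    if versions:
        keys.append(next(iter(versions)))
    for key in keys:
        source = versions[key].get("source") or {}
        candidates.append(source.get("url"))

    # Take the first non-empty candidate, then validate it.
    for url in candidates:
        if url:
            if isinstance(url, str) and url.startswith("http"):
                return url
            return MISSING
    return MISSING
```

Calling it on the "package" object from the sample response above would return the direct repository field, since that field is present and starts with https.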

Error Handling

If an API call fails (timeout, 404, etc.):

  • Writes "nan" for the missing fields and continues with the next package
  • Does not log the error (failures are silent)
  • Keeps one row per package, so the dataset stays complete even when some fields are missing
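
A minimal sketch of this behavior, assuming a caller-supplied `fetch_json` function that performs the HTTP request and returns the decoded JSON. Both `fetch_row` and `fetch_json` are hypothetical names, and the row layout is simplified to three columns.

```python
from typing import Callable

MISSING = "nan"


def fetch_row(package_name: str, fetch_json: Callable[[str], dict]) -> list[str]:
    """Build one output row; any failure yields "nan" values instead of raising."""
    url = f"https://packagist.org/packages/{package_name}.json"
    try:
        package = fetch_json(url).get("package", {})
        homepage = package.get("homepage") or MISSING
        repository = package.get("repository") or MISSING
    except Exception:
        # Mirrors the silent-failure policy described above: no log, no retry.
        return [package_name, MISSING, MISSING]
    return [package_name, homepage, repository]
```

Passing the fetcher in as an argument keeps the error-handling logic testable without network access. A real caller might wrap something like `requests.get(url, timeout=30).json()`.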

Files

  • mine_php.py: Main script
  • requirements.txt: Python dependencies (requests, tqdm)
  • setup.sh: Automated setup script
  • Output: ../../../Resource/Package/Package-List/PHP_New.csv

Troubleshooting

"Error downloading package list"

Check that:

  • You have internet connectivity
  • packagist.org is accessible: curl -I https://packagist.org/packages/list.json
  • No firewall blocking the connection

Script is very slow

This is expected behavior:

  • Rate limiting (20 requests/second) is intentional
  • With 400K+ packages, expect 5-6 hours runtime
  • Consider running overnight or in background

To run in background:

nohup python mine_php.py > output.log 2>&1 &

"Error parsing JSON"

This can occur if:

  • Packagist API response format changed
  • Network corruption during download
  • Server returned error page instead of JSON

Solution: Check internet connection and try again.

"Permission denied" when creating output directory

Ensure you have write permissions to:

  • Current directory (for temporary files)
  • Resource/Package/Package-List/ (for output)

Incomplete data (many "nan" values)

This can occur if:

  • API is temporarily unavailable
  • Network issues during processing
  • Some packages have incomplete metadata

Note: This is normal - not all packages have complete metadata on Packagist.org.
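
To judge whether the share of "nan" values is unusually high, the output can be inspected after a run. The sketch below assumes the CSV has a header row with the column names shown. Those names and the acme/demo row are made up for illustration, so adjust them to the actual file.

```python
import csv
import io


def missing_share(csv_text: str, column: str) -> float:
    """Fraction of rows whose value in `column` is the "nan" marker."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return 0.0
    return sum(row[column] == "nan" for row in rows) / len(rows)


# Two illustrative rows; the second one is invented for this example.
sample = (
    "Name,Homepage URL,Repository URL\n"
    "symfony/console,https://symfony.com,https://github.com/symfony/symfony\n"
    "acme/demo,nan,https://github.com/acme/demo\n"
)
print(missing_share(sample, "Homepage URL"))
```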

Virtual environment issues

If you encounter errors related to the virtual environment:

  1. Delete the venv folder: rm -rf venv
  2. Re-run the setup script: ./setup.sh
  3. If you moved the project directory, recreate the environment: virtual environments cannot be moved after creation

Performance Notes

  • Download Time: Fast (package list is relatively small JSON)
  • Processing Time: SLOW (~5-6 hours for 400K+ packages)
  • Memory Usage: Low (processes one package at a time)
  • Network Usage: Moderate (many small API requests)

Optimization Tips

To speed up processing (advanced users):

  1. Reduce delay in time.sleep(0.05) (risks being rate-limited or blocked)
  2. Use parallel requests (requires code modification)
  3. Use Packagist metadata dump if available (check Packagist documentation)

Advantages

  • Complete Data: Includes all public packages
  • Official API: Uses Packagist.org official endpoints
  • Detailed Metadata: Packagist exposes homepage and repository URLs for each package
  • Reliable: Gracefully handles API failures

Limitations

  • Slow Processing: Rate limiting means long runtime
  • API Dependent: Requires Packagist.org to be available
  • Incomplete Metadata: Not all packages have homepage/repository info
  • No Version Info: Only captures general package information

Alternative Approaches

For faster processing, consider:

  1. Packagist Dump: Check if Packagist provides database dumps
  2. Metadata Files: Some registries provide metadata files
  3. Cached Data: Use previously downloaded data and update incrementally
  4. Parallel Processing: Use async requests or multiprocessing (advanced)
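
For the parallel option, one possible shape is a small thread pool. This is a sketch under stated assumptions rather than something the package provides: `fetch_all` and `fetch_one` are hypothetical names, `fetch_one` stands in for whatever function fetches a single package, and no rate limiting is applied here, so a larger pool raises the request rate and the risk of being blocked.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def fetch_all(names: Iterable[str], fetch_one: Callable[[str], dict],
              max_workers: int = 4) -> list[dict]:
    """Run fetch_one over names on a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the order of the inputs.
        return list(pool.map(fetch_one, names))
```

Note that this version collects all results in memory before returning, which trades away the low memory use of writing each row as it arrives.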

Code Explanation

Architecture

The PHP Miner uses a two-phase approach:

  1. Phase 1: Download complete list of package names
  2. Phase 2: Fetch detailed metadata for each package

This mirrors the approach used by the Ruby miner, as both ecosystems provide similar API structures.

1. Package List Download

import requests

packages_url = "https://packagist.org/packages/list.json"
response = requests.get(packages_url, timeout=30)
data = response.json()
package_names = data.get('packageNames', [])

Format: JSON object whose packageNames key holds an array of package names.

{
  "packageNames": [
    "symfony/console",
    "laravel/framework",
    "guzzlehttp/guzzle",
    ...
  ]
}

Advantages:

  • Fast download (single request)
  • Complete list
  • Simple JSON parsing

2. Package Metadata Fetching

import time
import requests

for package_name in package_names:
    time.sleep(0.05)  # Rate limiting: at most 20 req/sec
    response = requests.get(f"https://packagist.org/packages/{package_name}.json", timeout=30)

Process:

  1. Iterate through each package name
  2. Wait 0.05 seconds (rate limiting)
  3. Fetch JSON metadata
  4. Extract homepage and repository URLs
  5. Write to CSV immediately (streaming write)
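
Steps 1 and 2 can be isolated in a small generator so that the delay is easy to change or test. This is an illustrative sketch: `throttled` is a hypothetical helper, and the `sleep` parameter exists only so the pause can be replaced in tests.

```python
import time
from typing import Callable, Iterable, Iterator


def throttled(items: Iterable[str], delay: float = 0.05,
              sleep: Callable[[float], None] = time.sleep) -> Iterator[str]:
    """Yield items one by one, pausing `delay` seconds before each."""
    for item in items:
        sleep(delay)
        yield item
```

A loop such as `for package_name in throttled(package_names):` would then issue at most one request per 0.05 seconds, with the actual rate lower once request latency is included.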

3. Complex Repository URL Extraction

# package_data is the "package" object of the API response
# Try direct repository field
repo_url = package_data.get('repository', '')

# Fall back to version sources, in priority order
if not repo_url:
    versions = package_data.get('versions', {})
    for version_key in ['dev-master', 'dev-main', 'master', 'main']:
        if version_key in versions:
            source = versions[version_key].get('source', {})
            repo_url = source.get('url', '')
            if repo_url:
                break

Strategy: Multiple fallback levels.

Why Complex: Packagist stores repository info in multiple places:

  • Direct repository field (not always present)
  • Version-specific source URLs (most reliable)
  • Different branch naming conventions (master vs main)

4. Version Priority

for version_key in ['dev-master', 'dev-main', 'master', 'main']:

Priority Order:

  1. dev-master (most common development branch)
  2. dev-main (newer naming convention)
  3. master (same branch name without the dev- prefix)
  4. main (same branch name without the dev- prefix)

Fallback: If none found, use first available version.

5. URL Validation

if homepage_url and not homepage_url.startswith('http'):
    homepage_url = "nan"
if repo_url and not repo_url.startswith('http'):
    repo_url = "nan"

Purpose: Filter out invalid URLs.

  • Some packages have placeholder text instead of URLs
  • Ensures data quality
  • "nan" represents missing/invalid data
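
The prefix check above also accepts strings such as "httpfoo". A stricter variant, offered as a sketch and not as what the script does, parses the value and requires an http or https scheme plus a host. `clean_url` is a hypothetical name.

```python
from urllib.parse import urlparse


def clean_url(value) -> str:
    """Return value if it is an absolute http(s) URL, else "nan"."""
    if not isinstance(value, str):
        return "nan"
    candidate = value.strip()
    try:
        parsed = urlparse(candidate)
    except ValueError:
        # urlparse can reject malformed input such as an unclosed IPv6 bracket.
        return "nan"
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return candidate
    return "nan"
```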

6. Streaming Write

import csv

with open(output_file, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for package_name in package_names:
        # Fetch data for this package
        writer.writerow([...])  # Write the row immediately

Advantages:

  • Low memory usage (doesn't store all data in memory)
  • Rows already written are kept if the script crashes (apart from any still in the file buffer)
  • Can resume partially completed runs (with modification)
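
One way the resume modification could look, as a sketch only: read the names already present in the partial CSV and skip them on the next run. The function names are hypothetical, and `name_column=0` is an assumption about where the package name sits, so set it to match the real column order.

```python
import csv
import os


def already_done(output_file: str, name_column: int = 0) -> set[str]:
    """Names present in a partially written CSV (empty set if no file yet)."""
    if not os.path.exists(output_file):
        return set()
    with open(output_file, newline="", encoding="utf-8") as f:
        return {row[name_column] for row in csv.reader(f) if len(row) > name_column}


def remaining(package_names: list[str], output_file: str,
              name_column: int = 0) -> list[str]:
    """Package names that still need to be fetched, in their original order."""
    done = already_done(output_file, name_column)
    return [name for name in package_names if name not in done]
```

A resumed run would then open the output file in append mode ("a") instead of "w". A row cut off mid-write by a crash may still need to be removed by hand.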

Data Quality Notes

Repository URL Accuracy

Packagist repository URLs are generally high quality because:

  • Packages are registered on Packagist by submitting a repository URL, so source information is normally present
  • Most packages are hosted on GitHub
  • Package authors maintain metadata actively

Missing Data Patterns

Common reasons for "nan" values:

  1. Abandoned Packages: No longer maintained, incomplete metadata
  2. Private Packages: Listed but not publicly accessible
  3. Vanity URLs: Homepage set to packagist.org page itself
  4. Legacy Packages: Created before Packagist required full metadata
