Mine and extract complete package lists from Packagist/Composer registry

PHP/Packagist Miner

A Python tool to mine and extract complete package lists from the Packagist (Composer) registry.

Features

Downloads the names of all ~400,000 PHP packages listed on Packagist.org
Fetches package metadata including homepage and repository URLs
Rate-limited API calls to respect server resources (20 req/sec)
Progress tracking with visual feedback
Outputs standardized CSV format for cross-ecosystem analysis

Installation

pip install php-miner

Quick Start

php-miner

Or use as a Python module:

from php_miner import mine_php
mine_php()

Output

Generates a CSV file with package information:

Package ID, Platform, Name (vendor/package format)
Homepage URL, Repository URL
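As a rough illustration of how the output might be consumed, the sketch below counts rows whose repository column holds the "nan" placeholder. The column order is assumed from the list above, the absence of a header row is an assumption, and the sample rows (including the "PHP" platform value and the `example/unknown` package) are made up for the example.

```python
import csv
import io

# Assumed column order, taken from the list above; the real file may differ
# and may or may not start with a header row.
COLUMNS = ["id", "platform", "name", "homepage", "repository"]


def count_missing_repositories(csv_text: str) -> int:
    """Count rows whose repository column holds the "nan" placeholder."""
    reader = csv.DictReader(io.StringIO(csv_text), fieldnames=COLUMNS)
    return sum(1 for row in reader if row["repository"] == "nan")


# Hypothetical sample rows, not real output from the tool.
sample = (
    "1,PHP,symfony/console,https://symfony.com,https://github.com/symfony/console\n"
    "2,PHP,example/unknown,nan,nan\n"
)
print(count_missing_repositories(sample))  # prints 1
```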

Performance

Runtime: 5-6 hours for complete dataset
Rate limit: 20 requests per second
Processes ~400,000 packages
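The runtime figure follows from the other two numbers. A back-of-the-envelope check, which counts only the time spent waiting on the rate limiter and ignores request latency:

```python
# Figures quoted above.
PACKAGE_COUNT = 400_000      # approximate number of packages
REQUESTS_PER_SECOND = 20     # documented rate limit


def estimate_hours(package_count: int, requests_per_second: float) -> float:
    """Time spent waiting on the rate limiter alone, in hours."""
    return package_count / requests_per_second / 3600


if __name__ == "__main__":
    # 400,000 / 20 = 20,000 seconds, which is about 5.6 hours.
    print(f"{estimate_hours(PACKAGE_COUNT, REQUESTS_PER_SECOND):.1f} hours")
```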

Data Source

Packagist Package List: https://packagist.org/packages/list.json
Package Details: https://packagist.org/packages/{vendor}/{package}.json

License

MIT License - see LICENSE file for details

Package Naming Convention

PHP packages follow the vendor/package naming pattern:

symfony/console
laravel/framework
doctrine/orm

This two-part naming helps prevent conflicts and organize packages by maintainer.
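For analysis it is often handy to split a name into its two parts. The helper names below (`split_name`, `packages_per_vendor`) are hypothetical and not part of php-miner; this is only a sketch of how the convention can be used.

```python
from collections import Counter
from typing import Iterable, Tuple


def split_name(full_name: str) -> Tuple[str, str]:
    """Split "vendor/package" into its two parts."""
    vendor, sep, package = full_name.partition("/")
    if not sep or not vendor or not package:
        raise ValueError(f"not a vendor/package name: {full_name!r}")
    return vendor, package


def packages_per_vendor(names: Iterable[str]) -> Counter:
    """Count how many packages each vendor publishes."""
    return Counter(split_name(name)[0] for name in names)


names = ["symfony/console", "symfony/http-kernel", "laravel/framework", "doctrine/orm"]
print(packages_per_vendor(names).most_common(1))  # [('symfony', 2)]
```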

Package Metadata Sources

For each package, the script fetches:

{
  "package": {
    "name": "symfony/console",
    "homepage": "https://symfony.com",
    "repository": "https://github.com/symfony/symfony",
    "versions": {
      "dev-master": {
        "source": {
          "url": "https://github.com/symfony/symfony.git",
          "type": "git"
        }
      }
    }
  }
}

The script prioritizes:

Homepage: homepage field → "nan"
Repository: repository field → version source URL → "nan"

Repository URL Extraction

The script tries multiple strategies to find repository URLs:

Direct Repository Field: Uses package.repository if available
Version Source: Checks dev-master, dev-main, master, main branches
First Version: Falls back to first available version's source URL
Validation: Ensures URLs start with http or https
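The four strategies above can be sketched as a single function. This is an illustration under assumptions, not the script's actual code: the function name `extract_repo_url` is hypothetical, and the input is assumed to be the dictionary found under the "package" key of the API response.

```python
from typing import Any, Dict

# Branch names checked first, in the order listed above.
PREFERRED_VERSIONS = ("dev-master", "dev-main", "master", "main")


def extract_repo_url(package_data: Dict[str, Any]) -> str:
    """Return the first usable repository URL, or "nan" if none is found."""
    # Strategy 1: the direct repository field.
    candidates = [package_data.get("repository") or ""]

    # Strategy 2: preferred branches; strategy 3: first available version.
    versions = package_data.get("versions") or {}
    keys = [key for key in PREFERRED_VERSIONS if key in versions]
    keys.extend(list(versions)[:1])
    for key in keys:
        source = versions[key].get("source") or {}
        candidates.append(source.get("url") or "")

    # Strategy 4: accept only http(s) URLs.
    for url in candidates:
        if url.startswith(("http://", "https://")):
            return url
    return "nan"
```

Collecting candidates first and validating afterwards means a non-HTTP value in the repository field (for example an SSH-style URL) does not block the later fallbacks.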

Error Handling

If an API call fails (timeout, 404, etc.):

Continues processing with "nan" values
Logs no error (fails silently)
Every package still gets a row, so the dataset stays complete even when some fields are missing
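A minimal sketch of this fail-silently behaviour follows. It uses `urllib` from the standard library instead of `requests` so that it runs without dependencies, and the names `fetch_json` and `safe_lookup` are hypothetical; the real script may be structured differently.

```python
import json
from typing import Callable, Tuple
from urllib.request import urlopen


def fetch_json(url: str) -> dict:
    """Download and decode a JSON document (raises on timeouts, 404s, bad JSON)."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def safe_lookup(name: str, fetch: Callable[[str], dict] = fetch_json) -> Tuple[str, str]:
    """Return (homepage, repository), falling back to "nan" on any failure."""
    url = f"https://packagist.org/packages/{name}.json"
    try:
        package = fetch(url).get("package", {})
        homepage = package.get("homepage") or "nan"
        repository = package.get("repository") or "nan"
    except Exception:
        # Fail silently, as described above: no log entry, just placeholders.
        return "nan", "nan"
    return homepage, repository
```

Passing the fetch function as a parameter keeps the error path easy to exercise without network access.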

Files

mine_php.py: Main script
requirements.txt: Python dependencies (requests, tqdm)
setup.sh: Automated setup script
Output: ../../../Resource/Package/Package-List/PHP_New.csv

Troubleshooting

"Error downloading package list"

Check that:

You have internet connectivity
packagist.org is accessible: curl -I https://packagist.org/packages/list.json
No firewall is blocking the connection

Script is very slow

This is expected behavior:

Rate limiting (20 requests/second) is intentional
With 400K+ packages, expect 5-6 hours runtime
Consider running overnight or in background

To run in background:

nohup python mine_php.py > output.log 2>&1 &

"Error parsing JSON"

This can occur if:

Packagist API response format changed
Network corruption during download
Server returned error page instead of JSON

Solution: Check internet connection and try again.

"Permission denied" when creating output directory

Ensure you have write permissions to:

Current directory (for temporary files)
Resource/Package/Package-List/ (for output)

Incomplete data (many "nan" values)

This can occur if:

API is temporarily unavailable
Network issues during processing
Some packages have incomplete metadata

Note: This is normal - not all packages have complete metadata on Packagist.org.

Virtual environment issues

If you encounter errors related to the virtual environment:

Delete the venv folder: rm -rf venv
Re-run the setup script: ./setup.sh
Virtual environments cannot be moved after creation - recreate if you move the directory

Performance Notes

Download Time: Fast (package list is relatively small JSON)
Processing Time: SLOW (~5-6 hours for 400K+ packages)
Memory Usage: Low (processes one package at a time)
Network Usage: Moderate (many small API requests)

Optimization Tips

To speed up processing (advanced users):

Reduce delay in time.sleep(0.05) (risks being rate-limited or blocked)
Use parallel requests (requires code modification)
Use Packagist metadata dump if available (check Packagist documentation)
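As an illustration of the parallel-requests idea, the sketch below fans lookups out over a small thread pool. It is not part of php-miner: `fetch_all`, `fetch_one`, and `max_workers` are hypothetical names, and the sketch does no rate limiting of its own, so a real version would need to add throttling to stay within Packagist's limits.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def fetch_all(
    names: Iterable[str],
    fetch_one: Callable[[str], T],
    max_workers: int = 4,
) -> List[T]:
    """Run fetch_one for every name on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(fetch_one, names))
```

Threads suit this workload because each lookup spends most of its time waiting on the network rather than using the CPU.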

Advantages

Complete Data: Includes all public packages
Official API: Uses Packagist.org official endpoints
Detailed Metadata: Gets homepage and repository URLs
Reliable: Gracefully handles API failures
Rich Metadata: Packagist provides comprehensive package information

Limitations

Slow Processing: Rate limiting means long runtime
API Dependent: Requires Packagist.org to be available
Incomplete Metadata: Not all packages have homepage/repository info
No Version Info: Only captures general package information

Alternative Approaches

For faster processing, consider:

Packagist Dump: Check if Packagist provides database dumps
Metadata Files: Some registries provide metadata files
Cached Data: Use previously downloaded data and update incrementally
Parallel Processing: Use async requests or multiprocessing (advanced)

Code Explanation

Architecture

The PHP Miner uses a two-phase approach:

Phase 1: Download complete list of package names
Phase 2: Fetch detailed metadata for each package

This mirrors the approach used by the Ruby miner, as both ecosystems provide similar API structures.

1. Package List Download

import requests

packages_url = "https://packagist.org/packages/list.json"
response = requests.get(packages_url)  # Single request for the full list
data = response.json()
package_names = data.get('packageNames', [])

Format: a JSON object whose packageNames key holds an array of package names.

{
  "packageNames": [
    "symfony/console",
    "laravel/framework",
    "guzzlehttp/guzzle",
    ...
  ]
}

Advantages:

Fast download (single request)
Complete list
Simple JSON parsing

2. Package Metadata Fetching

import time
import requests

for package_name in package_names:
    time.sleep(0.05)  # Rate limiting: at most 20 requests per second
    response = requests.get(f"https://packagist.org/packages/{package_name}.json")

Process:

Iterate through each package name
Wait 0.05 seconds (rate limiting)
Fetch JSON metadata
Extract homepage and repository URLs
Write to CSV immediately (streaming write)

3. Complex Repository URL Extraction

# Try the direct repository field first
repo_url = package_data.get('repository', '')

# Otherwise fall back to version sources, in priority order
versions = package_data.get('versions', {})
for version_key in ['dev-master', 'dev-main', 'master', 'main']:
    if repo_url:
        break  # Stop at the first URL found
    if version_key in versions:
        source = versions[version_key].get('source', {})
        repo_url = source.get('url', '')

Strategy: Multiple fallback levels.

Why Complex: Packagist stores repository info in multiple places:

Direct repository field (not always present)
Version-specific source URLs (most reliable)
Different branch naming conventions (master vs main)

4. Version Priority

for version_key in ['dev-master', 'dev-main', 'master', 'main']:

Priority Order:

dev-master (most common development branch)
dev-main (newer naming convention)
master (tagged version)
main (tagged version)

Fallback: If none found, use first available version.

5. URL Validation

if homepage_url and not homepage_url.startswith('http'):
    homepage_url = "nan"
if repo_url and not repo_url.startswith('http'):
    repo_url = "nan"

Purpose: Filter out invalid URLs.

Some packages have placeholder text instead of URLs
Ensures data quality
"nan" represents missing/invalid data
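The prefix check above also accepts strings such as "httpfoo". A stricter variant, shown here only as a sketch and not as the script's behaviour, parses the value and requires both an http(s) scheme and a host. The function name `clean_url` is hypothetical.

```python
from urllib.parse import urlparse


def clean_url(value: object) -> str:
    """Return the URL if it is a well-formed http(s) URL, otherwise "nan"."""
    if not isinstance(value, str):
        return "nan"
    candidate = value.strip()
    parsed = urlparse(candidate)
    # Require an explicit http/https scheme and a non-empty host part.
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return candidate
    return "nan"
```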

6. Streaming Write

import csv

with open(output_file, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for package_name in package_names:
        # Fetch data for this package
        writer.writerow([...])  # Write each row as soon as it is fetched

Advantages:

Low memory usage (doesn't store all data in memory)
Progress saved even if script crashes
Can resume partially completed runs (with modification)
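One possible shape for the resume modification mentioned above: read the names already present in the output file, then append only the missing ones. Everything here is an assumption for illustration, including the function names, the position of the name column, the "PHP" platform value, and the sequential ID scheme.

```python
import csv
import os
from typing import Callable, Iterable, Set, Tuple


def already_processed(path: str, name_column: int = 2) -> Set[str]:
    """Collect package names already written to the CSV (empty if no file)."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as f:
        return {row[name_column] for row in csv.reader(f) if len(row) > name_column}


def append_remaining(
    path: str,
    package_names: Iterable[str],
    lookup: Callable[[str], Tuple[str, str]],
) -> int:
    """Append rows for packages not yet in the file; return how many were added."""
    done = already_processed(path)
    written = 0
    # Append mode keeps rows from an earlier, interrupted run.
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for name in package_names:
            if name in done:
                continue
            homepage, repository = lookup(name)
            row_id = len(done) + written + 1  # assumed sequential ID scheme
            writer.writerow([row_id, "PHP", name, homepage, repository])
            f.flush()  # push each row to the OS so a crash loses little
            written += 1
    return written
```

Calling `append_remaining` a second time with the same list does no further lookups, since every name is already in the file.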

Data Quality Notes

Repository URL Accuracy

Packagist repository URLs are generally high quality because:

Composer (PHP package manager) requires this information
Most packages are hosted on GitHub
Package authors maintain metadata actively

Missing Data Patterns

Common reasons for "nan" values:

Abandoned Packages: No longer maintained, incomplete metadata
Private Packages: Listed but not publicly accessible
Vanity URLs: Homepage set to packagist.org page itself
Legacy Packages: Created before Packagist required full metadata

Release history

1.0.4: Jan 19, 2026
1.0.3: Jan 19, 2026
1.0.2: Jan 18, 2026 (this version)
1.0.1: Jan 18, 2026
1.0.0: Jan 18, 2026

Download files

Source Distribution

php_miner-1.0.2.tar.gz (13.7 kB)

Uploaded Jan 18, 2026 Source

Built Distribution

php_miner-1.0.2-py3-none-any.whl (10.6 kB)

Uploaded Jan 18, 2026 Python 3

File details

Details for the file php_miner-1.0.2.tar.gz.

File metadata

Download URL: php_miner-1.0.2.tar.gz
Upload date: Jan 18, 2026
Size: 13.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for php_miner-1.0.2.tar.gz
Algorithm	Hash digest
SHA256	`a9bc349e21706751abd8ebe739f5fe37bbc84cb3ca9a6eb59c1bd4123d8ae701`
MD5	`7e35b1e1688b128ffc58c24b40f7b9fb`
BLAKE2b-256	`4690d084d6f2d3a13dee3f8029f3f4ce925bbe90e5af58dbe3a028b69dd5e6b8`

File details

Details for the file php_miner-1.0.2-py3-none-any.whl.

File metadata

Download URL: php_miner-1.0.2-py3-none-any.whl
Upload date: Jan 18, 2026
Size: 10.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for php_miner-1.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2be57a3ffe62c8e91fec84986599d84e801c8d9bffa09fa98454661dde732919`
MD5	`65dbf4a9020e1bc7d5aae3e571898f0e`
BLAKE2b-256	`158c00a0a88d48b5323da0de7664b163bd05d6e72ca8370acc495861055cc26a`
