Mine and extract complete package lists from Packagist/Composer registry

PHP/Packagist Miner

This tool downloads and processes the Packagist.org package list to extract PHP package information for cross-ecosystem analysis.

Features

Downloads package list from Packagist.org
Fetches detailed metadata via Packagist API
Extracts package metadata (ID, name, homepage, repository)
Formats data for cross-ecosystem package analysis
Progress tracking with visual feedback
Rate-limited API calls to respect server resources
Generates standardized CSV output compatible with Package-Filter

Setup

Run the setup script

chmod +x setup.sh
./setup.sh

This will:

Create a virtual environment
Install required dependencies (requests, tqdm)
Prepare the environment for mining

Manual Setup (Alternative)

If you prefer to set up manually:

# Create virtual environment
python3 -m venv venv

# Activate virtual environment
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Important: Virtual environments contain hardcoded paths and cannot be moved after creation. If you need to relocate this script:

Delete the venv folder
Recreate it in the new location
Reinstall the packages

Usage

Mine all PHP packages from Packagist.org:

source venv/bin/activate
python mine_php.py

Or run directly without activating:

venv/bin/python mine_php.py

The script will:

Download the list of all package names from Packagist.org
Fetch detailed information for each package via API
Generate CSV output in Resource/Package/Package-List/PHP_New.csv

Data Sources

Package Names: https://packagist.org/packages/list.json
Package Details: https://packagist.org/packages/{vendor}/{package}.json
Format: JSON

Output Format

The script generates PHP_New.csv in the Resource/Package/Package-List/ directory with the following structure:

ID,Platform,Name,Homepage URL,Repository URL
1,Packagist,symfony/symfony,https://symfony.com,https://github.com/symfony/symfony
2,Packagist,laravel/framework,https://laravel.com,https://github.com/laravel/framework
3,Packagist,guzzlehttp/guzzle,https://guzzlephp.org,https://github.com/guzzle/guzzle

Column Descriptions

ID: Sequential identifier (1, 2, 3, ...)
Platform: Always "Packagist" for PHP packages
Name: Package name as registered on Packagist.org (vendor/package format)
Homepage URL: Project homepage (from package metadata)
Repository URL: Source code repository URL

Note: This format is compatible with the Package-Filter tool for cross-ecosystem analysis.
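A downstream consumer might read the file along these lines. This is a minimal sketch rather than part of the miner: the helper name load_packages is hypothetical, the second sample row is made up for illustration, and the code assumes the header row and the "nan" placeholder described above.

```python
import csv
import io

def load_packages(csv_file):
    """Yield one dict per row, mapping the "nan" placeholder to None."""
    for row in csv.DictReader(csv_file):
        yield {key: (None if value == "nan" else value) for key, value in row.items()}

# Sample data in the documented format (second row is invented for illustration).
sample = io.StringIO(
    "ID,Platform,Name,Homepage URL,Repository URL\n"
    "1,Packagist,symfony/symfony,https://symfony.com,https://github.com/symfony/symfony\n"
    "2,Packagist,example/no-homepage,nan,https://github.com/example/no-homepage\n"
)
rows = list(load_packages(sample))
print(rows[1]["Homepage URL"])  # None
```

For the real output, pass a file opened with newline="" and encoding="utf-8" instead of the in-memory sample.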

Processing Details

API Rate Limiting

The script implements rate limiting to avoid overwhelming the Packagist API:

Rate: At most 20 requests per second (0.05 second delay between requests)
Purpose: Respectful API usage, avoiding server load
Impact: Processing time grows linearly with the number of packages

Estimated Time: With ~400,000 packages, the delay alone adds about 5.6 hours (400,000 × 0.05 s = 20,000 s). Expect ~5-6 hours total runtime, or longer when responses are slow.
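The arithmetic behind this estimate can be checked with a few lines. This is a rough sketch: the function name estimated_hours is hypothetical, and the 0.1 second latency in the second call is an illustrative assumption, not a measurement.

```python
def estimated_hours(package_count, delay_seconds=0.05, latency_seconds=0.0):
    """Runtime in hours for a sequential loop: one delay plus one request per package."""
    return package_count * (delay_seconds + latency_seconds) / 3600

# Delay only: 400,000 * 0.05 s = 20,000 s
print(round(estimated_hours(400_000), 2))  # 5.56

# Same loop if every request also took 0.1 s (assumed figure)
print(round(estimated_hours(400_000, latency_seconds=0.1), 2))  # 16.67
```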

Package Naming Convention

PHP packages follow the vendor/package naming pattern:

symfony/console
laravel/framework
doctrine/orm

This two-part naming helps prevent conflicts and organize packages by maintainer.
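Splitting a name into its two parts is straightforward. The helper below is a hypothetical illustration, not a function from the script; it only checks the vendor/package shape and does not enforce Packagist's full naming rules.

```python
def split_package_name(name):
    """Split 'vendor/package' into (vendor, package); raise ValueError otherwise."""
    vendor, separator, package = name.partition("/")
    if not separator or not vendor or not package or "/" in package:
        raise ValueError(f"not a vendor/package name: {name!r}")
    return vendor, package

print(split_package_name("symfony/console"))  # ('symfony', 'console')
```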

Package Metadata Sources

For each package, the script fetches:

{
  "package": {
    "name": "symfony/console",
    "homepage": "https://symfony.com",
    "repository": "https://github.com/symfony/symfony",
    "versions": {
      "dev-master": {
        "source": {
          "url": "https://github.com/symfony/symfony.git",
          "type": "git"
        }
      }
    }
  }
}

The script prioritizes:

Homepage: homepage field → "nan"
Repository: repository field → version source URL → "nan"

Repository URL Extraction

The script tries multiple strategies to find repository URLs:

Direct Repository Field: Uses package.repository if available
Version Source: Checks dev-master, dev-main, master, main branches
First Version: Falls back to first available version's source URL
Validation: Ensures URLs start with http or https
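The four strategies can be combined roughly as follows. This is a sketch under stated assumptions, not the code in mine_php.py: the names is_http_url, VERSION_PRIORITY and extract_repo_url are hypothetical, and moving on to the next candidate when one has no usable URL is a guess about the intended behavior.

```python
VERSION_PRIORITY = ["dev-master", "dev-main", "master", "main"]

def is_http_url(value):
    """True for strings that start with http:// or https://."""
    return isinstance(value, str) and value.startswith(("http://", "https://"))

def extract_repo_url(package_data):
    """Return a repository URL for one package dict, or "nan" if none is usable."""
    # Strategy 1: direct repository field
    repository = package_data.get("repository")
    if is_http_url(repository):
        return repository

    versions = package_data.get("versions") or {}
    # Strategy 2: preferred branches, in priority order
    candidates = [versions[key] for key in VERSION_PRIORITY if key in versions]
    # Strategy 3: first available version as a last resort
    if versions:
        candidates.append(next(iter(versions.values())))

    for version in candidates:
        source = version.get("source") or {}
        url = source.get("url")
        # Strategy 4: accept only http(s) URLs
        if is_http_url(url):
            return url
    return "nan"

package = {
    "name": "symfony/console",
    "versions": {"dev-master": {"source": {"url": "https://github.com/symfony/symfony.git", "type": "git"}}},
}
print(extract_repo_url(package))  # https://github.com/symfony/symfony.git
```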

Error Handling

If an API call fails (timeout, 404, etc.):

Writes the row with "nan" values and continues processing
Logs nothing, so failures are silent
Keeps one row per package, so the dataset stays complete even when some fields are missing
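The behavior described above could look like the following. This is a hedged sketch: fetch_or_nan and the fetch_json parameter are hypothetical names, and the fetcher is passed in so the example runs without network access.

```python
def fetch_or_nan(package_name, fetch_json):
    """Return (homepage, repository) for a package, or ("nan", "nan") on any failure.

    fetch_json is any callable that takes a URL and returns parsed JSON.
    """
    url = f"https://packagist.org/packages/{package_name}.json"
    try:
        package = fetch_json(url).get("package", {})
    except Exception:
        # Timeout, HTTP error, or bad JSON: no logging, just placeholder values.
        return "nan", "nan"
    return package.get("homepage") or "nan", package.get("repository") or "nan"

def failing_fetch(url):
    """Stand-in fetcher that simulates a network timeout."""
    raise TimeoutError("simulated timeout")

print(fetch_or_nan("symfony/console", failing_fetch))  # ('nan', 'nan')
```

With the requests library, a fetcher such as lambda url: requests.get(url, timeout=10).json() would fit this shape; the timeout value there is an assumption. Recording failed package names in a list would make the silent failures easier to retry later.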

Files

mine_php.py: Main script
requirements.txt: Python dependencies (requests, tqdm)
setup.sh: Automated setup script
Output: ../../../Resource/Package/Package-List/PHP_New.csv

Troubleshooting

"Error downloading package list"

Check that:

You have internet connectivity
packagist.org is accessible: curl -I https://packagist.org/packages/list.json
No firewall blocking the connection

Script is very slow

This is expected behavior:

Rate limiting (20 requests/second) is intentional
With 400K+ packages, expect 5-6 hours runtime
Consider running overnight or in background

To run in background:

nohup python mine_php.py > output.log 2>&1 &

"Error parsing JSON"

This can occur if:

Packagist API response format changed
Network corruption during download
Server returned error page instead of JSON

Solution: Check internet connection and try again.

"Permission denied" when creating output directory

Ensure you have write permissions to:

Current directory (for temporary files)
Resource/Package/Package-List/ (for output)

Incomplete data (many "nan" values)

This can occur if:

API is temporarily unavailable
Network issues during processing
Some packages have incomplete metadata

Note: This is normal - not all packages have complete metadata on Packagist.org.

Virtual environment issues

If you encounter errors related to the virtual environment:

Delete the venv folder: rm -rf venv
Re-run the setup script: ./setup.sh
Virtual environments cannot be moved after creation - recreate if you move the directory

Performance Notes

Download Time: Fast (package list is relatively small JSON)
Processing Time: Slow (~5-6 hours for 400K+ packages)
Memory Usage: Low (processes one package at a time)
Network Usage: Moderate (many small API requests)

Optimization Tips

To speed up processing (advanced users):

Reduce delay in time.sleep(0.05) (risks being rate-limited or blocked)
Use parallel requests (requires code modification)
Use Packagist metadata dump if available (check Packagist documentation)
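As one possible shape for the parallel option, the standard library's thread pool can run several lookups at once. This is a sketch only: fetch_all and fetch_one are hypothetical names, max_workers=4 is an arbitrary choice, and the code applies no rate limiting of its own, so it carries the same risk of being blocked as reducing the delay.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(package_names, fetch_one, max_workers=4):
    """Call fetch_one for every name concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, package_names))

def fake_fetch(name):
    """Stand-in for a real metadata lookup, so the sketch runs offline."""
    return (name, "nan", "nan")

results = fetch_all(["symfony/console", "doctrine/orm"], fake_fetch)
print(results[0])  # ('symfony/console', 'nan', 'nan')
```

Because the results come back in input order, the sequential ID column can still be assigned after the fact.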

Advantages

Complete Data: Includes all public packages
Official API: Uses Packagist.org official endpoints
Detailed Metadata: Gets homepage and repository URLs
Reliable: Gracefully handles API failures
Rich Metadata: Packagist provides comprehensive package information

Limitations

Slow Processing: Rate limiting means long runtime
API Dependent: Requires Packagist.org to be available
Incomplete Metadata: Not all packages have homepage/repository info
No Version Info: Only captures general package information

Alternative Approaches

For faster processing, consider:

Packagist Dump: Check if Packagist provides database dumps
Metadata Files: Some registries provide metadata files
Cached Data: Use previously downloaded data and update incrementally
Parallel Processing: Use async requests or multiprocessing (advanced)

Code Explanation

Architecture

The PHP Miner uses a two-phase approach:

Phase 1: Download complete list of package names
Phase 2: Fetch detailed metadata for each package

This mirrors the approach used by the Ruby miner, as both ecosystems provide similar API structures.

1. Package List Download

import requests

packages_url = "https://packagist.org/packages/list.json"
response = requests.get(packages_url)  # single request for the full list
data = response.json()
package_names = data.get('packageNames', [])

Format: JSON object whose packageNames key holds an array of package names.

{
  "packageNames": [
    "symfony/console",
    "laravel/framework",
    "guzzlehttp/guzzle",
    ...
  ]
}

Advantages:

Fast download (single request)
Complete list
Simple JSON parsing

2. Package Metadata Fetching

for package_name in package_names:
    time.sleep(0.05)  # Rate limiting (20 req/sec)
    response = requests.get(f"https://packagist.org/packages/{package_name}.json")

Process:

Iterate through each package name
Wait 0.05 seconds (rate limiting)
Fetch JSON metadata
Extract homepage and repository URLs
Write to CSV immediately (streaming write)
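Put together, the steps above amount to a loop like the one below. This is an illustrative sketch, not the script itself: the function name mine and its parameters are hypothetical, and the fetcher and sleep function are injected so the example runs instantly and offline.

```python
import csv
import io
import time

def mine(package_names, fetch_details, out_file, delay=0.05, sleep=time.sleep):
    """Write one CSV row per package, pausing `delay` seconds before each lookup."""
    writer = csv.writer(out_file)
    writer.writerow(["ID", "Platform", "Name", "Homepage URL", "Repository URL"])
    for package_id, name in enumerate(package_names, start=1):
        sleep(delay)  # rate limiting
        homepage, repository = fetch_details(name)  # one lookup per package
        writer.writerow([package_id, "Packagist", name, homepage, repository])

def fake_details(name):
    """Stand-in lookup returning fixed values."""
    return "https://symfony.com", "https://github.com/symfony/symfony"

buffer = io.StringIO()
mine(["symfony/console"], fake_details, buffer, sleep=lambda seconds: None)
print(buffer.getvalue())
```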

3. Complex Repository URL Extraction

# Try direct repository field
repository = package_data.get('repository', '')

# Try version sources
versions = package_data.get('versions', {})
for version_key in ['dev-master', 'dev-main', 'master', 'main']:
    if version_key in versions:
        source = versions[version_key].get('source', {})
        repo_url = source.get('url', '')

Strategy: Multiple fallback levels.

Why Complex: Packagist stores repository info in multiple places:

Direct repository field (not always present)
Version-specific source URLs (most reliable)
Different branch naming conventions (master vs main)

4. Version Priority

for version_key in ['dev-master', 'dev-main', 'master', 'main']:

Priority Order:

dev-master (most common development branch)
dev-main (newer naming convention)
master (tagged version)
main (tagged version)

Fallback: If none found, use first available version.

5. URL Validation

if homepage_url and not homepage_url.startswith('http'):
    homepage_url = "nan"
if repo_url and not repo_url.startswith('http'):
    repo_url = "nan"

Purpose: Filter out invalid URLs.

Some packages have placeholder text instead of URLs
Ensures data quality
"nan" represents missing/invalid data

6. Streaming Write

with open(output_file, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for package_name in package_names:
        # Fetch data
        writer.writerow([...])  # Write immediately

Advantages:

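Low memory usage (rows are not accumulated in memory)
Rows written before a crash stay on disk, apart from any still in the write buffer
Partially completed runs can be resumed, but only with modification: reopening the file in "w" mode truncates it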
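One way the resume modification might work is to read the names already present in a previous output file and skip them. The helpers already_mined and remaining are hypothetical, and the sample row is only there to make the sketch runnable.

```python
import csv
import io

def already_mined(existing_csv):
    """Return the set of package names found in a previous output file."""
    return {row["Name"] for row in csv.DictReader(existing_csv)}

def remaining(package_names, done):
    """Package names that still need processing, in their original order."""
    return [name for name in package_names if name not in done]

previous = io.StringIO(
    "ID,Platform,Name,Homepage URL,Repository URL\n"
    "1,Packagist,symfony/console,https://symfony.com,https://github.com/symfony/symfony\n"
)
done = already_mined(previous)
todo = remaining(["symfony/console", "laravel/framework", "doctrine/orm"], done)
print(todo)  # ['laravel/framework', 'doctrine/orm']
```

A resumed run would then open the output in append mode ("a") instead of "w" and continue the ID column from the number of rows already written.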

Data Quality Notes

Repository URL Accuracy

Packagist repository URLs are generally high quality because:

Packages are submitted to Packagist by repository URL, so a source location is normally known
Most packages are hosted on GitHub
Package authors maintain metadata actively

Missing Data Patterns

Common reasons for "nan" values:

Abandoned Packages: No longer maintained, incomplete metadata
Private Packages: Listed but not publicly accessible
Vanity URLs: Homepage set to packagist.org page itself
Legacy Packages: Created before Packagist required full metadata

Release history

1.0.4

Jan 19, 2026

1.0.3

Jan 19, 2026

1.0.2

Jan 18, 2026

1.0.1

Jan 18, 2026

This version

1.0.0

Jan 18, 2026

Download files


Source Distribution

php_miner-1.0.0.tar.gz (14.8 kB)

Uploaded Jan 18, 2026 Source

Built Distribution


php_miner-1.0.0-py3-none-any.whl (11.2 kB)

Uploaded Jan 18, 2026 Python 3

File details

Details for the file php_miner-1.0.0.tar.gz.

File metadata

Download URL: php_miner-1.0.0.tar.gz
Upload date: Jan 18, 2026
Size: 14.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for php_miner-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`eeff456f60b62e5e0250fb3fa0df7dabd19939bb2cee2f44015a49fbf04a8c23`
MD5	`25f7522e4aab281f9dd841c8390b62e8`
BLAKE2b-256	`80cec96e9ff1acd6341fab0fa1b234bb23614646b77a10d63bef647df1e6f4e6`


File details

Details for the file php_miner-1.0.0-py3-none-any.whl.

File metadata

Download URL: php_miner-1.0.0-py3-none-any.whl
Upload date: Jan 18, 2026
Size: 11.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for php_miner-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`dce24f089de2b7aa2193acc4392e0c09016e29ce457f8a0d526ae886c697a0be`
MD5	`1982f8029cd7d1e05028d38a6e102ffe`
BLAKE2b-256	`f69b251485b0fe53086129012e55c797fe19f4a594227ef8355ec5e234231c1a`


php-miner 1.0.0
