Mine and extract complete package lists from Packagist/Composer registry
PHP/Packagist Miner
This tool downloads and processes the Packagist.org package list to extract PHP package information for cross-ecosystem analysis.
Features
- Downloads package list from Packagist.org
- Fetches detailed metadata via Packagist API
- Extracts package metadata (ID, name, homepage, repository)
- Formats data for cross-ecosystem package analysis
- Progress tracking with visual feedback
- Rate-limited API calls to respect server resources
- Generates standardized CSV output compatible with Package-Filter
Setup
Run the setup script
chmod +x setup.sh
./setup.sh
This will:
- Create a virtual environment
- Install required dependencies (requests, tqdm)
- Prepare the environment for mining
Manual Setup (Alternative)
If you prefer to set up manually:
# Create virtual environment
python3 -m venv venv
# Activate virtual environment
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
Important: Virtual environments contain hardcoded paths and cannot be moved after creation. If you need to relocate this script:
- Delete the venv folder
- Recreate it in the new location
- Reinstall the packages
Usage
Mine all PHP packages from Packagist.org:
source venv/bin/activate
python mine_php.py
Or run directly without activating:
venv/bin/python mine_php.py
The script will:
- Download the list of all package names from Packagist.org
- Fetch detailed information for each package via API
- Generate CSV output in Resource/Package/Package-List/PHP_New.csv
Data Sources
- Package Names: https://packagist.org/packages/list.json
- Package Details: https://packagist.org/packages/{vendor}/{package}.json
- Format: JSON
Output Format
The script generates PHP_New.csv in the Resource/Package/Package-List/ directory with the following structure:
ID,Platform,Name,Homepage URL,Repository URL
1,Packagist,symfony/symfony,https://symfony.com,https://github.com/symfony/symfony
2,Packagist,laravel/framework,https://laravel.com,https://github.com/laravel/framework
3,Packagist,guzzlehttp/guzzle,https://guzzlephp.org,https://github.com/guzzle/guzzle
Column Descriptions
- ID: Sequential identifier (1, 2, 3, ...)
- Platform: Always "Packagist" for PHP packages
- Name: Package name as registered on Packagist.org (vendor/package format)
- Homepage URL: Project homepage (from package metadata)
- Repository URL: Source code repository URL
Note: This format is compatible with the Package-Filter tool for cross-ecosystem analysis.
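As a minimal sketch of how a downstream tool might consume this file, the snippet below parses the sample rows shown above with Python's standard csv module. The helper name load_packages is hypothetical and is not part of mine_php.py or Package-Filter.

```python
import csv
import io

# Sample rows copied from the output format shown above.
SAMPLE = """ID,Platform,Name,Homepage URL,Repository URL
1,Packagist,symfony/symfony,https://symfony.com,https://github.com/symfony/symfony
2,Packagist,laravel/framework,https://laravel.com,https://github.com/laravel/framework
"""


def load_packages(handle):
    """Hypothetical helper: return one dict per CSV row, keyed by header name."""
    return list(csv.DictReader(handle))


rows = load_packages(io.StringIO(SAMPLE))
for row in rows:
    print(row["Name"], row["Repository URL"])
```

To read the real output, pass an open file handle for Resource/Package/Package-List/PHP_New.csv (opened with newline="" and encoding="utf-8") instead of the in-memory sample.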
Processing Details
API Rate Limiting
The script implements rate limiting to avoid overwhelming the Packagist API:
- Rate: At most 20 requests per second (0.05 second delay between requests)
- Purpose: Respectful API usage, avoiding server load
- Impact: Processing time increases with number of packages
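Estimated Time: With ~400,000 packages and 20 req/sec, the delays alone total about 5.6 hours, so expect at least ~5-6 hours of runtime; the time each request takes to complete adds to that.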
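The runtime figure can be reproduced with simple arithmetic. The sketch below is illustrative only: the function name estimated_hours and the latency_seconds parameter are assumptions, and the 0.1 second latency in the second call is a made-up value, not a measurement.

```python
def estimated_hours(package_count, delay_seconds=0.05, latency_seconds=0.0):
    """Rough runtime estimate: one delay (plus optional latency) per package."""
    total_seconds = package_count * (delay_seconds + latency_seconds)
    return total_seconds / 3600


# Delay only: 400,000 packages * 0.05 s = 20,000 s
print(round(estimated_hours(400_000), 2))  # prints 5.56

# Same run with a hypothetical 0.1 s of network latency per request
print(round(estimated_hours(400_000, latency_seconds=0.1), 2))
```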
Package Naming Convention
PHP packages follow the vendor/package naming pattern:
- symfony/console
- laravel/framework
- doctrine/orm
This two-part naming helps prevent conflicts and organize packages by maintainer.
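If the two parts are needed separately, a name can be split on the slash. This is a small sketch; split_package_name is a hypothetical helper, not a function from mine_php.py.

```python
def split_package_name(full_name):
    """Split 'vendor/package' into (vendor, package); reject anything else."""
    vendor, sep, package = full_name.partition("/")
    if not sep or not vendor or not package or "/" in package:
        raise ValueError(f"not a vendor/package name: {full_name!r}")
    return vendor, package


print(split_package_name("symfony/console"))
```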
Package Metadata Sources
For each package, the script fetches:
{
  "package": {
    "name": "symfony/console",
    "homepage": "https://symfony.com",
    "repository": "https://github.com/symfony/symfony",
    "versions": {
      "dev-master": {
        "source": {
          "url": "https://github.com/symfony/symfony.git",
          "type": "git"
        }
      }
    }
  }
}
The script prioritizes:
- Homepage: homepage field → "nan"
- Repository: repository field → version source URL → "nan"
Repository URL Extraction
The script tries multiple strategies to find repository URLs:
- Direct Repository Field: Uses package.repository if available
- Version Source: Checks dev-master, dev-main, master, main branches
- First Version: Falls back to first available version's source URL
- Validation: Ensures URLs start with http or https
Error Handling
If an API call fails (timeout, 404, etc.):
- Continues processing with "nan" values
- Logs no error (fails silently)
- Ensures complete dataset even with some missing data
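The behavior described above can be sketched as a wrapper that turns any failure into "nan" values. This is an illustration under assumptions, not the code in mine_php.py: fetch_or_nan and its fetch_json argument are hypothetical names, and catching every Exception is a simplification.

```python
MISSING = "nan"


def fetch_or_nan(fetch_json, package_name):
    """Return (homepage, repository); fall back to "nan" if the call fails.

    fetch_json is any callable that takes a package name and returns the
    decoded JSON response as a dict (hypothetical interface).
    """
    try:
        data = fetch_json(package_name)
        package = data.get("package", {})
    except Exception:
        # Fail silently, as described above, so one bad package
        # does not stop the whole run.
        return MISSING, MISSING
    homepage = package.get("homepage") or MISSING
    repository = package.get("repository") or MISSING
    return homepage, repository


def failing_fetch(name):
    raise TimeoutError("simulated timeout")


print(fetch_or_nan(failing_fetch, "symfony/console"))
```

With the requests library, a caller could pass something like lambda name: requests.get(f"https://packagist.org/packages/{name}.json", timeout=10).json(), where the timeout value is an assumption.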
Files
- mine_php.py: Main script
- requirements.txt: Python dependencies (requests, tqdm)
- setup.sh: Automated setup script
- Output: ../../../Resource/Package/Package-List/PHP_New.csv
Troubleshooting
"Error downloading package list"
Check that:
- You have internet connectivity
- packagist.org is accessible: curl -I https://packagist.org/packages/list.json
- No firewall is blocking the connection
Script is very slow
This is expected behavior:
- Rate limiting (20 requests/second) is intentional
- With 400K+ packages, expect 5-6 hours runtime
- Consider running overnight or in background
To run in background:
nohup python mine_php.py > output.log 2>&1 &
"Error parsing JSON"
This can occur if:
- Packagist API response format changed
- Network corruption during download
- Server returned error page instead of JSON
Solution: Check internet connection and try again.
"Permission denied" when creating output directory
Ensure you have write permissions to:
- Current directory (for temporary files)
- Resource/Package/Package-List/ (for output)
Incomplete data (many "nan" values)
This can occur if:
- API is temporarily unavailable
- Network issues during processing
- Some packages have incomplete metadata
Note: This is normal - not all packages have complete metadata on Packagist.org.
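To judge whether the share of "nan" values looks unusual, it can be measured directly from the CSV. The sketch below uses made-up sample rows, and missing_share is a hypothetical helper.

```python
import csv
import io

# Made-up rows in the documented output format, for illustration only.
SAMPLE = """ID,Platform,Name,Homepage URL,Repository URL
1,Packagist,symfony/symfony,https://symfony.com,https://github.com/symfony/symfony
2,Packagist,example/package,nan,https://github.com/example/package
"""


def missing_share(handle, column):
    """Return the fraction of rows whose value in `column` is "nan"."""
    rows = list(csv.DictReader(handle))
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row[column] == "nan")
    return missing / len(rows)


print(missing_share(io.StringIO(SAMPLE), "Homepage URL"))  # prints 0.5
```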
Virtual environment issues
If you encounter errors related to the virtual environment:
- Delete the venv folder: rm -rf venv
- Re-run the setup script: ./setup.sh
- Virtual environments cannot be moved after creation; recreate the environment if you move the directory
Performance Notes
- Download Time: Fast (package list is relatively small JSON)
- Processing Time: SLOW (~5-6 hours for 400K+ packages)
- Memory Usage: Low (processes one package at a time)
- Network Usage: Moderate (many small API requests)
Optimization Tips
To speed up processing (advanced users):
- Reduce delay in time.sleep(0.05) (risks being rate-limited or blocked)
- Use parallel requests (requires code modification)
- Use Packagist metadata dump if available (check Packagist documentation)
Advantages
- Complete Data: Includes all public packages
- Official API: Uses Packagist.org official endpoints
- Detailed Metadata: Gets homepage and repository URLs
- Reliable: Gracefully handles API failures
- Rich Metadata: Packagist provides comprehensive package information
Limitations
- Slow Processing: Rate limiting means long runtime
- API Dependent: Requires Packagist.org to be available
- Incomplete Metadata: Not all packages have homepage/repository info
- No Version Info: Only captures general package information
Alternative Approaches
For faster processing, consider:
- Packagist Dump: Check if Packagist provides database dumps
- Metadata Files: Some registries provide metadata files
- Cached Data: Use previously downloaded data and update incrementally
- Parallel Processing: Use async requests or multiprocessing (advanced)
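As one possible shape for the parallel option, the sketch below uses a thread pool from the standard library. It is not part of mine_php.py: fetch_all, fetch_one, and the default of 4 workers are assumptions, and any real use should still respect Packagist's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(package_names, fetch_one, max_workers=4):
    """Call fetch_one(name) for every name using a small thread pool.

    Results come back in the same order as package_names. fetch_one is a
    hypothetical callable and should handle its own errors, because an
    exception raised inside it is re-raised here.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, package_names))


# Stand-in for a network call so the example runs offline.
results = fetch_all(["symfony/console", "doctrine/orm"], lambda name: name.upper())
print(results)
```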
Code Explanation
Architecture
The PHP Miner uses a two-phase approach:
- Phase 1: Download complete list of package names
- Phase 2: Fetch detailed metadata for each package
This mirrors the approach used by the Ruby miner, as both ecosystems provide similar API structures.
1. Package List Download
import requests

packages_url = "https://packagist.org/packages/list.json"
response = requests.get(packages_url)
data = response.json()
package_names = data.get('packageNames', [])
Format: JSON object with a packageNames array of package names.
{
  "packageNames": [
    "symfony/console",
    "laravel/framework",
    "guzzlehttp/guzzle",
    ...
  ]
}
Advantages:
- Fast download (single request)
- Complete list
- Simple JSON parsing
2. Package Metadata Fetching
for package_name in package_names:
    time.sleep(0.05)  # Rate limiting (20 req/sec)
    response = requests.get(f"https://packagist.org/packages/{package_name}.json")
Process:
- Iterate through each package name
- Wait 0.05 seconds (rate limiting)
- Fetch JSON metadata
- Extract homepage and repository URLs
- Write to CSV immediately (streaming write)
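The steps above can be sketched end to end as a single loop. This is a simplified illustration, not the actual mine_php.py: the function mine and its fetch_urls and sleep parameters are hypothetical, and they are injectable so the example runs without network access.

```python
import csv
import io
import time


def mine(package_names, fetch_urls, out, delay=0.05, sleep=time.sleep):
    """Write one CSV row per package, pausing `delay` seconds before each fetch.

    fetch_urls(name) is a hypothetical callable returning
    (homepage_url, repository_url) for a package.
    """
    writer = csv.writer(out)
    writer.writerow(["ID", "Platform", "Name", "Homepage URL", "Repository URL"])
    for package_id, name in enumerate(package_names, start=1):
        sleep(delay)  # rate limiting
        homepage, repository = fetch_urls(name)
        writer.writerow([package_id, "Packagist", name, homepage, repository])


# Offline demo: a fake fetcher and a no-op sleep.
buffer = io.StringIO()
mine(
    ["symfony/console"],
    lambda name: ("https://symfony.com", "https://github.com/symfony/symfony"),
    buffer,
    sleep=lambda seconds: None,
)
print(buffer.getvalue())
```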
3. Complex Repository URL Extraction
# Try direct repository field
repository = package_data.get('repository', '')
# Try version sources
versions = package_data.get('versions', {})
for version_key in ['dev-master', 'dev-main', 'master', 'main']:
    if version_key in versions:
        source = versions[version_key].get('source', {})
        repo_url = source.get('url', '')
        if repo_url:
            break  # stop at the highest-priority branch that has a source URL
Strategy: Multiple fallback levels.
Why Complex: Packagist stores repository info in multiple places:
- Direct repository field (not always present)
- Version-specific source URLs (most reliable)
- Different branch naming conventions (master vs main)
4. Version Priority
for version_key in ['dev-master', 'dev-main', 'master', 'main']:
Priority Order:
- dev-master (most common development branch)
- dev-main (newer naming convention)
- master (tagged version)
- main (tagged version)
Fallback: If none found, use first available version.
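Putting the repository field, the version priority order, and the first-version fallback together might look like the sketch below. It is a reading of the description above rather than the script's actual code; extract_repository_url and the helper _is_http_url are hypothetical names.

```python
PREFERRED_VERSIONS = ["dev-master", "dev-main", "master", "main"]


def _is_http_url(value):
    return isinstance(value, str) and value.startswith(("http://", "https://"))


def _source_url(version):
    """Return the source URL of one version entry, or None."""
    if not isinstance(version, dict):
        return None
    source = version.get("source") or {}
    return source.get("url") if isinstance(source, dict) else None


def extract_repository_url(package_data):
    """Apply the fallback chain: repository field, preferred branches, first version."""
    repository = package_data.get("repository")
    if _is_http_url(repository):
        return repository

    versions = package_data.get("versions") or {}
    if not isinstance(versions, dict):
        return "nan"

    # Preferred branches first, in priority order.
    for key in PREFERRED_VERSIONS:
        url = _source_url(versions.get(key))
        if _is_http_url(url):
            return url

    # Fallback: the first available version, whatever it is called.
    first = next(iter(versions.values()), None)
    url = _source_url(first)
    return url if _is_http_url(url) else "nan"


example = {
    "versions": {
        "1.0.0": {"source": {"url": "https://example.org/old.git"}},
        "dev-main": {"source": {"url": "https://example.org/main.git"}},
    }
}
print(extract_repository_url(example))  # prints https://example.org/main.git
```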
5. URL Validation
if homepage_url and not homepage_url.startswith('http'):
    homepage_url = "nan"
if repo_url and not repo_url.startswith('http'):
    repo_url = "nan"
Purpose: Filter out invalid URLs.
- Some packages have placeholder text instead of URLs
- Ensures data quality
- "nan" represents missing/invalid data
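A prefix check accepts any string that begins with "http", including placeholder text such as "httpfoo". As a possible stricter variant (not what the script does), the sketch below parses the value with urllib.parse and requires both an http or https scheme and a host; clean_url is a hypothetical name.

```python
from urllib.parse import urlparse


def clean_url(value):
    """Return the URL if it has an http(s) scheme and a host, else "nan"."""
    if not isinstance(value, str):
        return "nan"
    value = value.strip()
    try:
        parsed = urlparse(value)
    except ValueError:
        # urlparse can raise on malformed input such as a broken IPv6 host.
        return "nan"
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return value
    return "nan"


print(clean_url("https://symfony.com"))  # prints https://symfony.com
print(clean_url("httpfoo"))              # prints nan
```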
6. Streaming Write
with open(output_file, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for package_name in packages:
        # Fetch data
        writer.writerow([...])  # Write immediately
Advantages:
- Low memory usage (doesn't store all data in memory)
- Progress saved even if script crashes
- Can resume partially completed runs (with modification)
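One way the resume modification might work is to read the names already present in the output file and skip them on the next run. The sketch below is an assumption about how that could be done, not an existing feature; already_mined and remaining are hypothetical helpers.

```python
import csv
import os


def already_mined(output_file):
    """Return the set of package names already written to the CSV, if any."""
    if not os.path.exists(output_file):
        return set()
    with open(output_file, newline="", encoding="utf-8") as f:
        return {row["Name"] for row in csv.DictReader(f)}


def remaining(package_names, output_file):
    """Return the names that still need to be fetched, in their original order."""
    done = already_mined(output_file)
    return [name for name in package_names if name not in done]


print(remaining(["symfony/console", "doctrine/orm"], "does_not_exist.csv"))
```

A resumed run would then open the file in append mode ("a") instead of "w" and continue the ID sequence after the rows already written. A row cut off by a crash would not match any name, so that package would simply be fetched again.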
Data Quality Notes
Repository URL Accuracy
Packagist repository URLs are generally high quality because:
- Composer (PHP package manager) requires this information
- Most packages are hosted on GitHub
- Package authors maintain metadata actively
Missing Data Patterns
Common reasons for "nan" values:
- Abandoned Packages: No longer maintained, incomplete metadata
- Private Packages: Listed but not publicly accessible
- Vanity URLs: Homepage set to packagist.org page itself
- Legacy Packages: Created before Packagist required full metadata