Source to PURL - Identify package coordinates and PURLs from source code using UPMEX and multiple discovery strategies
Project description
SRC2PURL - Source Code to Package URL
A Python tool that identifies package coordinates (name, version, license, PURL) from source code directories using an hybrid discovery strategy with manifest parsing, code fingerprinting, repository search, and Software Heritage archive.
Overview
src2purl uses a progressive 2-phase discovery strategy to identify packages:
Phase 1: Hash-based Package Discovery (5-15 seconds)
Primary identification using content fingerprinting and repository search:
-
Directory & Content Scanning - Generate SWHIDs for directories and files
- ✅ Complete file tree analysis with configurable depth
- ✅ SWHID generation for precise content identification
- ✅ Smart filtering of binary files and hidden directories
-
Software Heritage Archive (Optional) - Deep provenance discovery
- ✅ Most comprehensive - finds exact source code matches
- ✅ Historical accuracy - can identify older versions
- ⚠️ Requires opt-in with
--use-swhdue to longer processing time
-
Keyword Search (Fallback) - Multi-platform repository discovery
- GitHub API Search - Repository identification by keywords
- SCANOSS Fingerprinting - Code similarity detection via file fingerprints
- ✅ Universal coverage - finds repositories for any project
- ✅ Fast execution (~10 seconds total)
Phase 2: Manifest-based Validation & Enhancement (1-3 seconds)
Authoritative package metadata extraction and result enhancement:
- UPMEX Manifest Parsing - Universal Package Metadata Extractor
- ✅ Perfect metadata extraction from package files (package.json, setup.py, pom.xml, go.mod, Cargo.toml, etc.)
- ✅ Multi-ecosystem support (PyPI, NPM, Maven, Go, Ruby Gems, PHP, .NET)
- ✅ Complete package info (name, version, license, PURL)
- ✅ Cross-validation with Phase 1 results for enhanced accuracy
Features
Core Capabilities
- 2-Phase Discovery Strategy: Hash-based identification enhanced by manifest parsing
- Multi-Ecosystem Support: PyPI, NPM, Maven, Go, Ruby Gems, PHP, .NET, and more
- Cross-Phase Validation: UPMEX manifest data validates and enhances hash-based findings
- Confidence Scoring: Multi-factor scoring (85-100% for exact matches)
- Package Coordinate Extraction: Complete metadata (name, version, license, PURL)
Performance & Reliability
- Fast by Default: 5-15 seconds for typical projects (vs 90+ seconds with SWH)
- No API Keys Required: Works well without authentication (SCANOSS, GitHub search)
- Optional API Keys: Enhanced rate limits and accuracy with GitHub/SCANOSS tokens
- Persistent Caching: File-based cache with smart TTL to avoid API rate limits
- Rate Limit Handling: Automatic backoff and retry logic
Discovery Methods
- UPMEX Manifest Parsing: Universal Package Metadata Extractor for all major package ecosystems
- SCANOSS Fingerprinting: 100% accuracy code similarity with detailed license detection
- GitHub Repository Search: Universal coverage repository identification
- Software Heritage Archive: Comprehensive source inventory (opt-in with
--use-swh)
Output & Integration
- Multiple Output Formats: JSON and table output formats
- PURL Generation: Standard Package URLs for identified packages
- Enhanced License Detection: Integration with oslili for improved license detection
- Subcomponent Detection: Identifies multiple packages within monorepos and complex projects
Installation
From Source
git clone https://github.com/oscarvalenzuelab/src2purl.git
cd src2purl
pip install -e .
Usage
Basic Usage
# Fast discovery (default) - Uses manifest parsing + SCANOSS + GitHub (5-15 seconds)
src2purl /path/to/source/code
# Comprehensive discovery - Includes Software Heritage archive (90+ seconds)
src2purl /path/to/source --use-swh
# High confidence matches only
src2purl /path/to/source --confidence-threshold 0.85
# JSON output format for integration
src2purl /path/to/source --output-format json
# Detect subcomponents in monorepos
src2purl /path/to/source --detect-subcomponents
# Skip license detection (faster)
src2purl /path/to/source --no-license-detection
# Verbose output for debugging
src2purl /path/to/source --verbose
# Clear cache and exit
src2purl --clear-cache
Discovery Strategy Examples
# Default: 2-Phase hybrid approach (5-15 seconds)
# Phase 1: Hash-based discovery + Phase 2: UPMEX manifest parsing
# Good for: Most use cases, balanced speed and accuracy
src2purl /path/to/project
# Comprehensive: Include Software Heritage archive (90+ seconds)
# Phase 1 includes deep provenance discovery + Phase 2: UPMEX enhancement
# Good for: Security audits, research, historical analysis
export SWH_API_TOKEN=your_token # Optional but recommended
src2purl /path/to/project --use-swh
# High confidence: Only report confident matches
# Filters results to highest confidence findings from both phases
src2purl /path/to/project --confidence-threshold 0.85
API Authentication
⚠️ No API keys required! The tool works with the free public APIs. API keys only provide enhanced rate limits and additional features.
Recommended API Keys (Optional)
1. GitHub API - Most Valuable (Free, 2 minutes to setup)
export GITHUB_TOKEN=your_github_personal_access_token
- Get token: https://github.com/settings/tokens (no special permissions needed)
- Benefits:
- ✅ Rate limit: 10 → 5000 requests/hour
- ✅ Better search: More accurate repository identification
- ✅ No cost: Completely free
- Impact: Significant improvement for repository discovery
2. SCANOSS API - Nice to Have (Free, optional)
export SCANOSS_API_KEY=your_scanoss_key
- Get token: Register at https://www.scanoss.com
- Benefits:
- ✅ No cost: Free tier available
- ✅ Enhanced rate limits: Premium API endpoint
- ✅ Additional features: Possible extra metadata
- Impact: Minor improvement (SCANOSS works great without key)
3. Software Heritage API - For Heavy Usage (Free, only if using --use-swh)
export SWH_API_TOKEN=your_swh_token
- Get token: Register at https://archive.softwareheritage.org/api/
- Benefits:
- ✅ Bypass rate limits: No 60-second waits
- ✅ Faster comprehensive scans: When using
--use-swh
- Impact: Essential for
--use-swhflag, not needed for default fast mode
Performance Comparison
| Configuration | Typical Time | API Calls | Best For |
|---|---|---|---|
| No API keys | 5-15 seconds | Minimal | Most users |
| + GitHub token | 5-15 seconds | Enhanced | Recommended setup |
| + All tokens | 5-15 seconds | Premium | Production use |
| + SWH mode | 90+ seconds | Heavy | Security audits |
Recommendation: Start with GitHub token only - it's free, fast to setup, and provides the biggest improvement.
SWHID Validation
# Generate and validate SWHID for a directory
src2purl-validate /path/to/directory
# Compare against expected SWHID
src2purl-validate /path/to/directory --expected-swhid swh:1:dir:abc123...
# Use fallback implementation
src2purl-validate /path/to/directory --use-fallback --verbose
Command Line Options
Core Options
path: Directory path to analyze (required)--confidence-threshold: Minimum confidence to report matches (default: 0.3)--output-format: Output format: 'json' or 'table' (default: table)--verbose: Verbose output for debugging
Discovery Control
--use-swh: Include Software Heritage archive checking (optional, adds 90+ seconds)--no-license-detection: Skip automatic license detection from local source (faster)--detect-subcomponents: Detect and identify subcomponents in monorepos--max-depth: Maximum directory depth to scan (default: 2)
Performance & Caching
--no-cache: Disable API response caching--clear-cache: Clear all cached API responses and exit
Authentication
--api-token: Software Heritage API token (only used with --use-swh)- Environment variables:
GITHUB_TOKEN,SCANOSS_API_KEY,SWH_API_TOKEN
Discovery Method Breakdown
# Default: 2-Phase Discovery (Hash-based + UPMEX manifest parsing)
src2purl /path/to/project
# Phase 1: SWHID generation → GitHub/SCANOSS search
# Phase 2: UPMEX manifest extraction → Cross-validation & enhancement
# Add Software Heritage for comprehensive provenance discovery
src2purl /path/to/project --use-swh
# Phase 1: SWHID generation → Software Heritage archive → Fallback search
# Phase 2: UPMEX manifest extraction → Cross-validation & enhancement
# Performance optimization options
src2purl /path/to/project --no-license-detection # Skip license enhancement
src2purl /path/to/project --use-swh --verbose # Full discovery with details
License
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0) - see the LICENSE file for details.
Status
This project is currently in active development. See the Issues page for planned features and known issues.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file src2purl-1.3.4.tar.gz.
File metadata
- Download URL: src2purl-1.3.4.tar.gz
- Upload date:
- Size: 68.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f81d474b410c3bd2a43f926c92270351eb67b1729a79b10bb3ce40d7bad93068
|
|
| MD5 |
fbe011c47ed41b6a403d1c5bfdc1726e
|
|
| BLAKE2b-256 |
0b39671ef9d6100d4f03d9c7cc8588d729655373ffd27597ab7ecf293113cc3e
|
Provenance
The following attestation bundles were made for src2purl-1.3.4.tar.gz:
Publisher:
python-publish.yml on SemClone/src2purl
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
src2purl-1.3.4.tar.gz -
Subject digest:
f81d474b410c3bd2a43f926c92270351eb67b1729a79b10bb3ce40d7bad93068 - Sigstore transparency entry: 647868033
- Sigstore integration time:
-
Permalink:
SemClone/src2purl@31d30757523971027dde4a6c04db04dc1a6bbda2 -
Branch / Tag:
refs/tags/v1.3.4 - Owner: https://github.com/SemClone
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
python-publish.yml@31d30757523971027dde4a6c04db04dc1a6bbda2 -
Trigger Event:
release
-
Statement type:
File details
Details for the file src2purl-1.3.4-py3-none-any.whl.
File metadata
- Download URL: src2purl-1.3.4-py3-none-any.whl
- Upload date:
- Size: 75.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
93726bbaef088cf2105b1052b0246889deeae4d65d11ba67f0c737fda71ce79c
|
|
| MD5 |
5a32f9f76adf15f20f8924a7524f645b
|
|
| BLAKE2b-256 |
0d917a85f34f83b45318af21b594c2bea396b9153ea1b9b95ff482ece3ddd766
|
Provenance
The following attestation bundles were made for src2purl-1.3.4-py3-none-any.whl:
Publisher:
python-publish.yml on SemClone/src2purl
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
src2purl-1.3.4-py3-none-any.whl -
Subject digest:
93726bbaef088cf2105b1052b0246889deeae4d65d11ba67f0c737fda71ce79c - Sigstore transparency entry: 647868036
- Sigstore integration time:
-
Permalink:
SemClone/src2purl@31d30757523971027dde4a6c04db04dc1a6bbda2 -
Branch / Tag:
refs/tags/v1.3.4 - Owner: https://github.com/SemClone
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
python-publish.yml@31d30757523971027dde4a6c04db04dc1a6bbda2 -
Trigger Event:
release
-
Statement type: