Source to ID - Identify package coordinates and repositories from source code using multiple strategies
This project has been archived.
The maintainers of this project have marked this project as archived. No new releases are expected.
SRC2ID - Source Code to Package ID
A Python tool that identifies package coordinates (name, version, license, PURL) from source code directories using a hybrid discovery strategy that combines manifest parsing, code fingerprinting, repository search, and the Software Heritage archive.
Overview
src2id identifies packages with a progressive discovery strategy: four methods grouped into three tiers, ordered from fastest to slowest:
Tier 1: Fast Manifest Discovery (1-5 seconds)
- UPMEX/Manifest Parsing - Extract declared dependencies from package files (package.json, setup.py, pom.xml, go.mod, Cargo.toml, etc.)
- ✅ Perfect metadata extraction (85-95% confidence)
- ✅ Multi-ecosystem support (PyPI, NPM, Maven, Go, RubyGems)
- ✅ Complete package info (name, version, license, PURL)
Tier 2: Parallel Code Discovery (5-15 seconds)
- SCANOSS Fingerprinting - Code similarity detection via file fingerprints
- ✅ 100% accuracy when fingerprints exist in database
- ✅ Excellent license detection with detailed SPDX information
- ✅ Works with any file type (.c, .py, .cpp, .js, etc.)
- GitHub Repository Search - Find repositories using project names and keywords
- ✅ Universal coverage - finds repositories for any project
- ✅ Fast execution (~10 seconds total)
- ✅ Good ecosystem identification
Tier 3: Provenance Discovery (Optional, 90+ seconds)
- Software Heritage Archive - Deep source code inventory using content hashing
- ✅ Most comprehensive - finds exact source code matches
- ✅ Historical accuracy - can identify older versions
- ⚠️ Requires opt-in with --use-swh due to rate limits
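The tiered flow above can be sketched roughly as follows. This is an illustrative outline, not src2id's actual code: the `Match` type, the `discover` function, and the early-stop value `stop_at` are assumptions made for the example. Only the 0.3 default threshold comes from the documented `--confidence-threshold` option.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Match:
    """Hypothetical result record; the real tool's data model may differ."""
    name: str
    version: Optional[str]
    confidence: float  # 0.0 to 1.0
    method: str


# A strategy takes a source path and returns candidate matches.
Strategy = Callable[[str], List[Match]]


def discover(path: str, tiers: List[List[Strategy]],
             threshold: float = 0.3, stop_at: float = 0.85) -> List[Match]:
    """Run tiers in order and stop once a tier has produced a strong match."""
    found: List[Match] = []
    for tier in tiers:
        for strategy in tier:
            # Keep only candidates at or above the reporting threshold.
            found.extend(m for m in strategy(path) if m.confidence >= threshold)
        if any(m.confidence >= stop_at for m in found):
            break  # a cheaper tier was conclusive, so skip the slower ones
    # Report the most confident candidates first.
    return sorted(found, key=lambda m: m.confidence, reverse=True)
```

Whether src2id stops early after a conclusive tier is not stated in this README; the sketch assumes it does, to show why cheap tiers run first.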
Features
Core Capabilities
- Hybrid Discovery Strategy: Progressive approach across four methods (manifest → fingerprinting → search → archive)
- Multi-Ecosystem Support: PyPI, NPM, Maven, Go, RubyGems, and more
- Cross-Method Validation: SCANOSS confirms GitHub findings, UPMEX validates SCANOSS results
- Confidence Scoring: Multi-factor scoring (85-100% for exact matches)
- Package Coordinate Extraction: Complete metadata (name, version, license, PURL)
Performance & Reliability
- Fast by Default: 5-15 seconds for typical projects (vs 90+ seconds with SWH)
- No API Keys Required: Works well without authentication (SCANOSS, GitHub search)
- Optional API Keys: Enhanced rate limits and accuracy with GitHub/SCANOSS tokens
- Persistent Caching: File-based cache with smart TTL to avoid API rate limits
- Rate Limit Handling: Automatic backoff and retry logic
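As a rough illustration of the backoff and retry behaviour mentioned above, here is a minimal sketch of exponential backoff with jitter. The `RateLimited` exception and the `with_backoff` helper are hypothetical names for this example; the README does not describe how src2id implements its retries.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error a client might raise on an HTTP 429 response."""


def with_backoff(call: Callable[[], T], max_attempts: int = 5,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `call` on RateLimited, doubling the wait after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            # Exponential delay plus random jitter to avoid synchronized retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the behaviour can be exercised in tests without actually waiting.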
Discovery Methods
- UPMEX/Manifest Parsing: Extract from package.json, setup.py, pom.xml, go.mod, Cargo.toml, etc.
- SCANOSS Fingerprinting: 100% accuracy code similarity with detailed license detection
- GitHub Repository Search: Universal coverage repository identification
- Software Heritage Archive: Comprehensive source inventory (opt-in with --use-swh)
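To make the manifest-parsing method concrete, the sketch below reads a `package.json` and pulls out the declared name, version, and license. It is a simplified stand-in, not the UPMEX implementation: the function name `read_package_json` and the returned dictionary shape are assumptions for illustration, and it covers a single manifest type only.

```python
import json
from pathlib import Path
from typing import Optional


def read_package_json(source_dir: str) -> Optional[dict]:
    """Return declared npm metadata, or None if no usable manifest exists."""
    manifest = Path(source_dir) / "package.json"
    if not manifest.is_file():
        return None
    try:
        data = json.loads(manifest.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None  # unreadable or malformed manifest
    if not isinstance(data, dict):
        return None
    return {
        "ecosystem": "npm",
        "name": data.get("name"),
        "version": data.get("version"),
        "license": data.get("license"),
    }
```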
Output & Integration
- Multiple Output Formats: JSON and table output formats
- PURL Generation: Standard Package URLs for identified packages
- Enhanced License Detection: Integration with oslili for improved license detection
- Subcomponent Detection: Identifies multiple packages within monorepos and complex projects
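For readers unfamiliar with Package URLs, the sketch below assembles a basic PURL of the form `pkg:type/namespace/name@version`. It is a simplified illustration under stated assumptions: `make_purl` is a hypothetical helper, it omits qualifiers and subpaths, and it does not apply the type-specific normalization rules of the PURL specification (for example, lowercasing PyPI names).

```python
from typing import Optional
from urllib.parse import quote


def make_purl(pkg_type: str, name: str, version: Optional[str] = None,
              namespace: Optional[str] = None) -> str:
    """Build a minimal PURL string without qualifiers or subpath."""
    parts = [pkg_type.lower()]
    if namespace:
        # Percent-encode each namespace segment, keeping '/' as the separator.
        parts.extend(quote(segment, safe="") for segment in namespace.split("/"))
    parts.append(quote(name, safe=""))
    purl = "pkg:" + "/".join(parts)
    if version:
        purl += "@" + quote(version, safe="")
    return purl


print(make_purl("pypi", "requests", "2.31.0"))
print(make_purl("npm", "core", "16.0.0", namespace="@angular"))
```

For production use, a maintained PURL library is a safer choice than hand-built strings.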
Installation
From Source
git clone https://github.com/oscarvalenzuelab/semantic-copycat-src2id.git
cd semantic-copycat-src2id
pip install -e .
Usage
Basic Usage
# Fast discovery (default) - Uses manifest parsing + SCANOSS + GitHub (5-15 seconds)
src2id /path/to/source/code
# Comprehensive discovery - Includes Software Heritage archive (90+ seconds)
src2id /path/to/source --use-swh
# High confidence matches only
src2id /path/to/source --confidence-threshold 0.85
# JSON output format for integration
src2id /path/to/source --output-format json
# Detect subcomponents in monorepos
src2id /path/to/source --detect-subcomponents
# Skip license detection (faster)
src2id /path/to/source --no-license-detection
# Verbose output for debugging
src2id /path/to/source --verbose
# Clear cache and exit
src2id --clear-cache
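For integration from Python, one option is to call the CLI and parse its JSON output. The sketch below uses only flags listed in this README. It assumes that `--output-format json` writes a single JSON document to standard output with nothing else mixed in; the structure of that document is not described here, so the result is returned unparsed beyond `json.loads`. The helper names are hypothetical.

```python
import json
import subprocess
from typing import Any, List


def build_command(path: str, threshold: float = 0.3,
                  use_swh: bool = False) -> List[str]:
    """Assemble the src2id command line from documented options."""
    cmd = ["src2id", path, "--output-format", "json",
           "--confidence-threshold", str(threshold)]
    if use_swh:
        cmd.append("--use-swh")  # opt in to the slower Software Heritage tier
    return cmd


def run_src2id(path: str, **options: Any) -> Any:
    """Run src2id and return its parsed JSON output (schema not assumed)."""
    result = subprocess.run(build_command(path, **options),
                            capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```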
Discovery Strategy Examples
# Speed-optimized: Manifest parsing only (1-3 seconds)
# Good for: Known projects with package files
src2id /path/to/npm-project # Finds package.json automatically
# Balanced: Default hybrid approach (5-15 seconds)
# Good for: Most use cases, unknown projects
src2id /path/to/unknown-code
# Comprehensive: Include Software Heritage (90+ seconds)
# Good for: Security audits, research, modified code
export SWH_API_TOKEN=your_token # Optional but recommended
src2id /path/to/unknown-code --use-swh
API Authentication
⚠️ No API keys required! The tool works with the free public APIs. API keys only provide enhanced rate limits and additional features.
Recommended API Keys (Optional)
1. GitHub API - Most Valuable (Free, 2 minutes to set up)
export GITHUB_TOKEN=your_github_personal_access_token
- Get token: https://github.com/settings/tokens (no special permissions needed)
- Benefits:
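- ✅ Rate limit: 60 → 5,000 requests/hour on GitHub's core REST API (search endpoints have a separate per-minute limit that also rises with a token)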
- ✅ Better search: More accurate repository identification
- ✅ No cost: Completely free
- Impact: Significant improvement for repository discovery
2. SCANOSS API - Nice to Have (Free, optional)
export SCANOSS_API_KEY=your_scanoss_key
- Get token: Register at https://www.scanoss.com
- Benefits:
- ✅ No cost: Free tier available
- ✅ Enhanced rate limits: Premium API endpoint
- ✅ Additional features: Possible extra metadata
- Impact: Minor improvement (SCANOSS works well without a key)
3. Software Heritage API - For Heavy Usage (Free, only if using --use-swh)
export SWH_API_TOKEN=your_swh_token
- Get token: Register at https://archive.softwareheritage.org/api/
- Benefits:
- ✅ Bypass rate limits: No 60-second waits
- ✅ Faster comprehensive scans: When using --use-swh
- Impact: Essential for the --use-swh flag, not needed for default fast mode
Performance Comparison
| Configuration | Typical Time | API Calls | Best For |
|---|---|---|---|
| No API keys | 5-15 seconds | Minimal | Most users |
| + GitHub token | 5-15 seconds | Enhanced | Recommended setup |
| + All tokens | 5-15 seconds | Premium | Production use |
| + SWH mode | 90+ seconds | Heavy | Security audits |
Recommendation: Start with a GitHub token only. It is free, quick to set up, and provides the biggest improvement.
SWHID Validation
# Generate and validate SWHID for a directory
src2id-validate /path/to/directory
# Compare against expected SWHID
src2id-validate /path/to/directory --expected-swhid swh:1:dir:abc123...
# Use fallback implementation
src2id-validate /path/to/directory --use-fallback --verbose
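The validator above works on directories, whose identifiers are derived from the identifiers of the files they contain. As a small, self-contained illustration of the idea, the sketch below computes the SWHID of a single file's content, which is the SHA-1 of the content in Git blob encoding. It is not src2id's implementation and does not cover directory (`swh:1:dir:`) identifiers; `content_swhid` is a hypothetical name.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return the swh:1:cnt: identifier for raw file content."""
    # Git blob encoding: "blob <length>\0" followed by the raw bytes.
    header = b"blob %d\0" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


print(content_swhid(b""))
```

Because the hash matches Git's blob hash, `git hash-object <file>` can be used to cross-check a result.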
Command Line Options
Core Options
- path: Directory path to analyze (required)
- --confidence-threshold: Minimum confidence to report matches (default: 0.3)
- --output-format: Output format: 'json' or 'table' (default: table)
- --verbose: Verbose output for debugging
Discovery Control
- --use-swh: Include Software Heritage archive checking (optional, adds 90+ seconds)
- --no-license-detection: Skip automatic license detection from local source (faster)
- --detect-subcomponents: Detect and identify subcomponents in monorepos
- --max-depth: Maximum directory depth to scan (default: 2)
Performance & Caching
- --no-cache: Disable API response caching
- --clear-cache: Clear all cached API responses and exit
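To show what a file-based cache with a time-to-live can look like, here is a minimal sketch. It is an assumption-laden illustration rather than src2id's cache: the `FileCache` class, its on-disk layout (one JSON file per key, named by SHA-256), and the one-hour default TTL are all invented for the example.

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Any, Optional


class FileCache:
    """Tiny JSON-on-disk cache; entries expire after ttl_seconds."""

    def __init__(self, directory: str, ttl_seconds: float = 3600.0):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.ttl_seconds = ttl_seconds

    def _path(self, key: str) -> Path:
        # Hash the key so arbitrary URLs map to safe file names.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.json"

    def get(self, key: str) -> Optional[Any]:
        path = self._path(key)
        try:
            entry = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            return None  # missing or corrupt entry counts as a miss
        if time.time() - entry["stored_at"] > self.ttl_seconds:
            path.unlink(missing_ok=True)  # drop the expired entry
            return None
        return entry["value"]

    def set(self, key: str, value: Any) -> None:
        entry = {"stored_at": time.time(), "value": value}
        self._path(key).write_text(json.dumps(entry), encoding="utf-8")

    def clear(self) -> int:
        """Delete all entries (similar in spirit to --clear-cache)."""
        removed = 0
        for path in self.directory.glob("*.json"):
            path.unlink()
            removed += 1
        return removed
```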
Authentication
- --api-token: Software Heritage API token (only used with --use-swh)
- Environment variables: GITHUB_TOKEN, SCANOSS_API_KEY, SWH_API_TOKEN
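Before a long scan it can be handy to check which of these optional tokens are present in the environment. The helper below is a hypothetical convenience for scripts that wrap src2id; it only inspects the three variable names listed above and never prints token values.

```python
import os
from typing import Dict, Mapping

# Environment variable names taken from the list above.
TOKEN_VARS = ("GITHUB_TOKEN", "SCANOSS_API_KEY", "SWH_API_TOKEN")


def configured_tokens(env: Mapping[str, str] = os.environ) -> Dict[str, bool]:
    """Report which optional API tokens are set to a non-blank value."""
    return {name: bool(env.get(name, "").strip()) for name in TOKEN_VARS}


for name, is_set in configured_tokens().items():
    print(f"{name}: {'set' if is_set else 'not set'}")
```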
Discovery Method Breakdown
# Default: UPMEX + SCANOSS + GitHub (fast)
src2id /path/to/project
# Add Software Heritage (comprehensive but slow)
src2id /path/to/project --use-swh
# Speed vs Comprehensiveness trade-off
src2id /path/to/project --no-license-detection # Faster
src2id /path/to/project --use-swh --verbose # Slower but complete
License
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0) - see the LICENSE file for details.
Status
This project is currently in active development. See the Issues page for planned features and known issues.
Download files
File details
Details for the file semantic_copycat_src2id-1.3.1.tar.gz.
File metadata
- Download URL: semantic_copycat_src2id-1.3.1.tar.gz
- Upload date:
- Size: 64.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 529e50bc06c0e1731d8a9295e42dcc96305eb9d1785b7a66db4df415fa612e7d |
| MD5 | 3c6b5674fd9ab0c6b04f6d10ab8d53da |
| BLAKE2b-256 | 8421e811dfd28461ab428191d281e2274a29d92438bcb061cc6f63b8c454eaa2 |
Provenance
The following attestation bundles were made for semantic_copycat_src2id-1.3.1.tar.gz:
Publisher: python-publish.yml on oscarvalenzuelab/semantic-copycat-src2id
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: semantic_copycat_src2id-1.3.1.tar.gz
- Subject digest: 529e50bc06c0e1731d8a9295e42dcc96305eb9d1785b7a66db4df415fa612e7d
- Sigstore transparency entry: 622785228
- Sigstore integration time:
- Permalink: oscarvalenzuelab/semantic-copycat-src2id@e1cd0196e4c02cfe3bc930d48410d211db5934d6
- Branch / Tag: refs/tags/v1.3.1
- Owner: https://github.com/oscarvalenzuelab
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@e1cd0196e4c02cfe3bc930d48410d211db5934d6
- Trigger Event: release
File details
Details for the file semantic_copycat_src2id-1.3.1-py3-none-any.whl.
File metadata
- Download URL: semantic_copycat_src2id-1.3.1-py3-none-any.whl
- Upload date:
- Size: 75.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 19feb8a68666dbe08c07b2085802adb0d6323bf3784843e87256617d63ecc703 |
| MD5 | aff298f19166d864600cc2bee1ecaf8a |
| BLAKE2b-256 | b4e0ae58ffe5ae038bd8db4d56c502411f7109294a420ca21202d17023ea4602 |
Provenance
The following attestation bundles were made for semantic_copycat_src2id-1.3.1-py3-none-any.whl:
Publisher: python-publish.yml on oscarvalenzuelab/semantic-copycat-src2id
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: semantic_copycat_src2id-1.3.1-py3-none-any.whl
- Subject digest: 19feb8a68666dbe08c07b2085802adb0d6323bf3784843e87256617d63ecc703
- Sigstore transparency entry: 622785231
- Sigstore integration time:
- Permalink: oscarvalenzuelab/semantic-copycat-src2id@e1cd0196e4c02cfe3bc930d48410d211db5934d6
- Branch / Tag: refs/tags/v1.3.1
- Owner: https://github.com/oscarvalenzuelab
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@e1cd0196e4c02cfe3bc930d48410d211db5934d6
- Trigger Event: release