Source to ID - Identify package coordinates and repositories from source code using multiple strategies
This project has been archived.
The maintainers of this project have marked this project as archived. No new releases are expected.
SRC2ID - Source Code to ID
A Python tool that identifies package coordinates (name, version, license, PURL) from source code directories using multiple identification strategies including web search, SCANOSS fingerprinting, and optionally Software Heritage archive.
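To make the idea of package coordinates concrete, here is a minimal sketch that assembles a Package URL from a type, namespace, name, and version. The `make_purl` helper and the example values are hypothetical and not part of src2id. It follows only the basic `pkg:type/namespace/name@version` shape and ignores qualifiers and subpaths.

```python
from urllib.parse import quote


def make_purl(pkg_type, name, version=None, namespace=None):
    """Build a minimal Package URL (hypothetical helper, not src2id's API)."""
    # Percent-encode each path segment; namespace may contain several segments.
    segments = [quote(part, safe="") for part in (namespace or "").split("/") if part]
    segments.append(quote(name, safe=""))
    purl = f"pkg:{pkg_type.lower()}/" + "/".join(segments)
    if version:
        purl += "@" + quote(version, safe="")
    return purl


# Example values are illustrative only.
print(make_purl("github", "requests", version="v2.31.0", namespace="psf"))
print(make_purl("pypi", "semantic-copycat-src2id", version="1.2.2"))
```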
Overview
src2id helps you identify packages in unknown code by:
- Using multiple identification strategies (hash search, web search, SCANOSS)
- Generating Software Heritage Identifiers (SWHIDs) for content hashing
- Searching across GitHub, Google, and other sources for matching code
- Applying SCANOSS fingerprinting for code similarity detection
- Providing confidence scores and Package URLs (PURLs) for identified packages
- Optionally querying Software Heritage archive (with --use-swh flag)
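The content hashing mentioned above can be illustrated for a single file. A content SWHID uses the same hash as a Git blob: SHA-1 over a `blob <length>\0` header followed by the file bytes. This is a minimal sketch of that scheme, not src2id's own code. The function name `content_swhid` is an assumption, and directory SWHIDs (which hash a sorted tree of entries) are not covered.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return the SWHID for a file's content (hypothetical helper name)."""
    # Same construction as a Git blob object: header, NUL byte, then the raw bytes.
    header = b"blob %d\x00" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


# The empty file has a well-known identifier.
print(content_swhid(b""))  # swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the identifier depends only on the bytes, two files with identical content map to the same SWHID regardless of name or location, which is what makes exact matching possible.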
Features
- Multiple Identification Strategies: Hash search, web search (GitHub, Google), SCANOSS fingerprinting
- Subcomponent Detection: Identifies multiple packages within monorepos and complex projects
- API-Conscious: Optimized strategy order to minimize API calls
- 30x Faster: Performance-optimized relative to an SWH-only approach
- Exact Matching: Find exact matches using content-based hashing (SWHIDs)
- Confidence Scoring: Multi-factor scoring for match reliability
- Package Coordinate Extraction: Extract name, version, and license information
- PURL Generation: Generate standard Package URLs for identified packages
- Persistent Caching: File-based cache with 24-hour TTL to avoid API rate limits
- Enhanced License Detection: Integration with oslili for improved license detection
- Multiple Output Formats: JSON and table output formats
- Software Heritage Optional: SWH archive querying available with --use-swh flag
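The persistent caching feature listed above could work roughly as follows. This is a sketch under stated assumptions, not the tool's actual implementation: the `FileCache` class, its on-disk layout (one JSON file per key, named by a SHA-256 of the key), and the method names are all hypothetical. Only the 24-hour TTL comes from the feature list.

```python
import hashlib
import json
import time
from pathlib import Path

DAY_SECONDS = 24 * 60 * 60  # 24-hour TTL, as stated in the feature list


class FileCache:
    """Tiny file-based cache with expiry (hypothetical, for illustration)."""

    def __init__(self, directory, ttl_seconds=DAY_SECONDS):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.ttl_seconds = ttl_seconds

    def _path(self, key):
        # Hash the key so that URLs and queries become safe file names.
        name = hashlib.sha256(key.encode("utf-8")).hexdigest() + ".json"
        return self.directory / name

    def set(self, key, value, now=None):
        stored_at = time.time() if now is None else now
        entry = {"stored_at": stored_at, "value": value}
        self._path(key).write_text(json.dumps(entry), encoding="utf-8")

    def get(self, key, now=None):
        path = self._path(key)
        if not path.exists():
            return None
        entry = json.loads(path.read_text(encoding="utf-8"))
        current = time.time() if now is None else now
        if current - entry["stored_at"] > self.ttl_seconds:
            path.unlink()  # expired entries are dropped on read
            return None
        return entry["value"]

    def clear(self):
        for path in self.directory.glob("*.json"):
            path.unlink()
```

A caller would check `cache.get(url)` before issuing an API request and call `cache.set(url, response)` afterwards, so repeated scans within a day reuse earlier responses instead of spending rate-limited calls.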
Installation
From Source
git clone https://github.com/oscarvalenzuelab/semantic-copycat-src2id.git
cd semantic-copycat-src2id
pip install -e .
Usage
Basic Usage
# Identify packages in a directory
src2id /path/to/source/code
# High confidence matches only
src2id /path/to/source --confidence-threshold 0.85
# JSON output format
src2id /path/to/source --output-format json
# Include Software Heritage checking
src2id /path/to/source --use-swh
# Detect subcomponents in monorepos
src2id /path/to/source --detect-subcomponents
# Skip license detection
src2id /path/to/source --no-license-detection
# Use API token for SWH authentication (when using --use-swh)
src2id /path/to/source --use-swh --api-token YOUR_TOKEN
# Or set via environment variable
export SWH_API_TOKEN=YOUR_TOKEN
src2id /path/to/source --use-swh
# Clear cache and exit
src2id --clear-cache
# Verbose output for debugging
src2id /path/to/source --verbose
API Authentication
Software Heritage (Optional)
When using --use-swh, you can provide a Software Heritage API token:
- Get an API token: Register at https://archive.softwareheritage.org/api/ and generate a token
- Use the token:
  - Via command line: --use-swh --api-token YOUR_TOKEN
  - Via environment variable: export SWH_API_TOKEN=YOUR_TOKEN
Other APIs
The tool can use several APIs for enhanced functionality. All are optional:
GitHub API (Recommended - Free)
export GITHUB_TOKEN=your_github_personal_access_token
- Create at: https://github.com/settings/tokens
- Increases rate limit from 10 to 30 requests/minute
- Improves repository search accuracy
SCANOSS API (Optional - Free)
export SCANOSS_API_KEY=your_scanoss_key
- Register at: https://www.scanoss.com
- Provides code fingerprinting and similarity detection
- Works without key but with rate limits
SerpAPI (Optional - Paid)
export SERPAPI_KEY=your_serpapi_key
- Sign up at: https://serpapi.com
- Enables Google search for code matching
- Requires paid subscription
Note: The tool works well without any API keys, just with reduced rate limits.
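Before a scan, it can help to check which of the optional credentials are present in the environment. The sketch below does only that. The environment variable names are the ones documented above, while the `configured_services` helper is a hypothetical name and does not reflect how src2id reads its configuration internally.

```python
import os

# Environment variables documented in this README, mapped to the service they unlock.
OPTIONAL_TOKENS = {
    "GITHUB_TOKEN": "GitHub API",
    "SCANOSS_API_KEY": "SCANOSS API",
    "SERPAPI_KEY": "SerpAPI",
    "SWH_API_TOKEN": "Software Heritage",
}


def configured_services(env=None):
    """Report which optional services have a non-empty token set (hypothetical helper)."""
    env = os.environ if env is None else env
    return {service: bool(env.get(var)) for var, service in OPTIONAL_TOKENS.items()}


for service, present in configured_services().items():
    status = "token set" if present else "no token (reduced rate limits or disabled)"
    print(f"{service}: {status}")
```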
SWHID Validation
# Generate and validate SWHID for a directory
src2id-validate /path/to/directory
# Compare against expected SWHID
src2id-validate /path/to/directory --expected-swhid swh:1:dir:abc123...
# Use fallback implementation
src2id-validate /path/to/directory --use-fallback --verbose
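Comparing a computed identifier against an expected one first requires that both are well-formed. The following sketch checks the core SWHID syntax (`swh:1:<type>:<40 hex digits>`) and then compares two identifiers. The helper names `parse_swhid` and `swhid_matches` are assumptions, qualifiers such as `;origin=` are deliberately not handled, and this is not the logic of `src2id-validate` itself.

```python
import re

# Core SWHID: scheme, version 1, object type, 40 lowercase hex digits.
SWHID_PATTERN = re.compile(r"^swh:1:(cnt|dir|rev|rel|snp):([0-9a-f]{40})$")


def parse_swhid(swhid):
    """Split a core SWHID into (object_type, hex_digest) or raise ValueError."""
    match = SWHID_PATTERN.match(swhid.strip())
    if match is None:
        raise ValueError(f"not a valid core SWHID: {swhid!r}")
    return match.group(1), match.group(2)


def swhid_matches(actual, expected):
    """True when both identifiers are valid and refer to the same object."""
    return parse_swhid(actual) == parse_swhid(expected)


empty_file = "swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
print(parse_swhid(empty_file))
print(swhid_matches(empty_file, empty_file))
```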
Command Line Options
- path: Directory path to analyze (required)
- --max-depth: Maximum directory depth to scan (default: 3)
- --confidence-threshold: Minimum confidence to report matches (default: 0.3)
- --output-format: Output format: 'json' or 'table' (default: table)
- --use-swh: Include Software Heritage archive checking (optional, slower)
- --detect-subcomponents: Detect and identify subcomponents in monorepos
- --no-cache: Disable API response caching
- --clear-cache: Clear all cached API responses and exit
- --no-license-detection: Skip automatic license detection from local source
- --api-token: Software Heritage API token (only used with --use-swh)
- --verbose: Verbose output for debugging
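For readers who want to see how these options and defaults fit together, here is a sketch that declares them with `argparse`. It mirrors only the flags and defaults listed above and is an assumption about structure, not the tool's real parser. In particular, making `path` optional so that `--clear-cache` can run alone is a guess based on the usage examples.

```python
import argparse


def build_parser():
    """Declare the documented options (illustrative sketch, not src2id's parser)."""
    parser = argparse.ArgumentParser(prog="src2id")
    # Assumption: path is optional here so that --clear-cache works without it.
    parser.add_argument("path", nargs="?", help="Directory path to analyze")
    parser.add_argument("--max-depth", type=int, default=3)
    parser.add_argument("--confidence-threshold", type=float, default=0.3)
    parser.add_argument("--output-format", choices=["json", "table"], default="table")
    parser.add_argument("--use-swh", action="store_true")
    parser.add_argument("--detect-subcomponents", action="store_true")
    parser.add_argument("--no-cache", action="store_true")
    parser.add_argument("--clear-cache", action="store_true")
    parser.add_argument("--no-license-detection", action="store_true")
    parser.add_argument("--api-token", default=None)
    parser.add_argument("--verbose", action="store_true")
    return parser


args = build_parser().parse_args(["/path/to/source", "--confidence-threshold", "0.85"])
print(args.path, args.max_depth, args.confidence_threshold, args.output_format)
```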
License
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0) - see the LICENSE file for details.
Status
This project is currently in active development. See the Issues page for planned features and known issues.