
Source to ID - Identify package coordinates and repositories from source code using multiple strategies

This project has been archived.

The maintainers of this project have marked this project as archived. No new releases are expected.


SRC2ID - Source Code to ID

A Python tool that identifies package coordinates (name, version, license, PURL) from source code directories. It combines several identification strategies: web search, SCANOSS fingerprinting, and, optionally, the Software Heritage archive.

Overview

src2id helps you identify packages in unknown code by:

  1. Using multiple identification strategies (hash search, web search, SCANOSS)
  2. Generating Software Heritage Identifiers (SWHIDs) for content hashing
  3. Searching across GitHub, Google, and other sources for matching code
  4. Applying SCANOSS fingerprinting to detect code similarity
  5. Providing confidence scores and Package URLs (PURLs) for identified packages
  6. Optionally querying Software Heritage archive (with --use-swh flag)
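
The content hashing in step 2 can be illustrated with a minimal sketch. A SWHID for file content (swh:1:cnt:...) uses the same SHA-1 that Git computes for a blob: the bytes are prefixed with a "blob <length>\0" header before hashing. The function name content_swhid is hypothetical and is not part of src2id's API.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return the version 1 SWHID for a blob of file content.

    Hypothetical helper for illustration; not part of src2id.
    """
    # Git-style blob header: the literal "blob", the byte length, and a NUL.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


if __name__ == "__main__":
    print(content_swhid(b"hello world\n"))
```

Because the digest matches what git hash-object reports for the same bytes, two files with identical content always map to the same identifier, regardless of file name or location.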

Features

  • Multiple Identification Strategies: Hash search, web search (GitHub, Google), SCANOSS fingerprinting
  • Subcomponent Detection: Identifies multiple packages within monorepos and complex projects
  • API-Conscious: Optimized strategy order to minimize API calls
  • 30x Faster: Performance-optimized compared to the SWH-only approach
  • Exact Matching: Find exact matches using content-based hashing (SWHIDs)
  • Confidence Scoring: Multi-factor scoring for match reliability
  • Package Coordinate Extraction: Extract name, version, and license information
  • PURL Generation: Generate standard Package URLs for identified packages
  • Persistent Caching: File-based cache with 24-hour TTL to avoid API rate limits
  • Enhanced License Detection: Integration with oslili for improved license detection
  • Multiple Output Formats: JSON and table output formats
  • Optional Software Heritage Support: SWH archive querying is available with the --use-swh flag
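
To make the PURL Generation feature concrete, here is a minimal sketch of how a basic Package URL is assembled from package coordinates, following the general pkg:type/namespace/name@version layout. The helper build_purl is hypothetical and does not reflect how src2id builds PURLs internally; qualifiers and subpaths are left out for brevity.

```python
from typing import Optional
from urllib.parse import quote


def build_purl(pkg_type: str, name: str, version: Optional[str] = None,
               namespace: Optional[str] = None) -> str:
    """Assemble a basic Package URL: pkg:type/namespace/name@version.

    Hypothetical helper for illustration; not part of src2id.
    """
    parts = [pkg_type.lower()]  # the type is written in lowercase
    if namespace:
        # A namespace may have several segments separated by "/".
        parts.extend(quote(seg, safe="") for seg in namespace.split("/") if seg)
    parts.append(quote(name, safe=""))
    purl = "pkg:" + "/".join(parts)
    if version:
        purl += "@" + quote(version, safe="")
    return purl


if __name__ == "__main__":
    print(build_purl("pypi", "requests", "2.31.0"))
    print(build_purl("npm", "core", "7.0.0", namespace="@babel"))
```

Percent-encoding each segment separately keeps characters such as "@" in a namespace from being confused with the version separator.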

Installation

From Source

git clone https://github.com/oscarvalenzuelab/semantic-copycat-src2id.git
cd semantic-copycat-src2id
pip install -e .

Usage

Basic Usage

# Identify packages in a directory
src2id /path/to/source/code

# High confidence matches only
src2id /path/to/source --confidence-threshold 0.85

# JSON output format
src2id /path/to/source --output-format json

# Include Software Heritage checking
src2id /path/to/source --use-swh

# Detect subcomponents in monorepos
src2id /path/to/source --detect-subcomponents

# Skip license detection
src2id /path/to/source --no-license-detection

# Use API token for SWH authentication (when using --use-swh)
src2id /path/to/source --use-swh --api-token YOUR_TOKEN

# Or set via environment variable
export SWH_API_TOKEN=YOUR_TOKEN
src2id /path/to/source --use-swh

# Clear cache and exit
src2id --clear-cache

# Verbose output for debugging
src2id /path/to/source --verbose
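
For scripted use, the commands above can be wrapped from Python. The sketch below assembles an argument list from flags documented in this README and, as an assumption, treats the standard output of --output-format json as a single JSON document. The function names are hypothetical, and the JSON schema is not described here, so the parsed value is returned as-is.

```python
import json
import subprocess
from typing import Any, List, Optional


def build_src2id_command(path: str,
                         confidence_threshold: Optional[float] = None,
                         use_swh: bool = False,
                         detect_subcomponents: bool = False) -> List[str]:
    """Build an argv list using only flags documented in this README."""
    cmd = ["src2id", path, "--output-format", "json"]
    if confidence_threshold is not None:
        cmd += ["--confidence-threshold", str(confidence_threshold)]
    if use_swh:
        cmd.append("--use-swh")
    if detect_subcomponents:
        cmd.append("--detect-subcomponents")
    return cmd


def run_src2id(path: str, **options: Any) -> Any:
    """Run src2id and parse its output.

    Assumption: stdout contains exactly one JSON document when
    --output-format json is given.
    """
    result = subprocess.run(build_src2id_command(path, **options),
                            capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```

Passing the arguments as a list rather than a shell string avoids quoting problems when the source path contains spaces.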

API Authentication

Software Heritage (Optional)

When using --use-swh, you can provide a Software Heritage API token:

  1. Get an API token: Register at https://archive.softwareheritage.org/api/ and generate a token
  2. Use the token:
    • Via command line: --use-swh --api-token YOUR_TOKEN
    • Via environment variable: export SWH_API_TOKEN=YOUR_TOKEN

Other APIs

The tool can use several APIs for enhanced functionality. All are optional:

GitHub API (Recommended - Free)

export GITHUB_TOKEN=your_github_personal_access_token

SCANOSS API (Optional - Free)

export SCANOSS_API_KEY=your_scanoss_key
  • Register at: https://www.scanoss.com
  • Provides code fingerprinting and similarity detection
  • Works without a key, but with rate limits

SerpAPI (Optional - Paid)

export SERPAPI_KEY=your_serpapi_key
  • Sign up at: https://serpapi.com
  • Enables Google search for code matching
  • Requires paid subscription

Note: The tool works without any API keys, but is then subject to lower rate limits.
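
Before a long scan, it can be useful to check which of these optional credentials are present in the environment. The following sketch only inspects the environment variable names listed in this README; the helper configured_apis is hypothetical and says nothing about how src2id itself reads or validates the keys.

```python
import os
from typing import Dict, Mapping, Optional

# Environment variable names taken from this README.
API_ENV_VARS = {
    "Software Heritage": "SWH_API_TOKEN",
    "GitHub": "GITHUB_TOKEN",
    "SCANOSS": "SCANOSS_API_KEY",
    "SerpAPI": "SERPAPI_KEY",
}


def configured_apis(env: Optional[Mapping[str, str]] = None) -> Dict[str, bool]:
    """Report which optional credentials are set and non-empty."""
    env = os.environ if env is None else env
    return {service: bool(env.get(var, "").strip())
            for service, var in API_ENV_VARS.items()}


if __name__ == "__main__":
    for service, present in configured_apis().items():
        print(f"{service}: {'configured' if present else 'not set'}")
```

The check never prints the key values themselves, so its output is safe to paste into a bug report.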

SWHID Validation

# Generate and validate SWHID for a directory
src2id-validate /path/to/directory

# Compare against expected SWHID
src2id-validate /path/to/directory --expected-swhid swh:1:dir:abc123...

# Use fallback implementation
src2id-validate /path/to/directory --use-fallback --verbose
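
When comparing against an expected identifier, it helps to check first that both strings are well-formed. The sketch below parses the core SWHID syntax (swh:1:<type>:<40 hex digits>) and compares two directory identifiers. The names parse_swhid and same_directory are hypothetical, optional qualifiers such as ";origin=..." are not handled, and this is not the logic used by src2id-validate.

```python
import re
from typing import NamedTuple

# Core SWHID syntax: swh:1:<object type>:<40 lowercase hex digits>.
_SWHID_RE = re.compile(r"swh:1:(cnt|dir|rev|rel|snp):([0-9a-f]{40})")


class Swhid(NamedTuple):
    object_type: str
    object_id: str


def parse_swhid(text: str) -> Swhid:
    """Split a core SWHID into its object type and hex identifier."""
    match = _SWHID_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a valid core SWHID: {text!r}")
    return Swhid(match.group(1), match.group(2))


def same_directory(actual: str, expected: str) -> bool:
    """True when both strings are the same valid directory SWHID."""
    a, e = parse_swhid(actual), parse_swhid(expected)
    return a.object_type == "dir" and a == e
```

A truncated value such as the placeholder swh:1:dir:abc123... is rejected with a ValueError instead of silently comparing as unequal.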

Command Line Options

  • path: Directory path to analyze (required)
  • --max-depth: Maximum directory depth to scan (default: 3)
  • --confidence-threshold: Minimum confidence to report matches (default: 0.3)
  • --output-format: Output format: 'json' or 'table' (default: table)
  • --use-swh: Include Software Heritage archive checking (optional, slower)
  • --detect-subcomponents: Detect and identify subcomponents in monorepos
  • --no-cache: Disable API response caching
  • --clear-cache: Clear all cached API responses and exit
  • --no-license-detection: Skip automatic license detection from local source
  • --api-token: Software Heritage API token (only used with --use-swh)
  • --verbose: Verbose output for debugging
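
The caching behavior that --no-cache and --clear-cache control can be pictured with a minimal sketch of a file-based cache with a 24-hour TTL. The class FileCache, its on-disk layout, and its method names are assumptions made for illustration; src2id's actual cache location and format are not described in this README.

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Any, Optional

TTL_SECONDS = 24 * 60 * 60  # 24-hour TTL, as stated under Features


class FileCache:
    """Tiny file-based cache; layout and naming are assumptions."""

    def __init__(self, directory: Path, ttl: float = TTL_SECONDS) -> None:
        self.directory = Path(directory)
        self.ttl = ttl
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        # Hash the key so any URL or query maps to a safe file name.
        name = hashlib.sha256(key.encode("utf-8")).hexdigest() + ".json"
        return self.directory / name

    def get(self, key: str, now: Optional[float] = None) -> Any:
        """Return the cached value, or None if missing or expired."""
        path = self._path(key)
        if not path.exists():
            return None
        entry = json.loads(path.read_text(encoding="utf-8"))
        now = time.time() if now is None else now
        if now - entry["stored_at"] > self.ttl:
            path.unlink()  # expired: drop the entry
            return None
        return entry["value"]

    def set(self, key: str, value: Any, now: Optional[float] = None) -> None:
        """Store a JSON-serializable value with its timestamp."""
        now = time.time() if now is None else now
        entry = {"stored_at": now, "value": value}
        self._path(key).write_text(json.dumps(entry), encoding="utf-8")

    def clear(self) -> int:
        """Delete every cached entry and return how many were removed."""
        removed = 0
        for path in self.directory.glob("*.json"):
            path.unlink()
            removed += 1
        return removed
```

Storing the timestamp inside each entry, rather than relying on file modification times, keeps expiry correct even if the cache directory is copied or restored from a backup.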

License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0) - see the LICENSE file for details.

Status

This project is currently in active development. See the Issues page for planned features and known issues.
