LicenseID - A portable SPDX License ID matcher

Get the SPDX License ID from license text.

A portable license ID matcher with command line interface and Python API.

Used as a license detection engine for the Pitloom software bill of materials generator.

Features

  • Hybrid matching pipeline:
    • Tier 0.5 (Marker detection): Detects SPDX-License-Identifier tags and structured markers (name fields, headings). An exact SPDX tag returns immediately with full confidence.
    • Tier 0 (Shortcut): Fast path for short inputs (names, IDs, brief expressions). Includes:
      • Case-insensitive exact ID match.
      • Prose-context disambiguation for bare deprecated IDs (e.g. "GPL-2.0 or later version" → GPL-2.0-or-later).
      • Conservative -only fallback when no granting context is present.
    • Tier 1 (Recall): Candidate retrieval using SQLite FTS5 trigram index, capped at the first 100 query words for consistent performance. Comment prefixes (//, #, *, ;) are stripped before querying.
    • Tier 2 (Precision): Adaptive ranking with RapidFuzz. Sliding-window alignment for fragments; coverage-aware scoring to prefer the tightest match. Marker confidence boosts applied only when confidence ≥ 0.85.
    • Tier 3 (Validation): Optional final validation via tools-java.
  • Deprecated ID normalisation:
    • GPL-2.0+ → GPL-2.0-or-later (SPDX + operator, unambiguous).
    • Apache-2+ → Apache-2.0+ (abbreviated base canonicalised, + retained).
    • Bare deprecated IDs (e.g. GPL-2.0) resolved conservatively to -only when no surrounding context is available.
  • Unix philosophy: Parseable, line-delimited CLI output.
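
The deprecated-ID rules above can be sketched in a few lines of Python. This is an illustrative sketch, not licenseid's implementation: the function name `normalise_id`, the `_OR_LATER` pattern, and the subset of IDs in `_BARE_DEPRECATED` are all assumptions made for the example, and the sketch is case-sensitive for brevity.

```python
import re

# Hypothetical pattern for prose that grants "or later" rights.
_OR_LATER = re.compile(
    r"\bor\s+(?:\(at\s+your\s+option\)\s+)?(?:any\s+)?later\b",
    re.IGNORECASE,
)

# Illustrative subset of deprecated GNU-family IDs that now need a suffix.
_BARE_DEPRECATED = {"GPL-2.0", "GPL-3.0", "LGPL-2.1", "LGPL-3.0", "AGPL-3.0"}


def normalise_id(license_id: str, context: str = "") -> str:
    """Map a deprecated GNU-family ID to its current SPDX form."""
    has_plus = license_id.endswith("+")
    base = license_id[:-1] if has_plus else license_id
    if base not in _BARE_DEPRECATED:
        return license_id  # not in the deprecated subset; leave untouched
    if has_plus or _OR_LATER.search(context):
        return f"{base}-or-later"  # explicit + operator or granting prose
    return f"{base}-only"  # conservative fallback when no context is given


print(normalise_id("GPL-2.0+"))
print(normalise_id("GPL-2.0", "GPL-2.0 or later version"))
print(normalise_id("GPL-2.0"))
```

The fallback to -only is the conservative choice: without evidence of an "or later" grant, the sketch does not widen the rights the license text gives.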

Installation

Install with pipx:

pipx install licenseid

Or using uv:

uv tool install licenseid

Usage

1. Update the license database

Before matching, you need to build the local license index:

licenseid update

Advanced update options:

  • --version <version>: Download a specific SPDX License List version (e.g., 3.28.0).
  • --force: Force update even if the local database is already at the target version.
  • --no-cache: Bypass the local cache for downloads.

2. Identify a license

Identify license text from a file, an ID, or a string:

# From a file (smart detection)
licenseid match LICENSE.txt

# From an ID (smart detection)
licenseid match MIT

# From a string (smart detection / piped)
echo "MIT License..." | licenseid match

# Explicit ID lookup (fastest, skips similarity check)
licenseid match --id MIT

Common options:

  • --db <path>: Use a custom database path (global option). Supports SQLite URIs for in-memory databases (e.g., file:test?mode=memory&cache=shared).
  • --id <id>: Explicitly treat input as an SPDX License ID (bypasses file/text matching).
  • --bold: Print only the top license ID (no other info).
  • --diff: Show a word-by-word diff between the input and the best-matching candidate.
  • --json: Output results in JSON format.
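
To illustrate what a shared in-memory SQLite URI means, the following Python sketch opens two connections to the same URI using the standard sqlite3 module. The table name `demo` is hypothetical and does not reflect licenseid's schema; the sketch only shows that both connections see the same in-memory database while at least one of them stays open.

```python
import sqlite3

# The URI from the --db example; mode=memory plus cache=shared lets several
# connections in one process share a single in-memory database.
URI = "file:test?mode=memory&cache=shared"

writer = sqlite3.connect(URI, uri=True)
writer.execute("CREATE TABLE demo (license_id TEXT)")  # hypothetical table
writer.execute("INSERT INTO demo VALUES ('MIT')")
writer.commit()

# A second connection to the same URI sees the committed row.
reader = sqlite3.connect(URI, uri=True)
rows = reader.execute("SELECT license_id FROM demo").fetchall()
print(rows)
```

The database disappears once the last connection to it is closed, which makes this form convenient for tests.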

The system uses a composite score (similarity + coverage bonus/penalty + optional popularity weight + marker confidence boost) to prefer the tightest match. For example, it distinguishes a short permissive license from a longer license that shares the same preamble.
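
A minimal sketch of how such a composite score could be assembled is shown below. Every weight and the exact coverage formula are assumptions chosen for illustration; only the four components and the 0.85 marker threshold come from the description in this document.

```python
def composite_score(similarity, coverage, popularity=0.0, marker_confidence=0.0):
    """Combine the four components into one ranking score (illustrative)."""
    # All weights below are hypothetical, not licenseid's actual values.
    coverage_weight = 0.10
    popularity_weight = 0.01
    marker_boost = 0.05
    marker_threshold = 0.85  # threshold stated in the Tier 2 description

    score = similarity
    # Coverage in [0, 1]: a bonus above 0.5, a penalty below it (assumed form).
    score += coverage_weight * (coverage - 0.5) * 2
    score += popularity_weight * popularity
    if marker_confidence >= marker_threshold:
        score += marker_boost
    return score


# Two candidates with equal similarity: the one with tighter coverage wins.
tight = composite_score(similarity=0.90, coverage=1.0)
loose = composite_score(similarity=0.90, coverage=0.4)
print(tight > loose)
```

With equal similarity, the coverage term is what separates a short license from a longer candidate that merely contains the same opening text.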

3. Cache management

licenseid maintains a local cache of remote data to save bandwidth.

  • licenses.json: Cached for 45 days.
  • popularity.csv: Cached for 75 days.
  • SPDX data tarballs are versioned and never expire.
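
The expiry rules above amount to a per-file time-to-live check. The sketch below is one way to express them in Python; the function name `is_stale`, the `TTL` table, and the tarball file name in the usage lines are hypothetical, and treating every unlisted file as non-expiring is an assumption of the sketch.

```python
from datetime import datetime, timedelta, timezone

# Lifetimes taken from the list above.
TTL = {
    "licenses.json": timedelta(days=45),
    "popularity.csv": timedelta(days=75),
}


def is_stale(filename, fetched_at, now=None):
    """Return True when a cached file is older than its time-to-live."""
    now = now or datetime.now(timezone.utc)
    ttl = TTL.get(filename)
    if ttl is None:
        return False  # assumed: unlisted files (versioned tarballs) never expire
    return now - fetched_at > ttl


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(is_stale("licenses.json", now - timedelta(days=46), now))
print(is_stale("popularity.csv", now - timedelta(days=46), now))
print(is_stale("spdx-data.tar.gz", now - timedelta(days=1000), now))  # hypothetical name
```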

To clear the cache manually:

licenseid --clear-cache

4. Output formats

Default (Unix-friendly):

LICENSE_ID=Apache-2.0 SIMILARITY=0.9850 COVERAGE=1.0000
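
Because the default output is a single line of KEY=VALUE pairs, it is easy to parse from a script. The following Python sketch assumes the three fields shown in the example line, separated by whitespace and with no spaces inside values; `parse_result_line` is a hypothetical helper, not part of licenseid.

```python
def parse_result_line(line):
    """Parse one line of the default output into a typed dictionary."""
    # Split on whitespace, then on the first "=" of each pair.
    fields = dict(pair.split("=", 1) for pair in line.split())
    return {
        "license_id": fields["LICENSE_ID"],
        "similarity": float(fields["SIMILARITY"]),
        "coverage": float(fields["COVERAGE"]),
    }


result = parse_result_line("LICENSE_ID=Apache-2.0 SIMILARITY=0.9850 COVERAGE=1.0000")
print(result["license_id"], result["similarity"], result["coverage"])
```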

ID only:

licenseid match LICENSE.txt --bold

Example output:

Apache-2.0

JSON:

licenseid match LICENSE.txt --json

Example output:

[
  {
    "license_id": "Apache-2.0",
    "score": 0.985,
    "similarity": 0.985,
    "coverage": 1.0,
    "is_spdx": true,
    "is_osi_approved": true
  }
]
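
The JSON form can be consumed with the standard json module. The sketch below uses the example output above as its input; `best_match` and its `min_score` parameter are hypothetical names introduced for illustration, and the sketch assumes the output is always a list of objects with a `score` field, as in the example.

```python
import json

# Sample copied from the example output above.
SAMPLE = """
[
  {
    "license_id": "Apache-2.0",
    "score": 0.985,
    "similarity": 0.985,
    "coverage": 1.0,
    "is_spdx": true,
    "is_osi_approved": true
  }
]
"""


def best_match(json_text, min_score=0.0):
    """Return the highest-scoring result at or above min_score, or None."""
    results = json.loads(json_text)
    candidates = [r for r in results if r["score"] >= min_score]
    return max(candidates, key=lambda r: r["score"], default=None)


match = best_match(SAMPLE, min_score=0.9)
if match is not None:
    print(match["license_id"], match["is_osi_approved"])
```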

Diff (visual comparison):

licenseid match LICENSE.txt --diff

Example output:

LICENSE_ID=Apache-2.0 SIMILARITY=0.9980 COVERAGE=0.9975

WORD DIFF:
--- DATABASE
+++ INPUT
@@ -1601,8 +1601,4 @@
 language
 governing
 permissions
-and
-limitations
-under
-the
-license
+se
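
A word-level diff of this shape can be produced with Python's standard difflib module by splitting both texts into words before comparing them. This is a sketch of the general technique, not necessarily how licenseid builds its --diff output; `word_diff` is a hypothetical helper.

```python
import difflib


def word_diff(database_text, input_text, context=3):
    """Return a unified diff computed over words instead of lines."""
    return list(
        difflib.unified_diff(
            database_text.split(),
            input_text.split(),
            fromfile="DATABASE",
            tofile="INPUT",
            n=context,      # number of unchanged words shown around a change
            lineterm="",    # words carry no newline, so add none to headers
        )
    )


lines = word_diff(
    "language governing permissions and limitations under the license",
    "language governing permissions se",
)
print("\n".join(lines))
```

Splitting on whitespace makes the diff insensitive to line wrapping, which matters for license texts that are often re-flowed.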

5. Exit codes

The CLI follows standard Unix exit code conventions, making it suitable for use in scripts and CI/CD pipelines.

Exit Code | Meaning       | Scenarios
0         | Success       | Confident match found; predicate is TRUE; database updated or already up-to-date.
1         | Logic Failure | No matching license found; predicate is FALSE; network error.
2         | Usage Error   | Missing subcommand; missing input text/file; invalid parameters.
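
When calling the CLI from Python rather than a shell, the same exit codes are available through subprocess. The sketch below is illustrative: `describe_exit` and `run_match` are hypothetical helpers, and `run_match` assumes licenseid is installed and on the PATH.

```python
import subprocess

# Meanings taken from the exit code table above.
EXIT_MEANINGS = {0: "success", 1: "logic failure", 2: "usage error"}


def describe_exit(code):
    """Translate an exit code into the meaning documented above."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_match(path):
    """Run `licenseid match <path>` and return (exit code, stdout)."""
    proc = subprocess.run(
        ["licenseid", "match", path],
        capture_output=True,
        text=True,
        check=False,  # inspect the exit code ourselves instead of raising
    )
    return proc.returncode, proc.stdout.strip()


print(describe_exit(1))
```

Note that exit code 1 covers both "no match" and network errors, so a caller that needs to tell them apart has to look at the program's output as well.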

6. License predicates (for CI/CD)

Predicate commands are designed for shell scripting. They print true/false and exit with 0 (for true) or 1 (for false).

Command | Description
is-spdx | True if the license is in the SPDX License List.
is-open | True if the license is OSI-approved OR FSF-libre.
is-free | Alias for is-open.
is-osi  | True if the license is OSI-approved.
is-fsf  | True if the license is FSF-libre.
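
The relationship between these predicates can be sketched as boolean functions over a license metadata entry. The sketch assumes entries shaped like those in the SPDX License List licenses.json file, which uses the fields isOsiApproved and isFsfLibre; whether licenseid reads exactly these fields is an assumption, and the two sample entries are abbreviated by hand for illustration.

```python
def is_osi(entry):
    """True if the entry is marked OSI-approved."""
    return bool(entry.get("isOsiApproved", False))


def is_fsf(entry):
    """True if the entry is marked FSF-libre (the field may be absent)."""
    return bool(entry.get("isFsfLibre", False))


def is_open(entry):
    """True if the entry is OSI-approved OR FSF-libre."""
    return is_osi(entry) or is_fsf(entry)


is_free = is_open  # alias, mirroring the is-free command

# Hand-written, abbreviated sample entries for illustration only.
mit = {"licenseId": "MIT", "isOsiApproved": True, "isFsfLibre": True}
cc_by_nc = {"licenseId": "CC-BY-NC-4.0", "isOsiApproved": False}

print(is_open(mit), is_open(cc_by_nc))
```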

Example usage in a script:

# Check by ID
if licenseid is-osi MIT; then
  echo "This is an OSI-approved license."
fi

# Check by File
licenseid is-open LICENSE.txt || echo "Warning: Not an open source license"

# Check by Text (via stdin)
echo "MIT License..." | licenseid is-fsf && echo "FSF Libre!"

Python API

You can use licenseid directly in your Python projects:

from licenseid.matcher import AggregatedLicenseMatcher

# Initialize with default database
matcher = AggregatedLicenseMatcher()

# 1. Match by Raw Text (Positional or Keyword)
# Programmatic API is explicit: positional 'text' is always treated as text.
results = matcher.match("Permission is hereby granted...")
results = matcher.match(text="Custom license text...")

# 2. Match by SPDX License ID (Explicit)
# This performs a fast database lookup and returns full metadata.
results = matcher.match(license_id="MIT")

# 3. Match by File Path (Explicit)
results = matcher.match(file_path="LICENSE.txt")

# 4. Predicates
# Supports keyword arguments for precise control.
if matcher.is_osi(license_id="MIT"):
    print("OSI Approved!")

if matcher.is_open(file_path="LICENSE.txt"):
    print("Open Source!")

if matcher.is_spdx(text="Creative Commons Zero v1.0 Universal"):
    print("SPDX Match Found!")

Example JSON output:

[
  {
    "license_id": "MIT",
    "score": 1.01,
    "similarity": 1.0,
    "coverage": 0.0
  }
]

Development

Running tests

Regular test suite:

pytest

Run benchmarks and accuracy tests (expensive):

pytest --run-benchmark

Configuration

  • SPDX_TOOLS_JAR: Path to the tools-java jar for Tier 3 validation.
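
A wrapper script might check this variable before relying on Tier 3 validation. The sketch below shows one way to do that in Python; `tools_java_jar` is a hypothetical helper and does not describe how licenseid itself reads the variable.

```python
import os
from pathlib import Path


def tools_java_jar(env=os.environ):
    """Return the configured jar path if it exists, otherwise None."""
    value = env.get("SPDX_TOOLS_JAR")
    if not value:
        return None  # variable unset or empty
    jar = Path(value)
    return jar if jar.is_file() else None


jar = tools_java_jar()
print("Tier 3 jar found" if jar else "Tier 3 jar not configured")
```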

License

Apache-2.0
