Get the SPDX License ID from license text
LicenseID - A portable SPDX License ID matcher
A portable license ID matcher with command line interface and Python API.
Used as the license detection engine for the Pitloom software bill of materials generator.
Features
- Hybrid matching pipeline:
  - Tier 0.5 (Marker detection): Detects SPDX-License-Identifier tags and structured markers (name fields, headings). An exact SPDX tag returns immediately with full confidence.
  - Tier 0 (Shortcut): Fast path for short inputs (names, IDs, brief expressions). Includes:
    - Case-insensitive exact ID match.
    - Prose-context disambiguation for bare deprecated IDs (e.g. "GPL-2.0 or later version" → GPL-2.0-or-later).
    - Conservative -only fallback when no granting context is present.
  - Tier 1 (Recall): Candidate retrieval using SQLite FTS5 trigram index, capped at the first 100 query words for consistent performance. Comment prefixes (//, #, *, ;) are stripped before querying.
  - Tier 2 (Precision): Adaptive ranking with RapidFuzz. Sliding-window alignment for fragments; coverage-aware scoring to prefer the tightest match. Marker confidence boosts applied only when confidence ≥ 0.85.
  - Tier 3 (Validation): Optional final validation via tools-java.
- Deprecated ID normalisation:
  - GPL-2.0+ → GPL-2.0-or-later (SPDX + operator, unambiguous).
  - Apache-2+ → Apache-2.0+ (abbreviated base canonicalised, + retained).
  - Bare deprecated IDs (e.g. GPL-2.0) resolved conservatively to -only when no surrounding context is available.
- Unix philosophy: Parseable, line-delimited CLI output.
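The Tier 1 preprocessing described in the feature list (stripping comment prefixes, then capping the query at 100 words) might look roughly like the sketch below. The function names, the regular expression, and the choice to strip only one leading marker per line are assumptions made for illustration; they are not taken from the licenseid source.

```python
import re

# Assumption: names and exact stripping rules are illustrative only.
MAX_QUERY_WORDS = 100
# One leading run of //, #, * or ; plus at most one following space.
_COMMENT_PREFIX = re.compile(r"^\s*(?://+|#+|\*+|;+)\s?")


def strip_comment_prefix(line: str) -> str:
    """Remove a single leading comment marker (//, #, *, ;) from a line."""
    return _COMMENT_PREFIX.sub("", line, count=1)


def build_query(text: str, max_words: int = MAX_QUERY_WORDS) -> str:
    """Strip comment prefixes line by line, then keep the first max_words words."""
    words = []
    for line in text.splitlines():
        words.extend(strip_comment_prefix(line).split())
        if len(words) >= max_words:
            break  # no need to read further once the cap is reached
    return " ".join(words[:max_words])
```

Capping the word count keeps the retrieval query a bounded size regardless of how long the input file is, which is presumably what "consistent performance" refers to.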
Installation
Install with pipx:
pipx install licenseid
Or using uv:
uv tool install licenseid
Usage
1. Update the license database
Before matching, you need to build the local license index:
licenseid update
Advanced update options:
- --version <version>: Download a specific SPDX License List version (e.g., 3.28.0).
- --force: Force update even if the local database is already at the target version.
- --no-cache: Bypass the local cache for downloads.
2. Identify a license
Identify a license from a file, an ID, or a string:
# From a file (smart detection)
licenseid match LICENSE.txt
# From an ID (smart detection)
licenseid match MIT
# From a string (smart detection / piped)
echo "MIT License..." | licenseid match
# Explicit ID lookup (fastest, skips similarity check)
licenseid match --id MIT
Common options:
- --db <path>: Use a custom database path (global option). Supports SQLite URIs for in-memory databases (e.g., file:test?mode=memory&cache=shared).
- --id <id>: Explicitly treat input as an SPDX License ID (bypasses file/text matching).
- --bold: Print only the top license ID (no other info).
- --diff: Show a word-by-word diff between the input and the best-matching candidate.
- --json: Output results in JSON format.
The system uses a composite score (similarity + coverage bonus/penalty + optional popularity weight + marker confidence boost) to prefer the tightest match. For example, it distinguishes a short permissive licence from a superset that shares the same preamble.
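To make the idea of a composite score concrete, here is a minimal sketch. Only the list of components and the 0.85 marker threshold come from this README; the weights, the coverage formula, the `Candidate` class, and the placeholder license IDs are all assumptions chosen for illustration, and the real scoring in licenseid may differ.

```python
from dataclasses import dataclass

# Threshold stated in the feature list; everything else below is assumed.
MARKER_BOOST_THRESHOLD = 0.85
COVERAGE_WEIGHT = 0.05    # hypothetical
POPULARITY_WEIGHT = 0.01  # hypothetical
MARKER_BOOST = 0.02       # hypothetical


@dataclass
class Candidate:
    license_id: str
    similarity: float               # 0..1 text similarity
    coverage: float                 # 0..1 share of the candidate covered by the input
    popularity: float = 0.0         # 0..1, optional
    marker_confidence: float = 0.0  # 0..1, from marker detection


def composite_score(c: Candidate) -> float:
    """Similarity plus a coverage bonus/penalty, popularity, and marker boost."""
    # Coverage above 0.5 earns a bonus, below 0.5 a penalty.
    score = c.similarity + COVERAGE_WEIGHT * (2 * c.coverage - 1)
    score += POPULARITY_WEIGHT * c.popularity
    if c.marker_confidence >= MARKER_BOOST_THRESHOLD:
        score += MARKER_BOOST
    return score


# Two candidates with equal similarity: the fully covered one ranks first.
short = Candidate("short-licence", similarity=0.98, coverage=1.0)
superset = Candidate("superset-licence", similarity=0.98, coverage=0.4)
ranked = sorted([superset, short], key=composite_score, reverse=True)
```

Under these assumed weights, a candidate whose text is only partly covered by the input is pushed below an equally similar candidate that is covered in full, which is the "tightest match" behaviour described above.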
3. Cache management
licenseid maintains a local cache of remote data to save bandwidth.
- licenses.json: Cached for 45 days.
- popularity.csv: Cached for 75 days.
- SPDX data tarballs are versioned and never expire.
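The expiry rules above can be pictured as a simple freshness check based on file modification time. This is only a sketch: the helper name `is_fresh`, the use of mtime, and the assumption that cache entries are plain files named as listed are not confirmed by the licenseid documentation.

```python
import os
import time
from typing import Optional

# TTLs as listed above; the lookup by file name is an assumption.
CACHE_TTL_DAYS = {"licenses.json": 45, "popularity.csv": 75}
SECONDS_PER_DAY = 86400


def is_fresh(path: str, now: Optional[float] = None) -> bool:
    """Return True if the cached file exists and has not outlived its TTL."""
    if not os.path.exists(path):
        return False
    ttl_days = CACHE_TTL_DAYS.get(os.path.basename(path))
    if ttl_days is None:
        # Anything without a TTL (e.g. versioned SPDX tarballs) never expires.
        return True
    current = time.time() if now is None else now
    age_seconds = current - os.path.getmtime(path)
    return age_seconds < ttl_days * SECONDS_PER_DAY
```

The optional `now` parameter exists only so the check can be exercised with a simulated clock.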
To clear the cache manually:
licenseid --clear-cache
4. Output formats
Default (Unix-friendly):
LICENSE_ID=Apache-2.0 SIMILARITY=0.9850 COVERAGE=1.0000
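Because the default output is a single line of KEY=value pairs, it is easy to consume from a script. The sketch below parses one such line in Python; it assumes the fields are separated by whitespace and that values contain no spaces, as in the example above. The function name `parse_result_line` is hypothetical.

```python
def parse_result_line(line: str) -> dict:
    """Parse 'KEY=value KEY=value ...' and convert the scores to floats."""
    # Split on whitespace, then on the first '=' of each token.
    fields = dict(token.split("=", 1) for token in line.split())
    return {
        "license_id": fields["LICENSE_ID"],
        "similarity": float(fields["SIMILARITY"]),
        "coverage": float(fields["COVERAGE"]),
    }


result = parse_result_line("LICENSE_ID=Apache-2.0 SIMILARITY=0.9850 COVERAGE=1.0000")
```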
ID only:
licenseid match LICENSE.txt --bold
Example output:
Apache-2.0
JSON:
licenseid match LICENSE.txt --json
Example output:
[
  {
    "license_id": "Apache-2.0",
    "score": 0.985,
    "similarity": 0.985,
    "coverage": 1.0,
    "is_spdx": true,
    "is_osi_approved": true
  }
]
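The JSON form is convenient when a script needs more than the top ID. As a small sketch, the helper below (a hypothetical name, not part of licenseid) takes the JSON text printed by the command and returns the IDs of results flagged as OSI-approved. It relies only on the keys visible in the example output above.

```python
import json


def osi_approved_ids(json_text: str) -> list:
    """Return license IDs of results whose is_osi_approved flag is true."""
    results = json.loads(json_text)
    return [r["license_id"] for r in results if r.get("is_osi_approved")]


sample = """
[
  {
    "license_id": "Apache-2.0",
    "score": 0.985,
    "similarity": 0.985,
    "coverage": 1.0,
    "is_spdx": true,
    "is_osi_approved": true
  }
]
"""
approved = osi_approved_ids(sample)
```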
Diff (visual comparison):
licenseid match LICENSE.txt --diff
Example output:
LICENSE_ID=Apache-2.0 SIMILARITY=0.9980 COVERAGE=0.9975
WORD DIFF:
--- DATABASE
+++ INPUT
@@ -1601,8 +1601,4 @@
language
governing
permissions
-and
-limitations
-under
-the
-license
+se
5. Exit codes
The CLI follows standard Unix exit code conventions, making it suitable for use in scripts and CI/CD pipelines.
| Exit Code | Meaning | Scenarios |
|---|---|---|
| 0 | Success | Confident match found; predicate is TRUE; database updated or already up-to-date. |
| 1 | Logic Failure | No matching license found; predicate is FALSE; network error. |
| 2 | Usage Error | Missing subcommand; missing input text/file; invalid parameters. |
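When calling the CLI from Python rather than a shell, the three exit codes in the table can be mapped to an enum. The following is a sketch under that assumption; `ExitCode` and `run_and_classify` are illustrative names, and the wrapper works with any command line passed as a list of arguments.

```python
import subprocess
from enum import IntEnum


class ExitCode(IntEnum):
    SUCCESS = 0        # confident match, predicate TRUE, database up to date
    LOGIC_FAILURE = 1  # no match, predicate FALSE, network error
    USAGE_ERROR = 2    # missing subcommand or input, invalid parameters


def run_and_classify(argv: list) -> ExitCode:
    """Run a command and translate its exit status into an ExitCode."""
    completed = subprocess.run(argv, capture_output=True, text=True)
    try:
        return ExitCode(completed.returncode)
    except ValueError:
        raise RuntimeError(
            f"unexpected exit code {completed.returncode}"
        ) from None
```

For instance, `run_and_classify(["licenseid", "match", "LICENSE.txt"])` would return `ExitCode.SUCCESS` on a confident match. Note that, per the table, exit code 1 covers both "no match" and network errors, so a caller cannot tell those apart from the exit code alone.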
6. License predicates (for CI/CD)
Predicate commands are designed for shell scripting. They print true/false and exit with 0 (for true) or 1 (for false).
| Command | Description |
|---|---|
| is-spdx | True if the license is in the SPDX License List. |
| is-open | True if the license is OSI-approved OR FSF-libre. |
| is-free | Alias for is-open. |
| is-osi | True if the license is OSI-approved. |
| is-fsf | True if the license is FSF-libre. |
Example usage in a script:
# Check by ID
if licenseid is-osi MIT; then
    echo "This is an OSI-approved license."
fi
# Check by File
licenseid is-open LICENSE.txt || echo "Warning: Not an open source license"
# Check by Text (via stdin)
echo "MIT License..." | licenseid is-fsf && echo "FSF Libre!"
Python API
You can use licenseid directly in your Python projects:
from licenseid.matcher import AggregatedLicenseMatcher
# Initialize with default database
matcher = AggregatedLicenseMatcher()
# 1. Match by Raw Text (Positional or Keyword)
# Programmatic API is explicit: positional 'text' is always treated as text.
results = matcher.match("Permission is hereby granted...")
results = matcher.match(text="Custom license text...")
# 2. Match by SPDX License ID (Explicit)
# This performs a fast database lookup and returns full metadata.
results = matcher.match(license_id="MIT")
# 3. Match by File Path (Explicit)
results = matcher.match(file_path="LICENSE.txt")
# 4. Predicates
# Supports keyword arguments for precise control.
if matcher.is_osi(license_id="MIT"):
    print("OSI Approved!")
if matcher.is_open(file_path="LICENSE.txt"):
    print("Open Source!")
if matcher.is_spdx(text="Creative Commons Zero v1.0 Universal"):
    print("SPDX Match Found!")
Example JSON output:
[
  {
    "license_id": "MIT",
    "score": 1.01,
    "similarity": 1.0,
    "coverage": 0.0
  }
]
Development
Running tests
Regular test suite:
pytest
Run benchmarks and accuracy tests (expensive):
pytest --run-benchmark
Configuration
- SPDX_TOOLS_JAR: Path to the tools-java jar for Tier 3 validation.
License
Apache-2.0