Open Source License Identification Library
OSS License & Copyright Detector (osslili)
A high-performance tool for identifying licenses and copyright information in local source code. It reports evidence of where each license was detected, covers all 700+ SPDX license identifiers, and feeds compliance documentation for the SEMCL.ONE ecosystem.
Features
- Three-Tier License Detection: Dice-Sørensen similarity, TLSH fuzzy hashing, and regex pattern matching
- Evidence-Based Output: Exact file paths, confidence scores, and detection methods
- 700+ SPDX Licenses: Comprehensive support for all SPDX license identifiers
- SEMCL.ONE Integration: Works seamlessly with purl2notices, ospac, and other ecosystem tools
How It Works
Three-Tier License Detection System
The tool combines three detection tiers:
- Tier 1: Dice-Sørensen Similarity with TLSH Confirmation
  - Compares license text using Dice-Sørensen coefficient (97% threshold)
  - Confirms matches using TLSH fuzzy hashing to prevent false positives
  - Achieves 97-100% accuracy on standard SPDX licenses
- Tier 2: TLSH Fuzzy Hash Matching
  - Uses Trend Micro Locality Sensitive Hashing for variant detection
  - Catches license variants like MIT-0, BSD-2-Clause vs BSD-3-Clause
  - Pre-computed hashes for all 700+ SPDX licenses
- Tier 3: Pattern Recognition
  - Regex-based detection for license references and identifiers
  - Extracts from comments, headers, and documentation
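To make the Tier 1 comparison concrete, here is a minimal sketch of a Dice-Sørensen coefficient computed over word bigrams. The tokenization, the choice of bigrams, and the function names are assumptions made for illustration; osslili's internal normalization may differ. Only the 0.97 threshold is taken from the description above.

```python
import re
from collections import Counter


def bigrams(text: str) -> Counter:
    """Return a multiset of adjacent word pairs from the lowercased text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(zip(words, words[1:]))


def dice_coefficient(a: str, b: str) -> float:
    """Dice-Sørensen coefficient: 2 * |A ∩ B| / (|A| + |B|)."""
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        return 0.0  # neither text has a bigram to compare
    overlap = sum((x & y).values())  # multiset intersection
    return 2 * overlap / total


def is_match(candidate: str, reference: str, threshold: float = 0.97) -> bool:
    """Treat the candidate as a match when similarity reaches the threshold."""
    return dice_coefficient(candidate, reference) >= threshold
```

Because the coefficient is computed on multisets, repeated phrases in long license texts count as often as they occur, which keeps a short excerpt from scoring as a full match.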
Additional Detection Methods
- Package Metadata Scanning: Detects licenses from package.json, composer.json, pyproject.toml, etc.
- Copyright Extraction: Advanced pattern matching with validation and deduplication
- SPDX Identifier Detection: Finds SPDX-License-Identifier tags in source files
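As an illustration of SPDX identifier detection, the sketch below pulls SPDX-License-Identifier tags out of source text with a regular expression. The pattern and the helper name find_spdx_tags are assumptions for this example and are not taken from osslili's code; the pattern handles simple AND/OR/WITH expressions but not parenthesized ones.

```python
import re

# One license ID, optionally followed by "AND"/"OR"/"WITH" plus another ID.
_ID = r"[A-Za-z0-9.+\-]+"
SPDX_TAG = re.compile(
    rf"SPDX-License-Identifier:\s*({_ID}(?:\s+(?:AND|OR|WITH)\s+{_ID})*)"
)


def find_spdx_tags(source: str) -> list[str]:
    """Return every SPDX expression declared in the given text, in order."""
    return [match.group(1) for match in SPDX_TAG.finditer(source)]
```

The match stops at the first character that cannot belong to an identifier or operator, so trailing comment closers such as `*/` are left out of the result.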
Installation
pip install osslili
For development:
git clone https://github.com/SemClone/osslili.git
cd osslili
pip install -e .
Quick Start
# Fast default scan (LICENSE files + metadata + docs) - RECOMMENDED
osslili .
# Comprehensive deep scan (all source files)
osslili . --deep
# Generate SBOM with license evidence
osslili ./my-project -f cyclonedx-json -o sbom.json
Usage
Scanning Modes
osslili offers three scanning modes optimized for different use cases:
Default Mode (Recommended)
Fast and practical - scans LICENSE files, package metadata, and documentation.
# Scans: LICENSE*, README*, *.md, *.txt, package.json, go.mod, etc.
osslili ./my-project
What it scans:
- LICENSE files: LICENSE, COPYING, NOTICE, COPYRIGHT, etc. (28+ patterns)
- Documentation: README, CHANGELOG, CONTRIBUTING (.txt, .md, .rst, .adoc)
- Package metadata: package.json, go.mod, Cargo.toml, pom.xml, etc. (40+ files)
- Coverage: 12+ package ecosystems (npm, Python, Go, Java, .NET, Rust, Ruby, PHP, Swift, Dart, Elixir, Scala)
Performance: ~8 seconds on ffmpeg-6.0 (4,000+ files)
Use case: Daily development, CI/CD pipelines, quick license checks
Deep Mode (Comprehensive)
Thorough scan of all source files for embedded licenses.
# Scans ALL files: .py, .js, .java, .c, .go, etc.
osslili ./my-project --deep
Performance: ~5 minutes on ffmpeg-6.0 (40x slower than default)
Use case: Legal compliance reviews, finding embedded license headers
Strict Mode (Fastest)
LICENSE files only - maximum speed.
# Scans ONLY LICENSE files (no metadata, no README)
osslili ./my-project --license-files-only
Performance: ~7 seconds on ffmpeg-6.0
Use case: When you only need declared licenses
CLI Usage
# Default scan - fast and smart (RECOMMENDED)
osslili /path/to/project
# Deep scan - comprehensive but slower
osslili /path/to/project --deep
# Strict scan - LICENSE files only
osslili /path/to/project --license-files-only
# Generate different output formats
osslili ./my-project -f kissbom -o kissbom.json
osslili ./my-project -f cyclonedx-json -o sbom.json
osslili ./my-project -f cyclonedx-xml -o sbom.xml
# Scan with parallel processing (default: 4 threads)
osslili ./my-project --threads 8
# Scan with limited depth (only 2 levels deep)
osslili ./my-project --max-depth 2
# Extract and scan archives
osslili package.tar.gz --max-extraction-depth 2
# Use caching for faster repeated scans
osslili ./my-project --cache-dir ~/.cache/osslili
# Check version
osslili --version
# Save results to file
osslili ./my-project -o license-evidence.json
# With custom configuration and verbose output
osslili ./src --config config.yaml --verbose
# Debug mode for detailed logging
osslili ./project --debug
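For CI scripts that choose a scanning mode programmatically, a small wrapper can assemble the command line. The sketch below is hypothetical: build_command and run_scan are not part of osslili, and the wrapper assumes the flags shown above (--deep, --license-files-only, -f, -o) can be combined freely.

```python
import subprocess
from typing import Optional

# Mode names are this sketch's own labels; the flags come from the CLI examples.
MODE_FLAGS = {
    "default": [],
    "deep": ["--deep"],
    "strict": ["--license-files-only"],
}


def build_command(path: str, mode: str = "default",
                  fmt: Optional[str] = None,
                  output: Optional[str] = None) -> list[str]:
    """Assemble the osslili argument list without running anything."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown mode: {mode!r}")
    cmd = ["osslili", path, *MODE_FLAGS[mode]]
    if fmt:
        cmd += ["-f", fmt]
    if output:
        cmd += ["-o", output]
    return cmd


def run_scan(path: str, **options) -> int:
    """Run the scan and return the exit code; requires osslili on PATH."""
    return subprocess.run(build_command(path, **options), check=False).returncode
```

Passing the arguments as a list rather than a shell string avoids quoting problems when project paths contain spaces.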
Example Output
{
  "scan_results": [{
    "path": "./project",
    "license_evidence": [
      {
        "file": "/path/to/project/LICENSE",
        "detected_license": "Apache-2.0",
        "confidence": 0.988,
        "detection_method": "dice-sorensen",
        "category": "declared",
        "match_type": "text_similarity",
        "description": "Text matches Apache-2.0 license (98.8% similarity)"
      },
      {
        "file": "/path/to/project/package.json",
        "detected_license": "Apache-2.0",
        "confidence": 1.0,
        "detection_method": "tag",
        "category": "declared",
        "match_type": "spdx_identifier",
        "description": "SPDX-License-Identifier: Apache-2.0 found"
      }
    ],
    "copyright_evidence": [
      {
        "file": "/path/to/project/src/main.py",
        "holder": "Example Corp",
        "years": [2023, 2024],
        "statement": "Copyright 2023-2024 Example Corp"
      }
    ]
  }],
  "summary": {
    "total_files_scanned": 42,
    "declared_licenses": {"Apache-2.0": 2},
    "detected_licenses": {},
    "referenced_licenses": {},
    "copyright_holders": ["Example Corp"]
  }
}
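A saved evidence file can be post-processed with the standard library alone. The sketch below groups files by detected license, using only the field names that appear in the example output; the helper names load_report and files_by_license are hypothetical, and the code assumes every evidence entry carries detected_license and file keys.

```python
import json
from collections import defaultdict


def load_report(path: str) -> dict:
    """Read an evidence file written with `-o`, e.g. license-evidence.json."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def files_by_license(report: dict, min_confidence: float = 0.0) -> dict[str, list[str]]:
    """Map each detected license to the sorted list of files that support it."""
    grouped = defaultdict(set)
    for scan in report.get("scan_results", []):
        for evidence in scan.get("license_evidence", []):
            if evidence.get("confidence", 0.0) >= min_confidence:
                grouped[evidence["detected_license"]].add(evidence["file"])
    return {lic: sorted(files) for lic, files in grouped.items()}
```

Raising min_confidence is a simple way to ignore weak similarity matches while keeping exact tag matches, which are reported with a confidence of 1.0 in the example.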
Library Usage
from osslili import LicenseCopyrightDetector
# Initialize detector
detector = LicenseCopyrightDetector()
# Process a local directory
result = detector.process_local_path("/path/to/source")
# Process a single file
result = detector.process_local_path("/path/to/LICENSE")
# Generate different output formats
evidence = detector.generate_evidence([result])
kissbom = detector.generate_kissbom([result])
cyclonedx = detector.generate_cyclonedx([result], format_type="json")
cyclonedx_xml = detector.generate_cyclonedx([result], format_type="xml")
# Access results directly
for lic in result.licenses:
    print(f"License: {lic.spdx_id} ({lic.confidence:.0%} confidence)")
    print(f"  Category: {lic.category}")  # declared, detected, or referenced
for cr in result.copyrights:
    print(f"Copyright: © {cr.holder}")
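The attributes shown above (spdx_id, confidence, category) are enough to build a simple allowlist check on top of the results. The following is a sketch only: policy_violations, the allowlist, and the 0.9 confidence cutoff are assumptions for illustration, and the sample objects are stand-ins so the snippet runs without osslili installed.

```python
from types import SimpleNamespace
from typing import Iterable


def policy_violations(licenses: Iterable, allowed: set[str],
                      min_confidence: float = 0.9) -> list[str]:
    """Return SPDX IDs found with enough confidence that are not allowed."""
    flagged = {
        lic.spdx_id
        for lic in licenses
        if lic.confidence >= min_confidence and lic.spdx_id not in allowed
    }
    return sorted(flagged)


# Stand-ins exposing the same attributes as the entries in result.licenses.
sample = [
    SimpleNamespace(spdx_id="Apache-2.0", confidence=0.988, category="declared"),
    SimpleNamespace(spdx_id="GPL-3.0-only", confidence=0.97, category="detected"),
    SimpleNamespace(spdx_id="AGPL-3.0-only", confidence=0.40, category="referenced"),
]
print(policy_violations(sample, allowed={"Apache-2.0", "MIT"}))  # ['GPL-3.0-only']
```

With a real scan, the same function could be called as policy_violations(result.licenses, allowed=...), assuming the result object exposes the attributes used here.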
Output Format
The tool outputs JSON evidence showing:
- File path: Where the license was found
- Detected license: The SPDX identifier of the license
- Confidence: How confident the detection is (0.0 to 1.0)
- Match type: How the license was detected (license_text, spdx_identifier, license_reference, text_similarity)
- Detection method: The technique that produced the match (for example dice-sorensen or tag)
- Category: Whether the license is declared, detected, or referenced
- Description: Human-readable description of what was found
Configuration
Create a config.yaml file:
similarity_threshold: 0.97
max_recursion_depth: 10
max_extraction_depth: 10
thread_count: 4
cache_dir: "~/.cache/osslili"
custom_aliases:
  "Apache 2": "Apache-2.0"
  "MIT License": "MIT"
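To show what an alias table like custom_aliases is for, the sketch below maps free-form license names to SPDX identifiers. How osslili applies these aliases internally is not documented here, so the case-insensitive, whitespace-tolerant lookup and the function name normalize_license_name are assumptions.

```python
def normalize_license_name(name: str, aliases: dict[str, str]) -> str:
    """Return the SPDX ID for a known alias, or the input name unchanged."""
    lookup = {alias.casefold(): spdx_id for alias, spdx_id in aliases.items()}
    key = " ".join(name.split()).casefold()  # collapse runs of whitespace
    return lookup.get(key, name)


# Same entries as the custom_aliases section of config.yaml above.
CUSTOM_ALIASES = {"Apache 2": "Apache-2.0", "MIT License": "MIT"}
```

Unknown names pass through untouched, so the function can be applied to every declared license string without losing information.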
Documentation
- User Guide - Comprehensive usage examples and configuration
- API Reference - Python API documentation and examples
- SPDX Updates - How to update SPDX license data
- Performance Benchmarks - Comparison with other tools
Contributing
We welcome contributions! Please see CONTRIBUTING.md for details on:
- Code of conduct
- Development setup
- Submitting pull requests
- Reporting issues
Support
For support and questions:
- GitHub Issues - Bug reports and feature requests
- Documentation - Complete project documentation
License
Apache License 2.0 - see LICENSE file for details.
Authors
See AUTHORS.md for a list of contributors.
Part of the SEMCL.ONE ecosystem for comprehensive OSS compliance and code analysis.