Skip to main content

Open Source License Identification Library

Project description

OSS License & Copyright Detector (osslili)

License Python 3.8+ PyPI version

A high-performance tool for identifying licenses and copyright information in local source code. Produces detailed evidence of where licenses are detected with support for all 700+ SPDX license identifiers, enabling comprehensive compliance documentation for the SEMCL.ONE ecosystem.

Features

  • Three-Tier License Detection: Dice-Sørensen similarity, TLSH fuzzy hashing, and regex pattern matching
  • Evidence-Based Output: Exact file paths, confidence scores, and detection methods
  • 700+ SPDX Licenses: Comprehensive support for all SPDX license identifiers
  • SEMCL.ONE Integration: Works seamlessly with purl2notices, ospac, and other ecosystem tools

How It Works

Three-Tier License Detection System

The tool uses a sophisticated multi-tier approach for maximum accuracy:

  1. Tier 1: Dice-Sørensen Similarity with TLSH Confirmation

    • Compares license text using Dice-Sørensen coefficient (97% threshold)
    • Confirms matches using TLSH fuzzy hashing to prevent false positives
    • Achieves 97-100% accuracy on standard SPDX licenses
  2. Tier 2: TLSH Fuzzy Hash Matching

    • Uses Trend Micro Locality Sensitive Hashing for variant detection
    • Catches license variants like MIT-0, BSD-2-Clause vs BSD-3-Clause
    • Pre-computed hashes for all 700+ SPDX licenses
  3. Tier 3: Pattern Recognition

    • Regex-based detection for license references and identifiers
    • Extracts from comments, headers, and documentation

Additional Detection Methods

  • Package Metadata Scanning: Detects licenses from package.json, composer.json, pyproject.toml, etc.
  • Copyright Extraction: Advanced pattern matching with validation and deduplication
  • SPDX Identifier Detection: Finds SPDX-License-Identifier tags in source files

Installation

pip install osslili

For development:

git clone https://github.com/SemClone/osslili.git
cd osslili
pip install -e .

Quick Start

# Fast default scan (LICENSE files + metadata + docs) - RECOMMENDED
osslili .

# Comprehensive deep scan (all source files)
osslili . --deep

# Generate SBOM with license evidence
osslili ./my-project -f cyclonedx-json -o sbom.json

Usage

Scanning Modes

osslili offers three scanning modes optimized for different use cases:

Default Mode (Recommended)

Fast and practical - scans LICENSE files, package metadata, and documentation.

# Scans: LICENSE*, README*, *.md, *.txt, package.json, go.mod, etc.
osslili ./my-project

What it scans:

  • LICENSE files: LICENSE, COPYING, NOTICE, COPYRIGHT, etc. (28+ patterns)
  • Documentation: README, CHANGELOG, CONTRIBUTING (.txt, .md, .rst, .adoc)
  • Package metadata: package.json, go.mod, Cargo.toml, pom.xml, etc. (40+ files)
  • Coverage: 12+ package ecosystems (npm, Python, Go, Java, .NET, Rust, Ruby, PHP, Swift, Dart, Elixir, Scala)

Performance: ~8 seconds on ffmpeg-6.0 (4,000+ files) Use case: Daily development, CI/CD pipelines, quick license checks

Deep Mode (Comprehensive)

Thorough scan of all source files for embedded licenses.

# Scans ALL files: .py, .js, .java, .c, .go, etc.
osslili ./my-project --deep

Performance: ~5 minutes on ffmpeg-6.0 (40x slower than default) Use case: Legal compliance reviews, finding embedded license headers

Strict Mode (Fastest)

LICENSE files only - maximum speed.

# Scans ONLY LICENSE files (no metadata, no README)
osslili ./my-project --license-files-only

Performance: ~7 seconds on ffmpeg-6.0 Use case: When you only need declared licenses


CLI Usage

# Default scan - fast and smart (RECOMMENDED)
osslili /path/to/project

# Deep scan - comprehensive but slower
osslili /path/to/project --deep

# Strict scan - LICENSE files only
osslili /path/to/project --license-files-only

# Generate different output formats
osslili ./my-project -f kissbom -o kissbom.json
osslili ./my-project -f cyclonedx-json -o sbom.json
osslili ./my-project -f cyclonedx-xml -o sbom.xml

# Scan with parallel processing (default: 4 threads)
osslili ./my-project --threads 8

# Scan with limited depth (only 2 levels deep)
osslili ./my-project --max-depth 2

# Extract and scan archives
osslili package.tar.gz --max-extraction-depth 2

# Use caching for faster repeated scans
osslili ./my-project --cache-dir ~/.cache/osslili

# Check version
osslili --version

# Save results to file
osslili ./my-project -o license-evidence.json

# With custom configuration and verbose output
osslili ./src --config config.yaml --verbose

# Debug mode for detailed logging
osslili ./project --debug

Example Output

{
  "scan_results": [{
    "path": "./project",
    "license_evidence": [
      {
        "file": "/path/to/project/LICENSE",
        "detected_license": "Apache-2.0",
        "confidence": 0.988,
        "detection_method": "dice-sorensen",
        "category": "declared",
        "match_type": "text_similarity",
        "description": "Text matches Apache-2.0 license (98.8% similarity)"
      },
      {
        "file": "/path/to/project/package.json",
        "detected_license": "Apache-2.0",
        "confidence": 1.0,
        "detection_method": "tag",
        "category": "declared",
        "match_type": "spdx_identifier",
        "description": "SPDX-License-Identifier: Apache-2.0 found"
      }
    ],
    "copyright_evidence": [
      {
        "file": "/path/to/project/src/main.py",
        "holder": "Example Corp",
        "years": [2023, 2024],
        "statement": "Copyright 2023-2024 Example Corp"
      }
    ]
  }],
  "summary": {
    "total_files_scanned": 42,
    "declared_licenses": {"Apache-2.0": 2},
    "detected_licenses": {},
    "referenced_licenses": {},
    "copyright_holders": ["Example Corp"]
  }
}

Library Usage

from osslili import LicenseCopyrightDetector

# Initialize detector
detector = LicenseCopyrightDetector()

# Process a local directory
result = detector.process_local_path("/path/to/source")

# Process a single file  
result = detector.process_local_path("/path/to/LICENSE")

# Generate different output formats
evidence = detector.generate_evidence([result])
kissbom = detector.generate_kissbom([result])
cyclonedx = detector.generate_cyclonedx([result], format_type="json")
cyclonedx_xml = detector.generate_cyclonedx([result], format_type="xml")

# Access results directly
for license in result.licenses:
    print(f"License: {license.spdx_id} ({license.confidence:.0%} confidence)")
    print(f"  Category: {license.category}")  # declared, detected, or referenced
for copyright in result.copyrights:
    print(f"Copyright: © {copyright.holder}")

Output Format

The tool outputs JSON evidence showing:

  • File path: Where the license was found
  • Detected license: The SPDX identifier of the license
  • Confidence: How confident the detection is (0.0 to 1.0)
  • Match type: How the license was detected (license_text, spdx_identifier, license_reference, text_similarity)
  • Description: Human-readable description of what was found

Configuration

Create a config.yaml file:

similarity_threshold: 0.97
max_recursion_depth: 10
max_extraction_depth: 10
thread_count: 4
cache_dir: "~/.cache/osslili"
custom_aliases:
  "Apache 2": "Apache-2.0"
  "MIT License": "MIT"

Documentation

Contributing

We welcome contributions! Please see CONTRIBUTING.md for details on:

  • Code of conduct
  • Development setup
  • Submitting pull requests
  • Reporting issues

Support

For support and questions:

License

Apache License 2.0 - see LICENSE file for details.

Authors

See AUTHORS.md for a list of contributors.


Part of the SEMCL.ONE ecosystem for comprehensive OSS compliance and code analysis.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

osslili-1.6.3.tar.gz (386.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

osslili-1.6.3-py3-none-any.whl (388.8 kB view details)

Uploaded Python 3

File details

Details for the file osslili-1.6.3.tar.gz.

File metadata

  • Download URL: osslili-1.6.3.tar.gz
  • Upload date:
  • Size: 386.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for osslili-1.6.3.tar.gz
Algorithm Hash digest
SHA256 1f24ba437bfc5e58a6b5e9255e54f2022c0cfd9ea6908816db818a6c64330ed5
MD5 86b09f4e877b18797a752a0220c0a654
BLAKE2b-256 01d397f921a8c884330ba1539f3bed021d3ddd19dd122ee57e7ede1120eeb866

See more details on using hashes here.

Provenance

The following attestation bundles were made for osslili-1.6.3.tar.gz:

Publisher: python-publish.yml on SemClone/osslili

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file osslili-1.6.3-py3-none-any.whl.

File metadata

  • Download URL: osslili-1.6.3-py3-none-any.whl
  • Upload date:
  • Size: 388.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for osslili-1.6.3-py3-none-any.whl
Algorithm Hash digest
SHA256 257f08131d8370c8e44251224bafab8bea4f0b9daed1e9e90b5f8a56e8954d97
MD5 2526acb1bd5d5478648e05cee00cde4d
BLAKE2b-256 fa4d51727796bcec35f81ce9e2f7f13a5bdf05bfb5337de5597081e7431b248a

See more details on using hashes here.

Provenance

The following attestation bundles were made for osslili-1.6.3-py3-none-any.whl:

Publisher: python-publish.yml on SemClone/osslili

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page