spdx-matcher

A package for extracting licenses from free text using the SPDX license matching algorithm.

Project description

The spdx_matcher module is a tool to help detect licenses from text files.

Simple usage:

import spdx_matcher

with open("LICENSE.txt") as f:
    license_text = f.read()

licenses_detected, percent = spdx_matcher.analyse_license_text(license_text)

The detected licenses are returned as a simple dictionary of the form:

{
  "license": {
    "<spdx license id>": {
      "string":"value"
      ....
    }
  },
  "exceptions": {
    "<spdx exception id>": {
      "string":"value"
      ....
    }
  },
}

Here each "string": "value" pair is a named attribute taken from the sections named "var" in the SPDX license template specification, keyed by the name defined in the template.
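As a rough illustration of walking a dictionary of that shape, here is a minimal sketch. The sample data is invented for the example (the "MIT" key and the "copyright" attribute name are assumptions, not output captured from spdx_matcher), and summarise is a hypothetical helper, not part of the package.

```python
# Hypothetical result, shaped like the dictionary described above.
# The license ID and the "copyright" attribute name are made-up sample values.
sample_result = {
    "license": {
        "MIT": {"copyright": "Copyright (c) 2023 Example Author"},
    },
    "exceptions": {},
}


def summarise(result):
    """Return (kind, spdx_id, attribute_name, value) rows from a result dict."""
    rows = []
    for kind in ("license", "exceptions"):
        # .get() tolerates a missing top-level key.
        for spdx_id, attributes in result.get(kind, {}).items():
            for name, value in attributes.items():
                rows.append((kind, spdx_id, name, value))
    return rows


for row in summarise(sample_result):
    print(row)
```

Flattening the nested structure into rows like this makes it easy to log or tabulate which template variables were captured for each detected license or exception.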

The matcher also provides a number of other useful functions:

Method	Purpose	Arguments	Returns
normalize	Takes raw text and makes it useful for matching or hashing purposes. Its main behavior is defined by the SPDX matching specification; the steps and flags are listed below.	license_text: the input text that requires normalizing. remove_sections: default spdx_matcher.REMOVE_FINGERPRINT; controls which sections the normaliser removes.	The normalised text, ready for comparison.
analyse_license_text	Parses input text and identifies which license, if any, can be detected in it.	license_text: raw, non-normalised license text.	spdx.LicenseMatch: object that contains the scan results.

At its core, normalize runs through all the basics of normalising:

* lowercases the text
* changes runs of white space to single spaces
* normalizes copyright
* normalizes URLs
* applies varietal words from the SPDX list
* normalizes quotes to a single quote
* normalizes hyphens and dashes
* removes bullets and numbering

In addition, the remove_sections flags optionally remove parts of the text:

* spdx_matcher.LICENSE_HEADER_REMOVAL: removes the license header.
* spdx_matcher.COPYRIGHT_REMOVAL: removes lines featuring the word "copyright".
* spdx_matcher.APPENDIX_ADDENDUM_REMOVAL: removes any text with 'Appendix', 'Addendum', 'Exhibit' or 'Appendum'.
* spdx_matcher.REMOVE_NONE: normalises but does not remove any of the sections above.
* spdx_matcher.REMOVE_FINGERPRINT = LICENSE_HEADER_REMOVAL | COPYRIGHT_REMOVAL: intended to allow rapid hash matching of license texts, as it removes only the license header and the copyright lines, which are unique to whoever produced the license.

For license comparison you may want to use these flags in various ways depending on context. It is also worth noting that for license matching none should be removed. The flags can be combined, as REMOVE_FINGERPRINT is above, to change behavior.
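To make the normalising steps and removal flags described above more concrete, here is a toy sketch of flag-controlled normalisation. It is not spdx_matcher's implementation: toy_normalize is a hypothetical function, the flag values are invented for this example, and only a few of the listed steps (lowercasing, whitespace, quotes, dashes, copyright-line removal) are imitated.

```python
import re

# Hypothetical flag values for this sketch only; the real constants live in
# spdx_matcher and their numeric values are not stated in this document.
REMOVE_NONE = 0
LICENSE_HEADER_REMOVAL = 1
COPYRIGHT_REMOVAL = 2
REMOVE_FINGERPRINT = LICENSE_HEADER_REMOVAL | COPYRIGHT_REMOVAL


def toy_normalize(text, remove_sections=REMOVE_NONE):
    """Imitate a few normalisation steps; not the library's algorithm."""
    lines = text.splitlines()
    if remove_sections & COPYRIGHT_REMOVAL:
        # Drop every line that mentions "copyright", in any letter case.
        lines = [ln for ln in lines if "copyright" not in ln.lower()]
    text = " ".join(lines).lower()
    # Map double, curly and backtick quotes to a plain single quote.
    text = re.sub("[\"\u201c\u201d\u2018\u2019`]", "'", text)
    # Map en and em dashes to a plain hyphen.
    text = re.sub("[\u2013\u2014]", "-", text)
    # Collapse any run of white space to one space.
    return re.sub(r"\s+", " ", text).strip()


raw = "Copyright (c) 2023 Jane\nPermission is  hereby \u201cgranted\u201d"
print(toy_normalize(raw))
print(toy_normalize(raw, REMOVE_FINGERPRINT))
```

In this sketch the flags are combined with `|` and tested inside the function with `&`, which is the usual pattern for bit flags in Python.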

spdx_matcher is designed to work offline. Each release contains a cache built from the JSON in the SPDX source at https://github.com/spdx/license-list-data/tree/main/json. The package includes the script used to build that cache, build_spdx_matcher_cache. A copy of the cache is bundled with each release and used by default.

You can override the bundled cache with a locally built one by setting the environment variable SPDX_MATCHER_CACHE_FILE, which controls where the cache is written and read.
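A minimal sketch of setting that variable from Python follows. The cache path is a made-up example, and the assumption that the variable must be set before spdx_matcher is imported is a cautious guess; the text above does not say when the package reads it.

```python
import os
from pathlib import Path

# Hypothetical location for a locally built cache file.
cache_file = Path.home() / ".cache" / "spdx_matcher" / "cache.json"

# Set the variable before importing spdx_matcher. This sketch assumes the
# package reads it no later than first use, which the text does not confirm.
os.environ["SPDX_MATCHER_CACHE_FILE"] = str(cache_file)

print(os.environ["SPDX_MATCHER_CACHE_FILE"])
```

The same effect can be had by exporting the variable in the shell before running build_spdx_matcher_cache or any program that uses the package.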

Note that building the cache also checks the matchers against sample texts to validate that they work. The SPDX data is not perfect, as the build output demonstrates, but it is quite good.

This is currently in early release.

Release history

Version	Released
0.0.13	Dec 31, 2023
0.0.12	Oct 5, 2023
0.0.11	Apr 9, 2023
0.0.10	Mar 18, 2023
0.0.9	Mar 18, 2023
0.0.8	Mar 18, 2023
0.0.7	Mar 18, 2023
0.0.6	Mar 17, 2023
0.0.5	Mar 17, 2023
0.0.4	Mar 17, 2023 (this version)
0.0.3	Mar 17, 2023
0.0.2	Mar 16, 2023
0.0.1	Mar 16, 2023

Download files

Source Distribution

spdx_matcher-0.0.4.tar.gz (1.5 MB)

Uploaded Mar 17, 2023 Source

Built Distribution

spdx_matcher-0.0.4-py3-none-any.whl (1.5 MB)

Uploaded Mar 17, 2023 Python 3

Hashes for spdx_matcher-0.0.4.tar.gz
Algorithm	Hash digest
SHA256	`215779a35b94013ecd073a370f78e53285de22687e87329b6748c13c82b4c5d8`
MD5	`08b8be3a8b9e04071d98b620d3e9145f`
BLAKE2b-256	`d3fcb2988831e8c389c04b84633982f993b770e16b6036a32e8957f7244a0c33`

Hashes for spdx_matcher-0.0.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`dc6fdacedaa6bed362e19fbcffc1bde78343543533b62d6ee591b224c8027459`
MD5	`3ca5f8867bce20d811bd2786556fcc9d`
BLAKE2b-256	`24d90cf2c13c914a07ec5073b6435da33647b1166436acb8231790bafbe67224`
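The published digests can be used to check a downloaded file. The following is a minimal sketch using only Python's standard hashlib module; sha256_of and matches_published_hash are hypothetical helper names, and the commented example assumes the source distribution was saved to the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Example (assumes the sdist is in the current directory):
# matches_published_hash(
#     "spdx_matcher-0.0.4.tar.gz",
#     "215779a35b94013ecd073a370f78e53285de22687e87329b6748c13c82b4c5d8",
# )
```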