Universal Package Metadata Extractor - Extract metadata from various package formats

This project has been archived.

The maintainers of this project have marked this project as archived. No new releases are expected.

UPMEX - Universal Package Metadata Extractor

Extract metadata and license information from various package formats with a single tool.

Features

Core Capabilities

  • Universal Package Support: Extract metadata from 13 package ecosystems
  • Multi-Format Detection: Automatic package type identification
  • Standardized Output: Consistent JSON structure across all formats
  • Native Extraction: No dependency on external package managers
  • High Performance: Process packages up to 500MB in under 10 seconds

Advanced Features

  • NO-ASSERTION Handling: Clear indication for unavailable data
  • Dependency Mapping: Full dependency tree with version constraints
  • Author Parsing: Intelligent name/email extraction and normalization
  • Copyright Holder Integration: Automatically adds copyright holders to authors list
  • Repository Detection: Automatic VCS URL extraction
  • Platform Support: Architecture and OS requirement detection
  • Package URL (PURL): Generate standard Package URLs for all packages
  • File Hashing: SHA-1, MD5, and fuzzy hash (TLSH) for package files
  • JSON Organization: Structured output with package, metadata, people, licensing, copyright sections
  • Data Provenance: Track source of each data field for attestation
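As an illustration of the File Hashing feature, here is a minimal sketch of computing SHA-1 and MD5 digests with the Python standard library. The helper name hash_file is hypothetical and is not part of the UPMEX API. TLSH is left out because it needs a third-party library.

```python
import hashlib


def hash_file(path: str, chunk_size: int = 1 << 20) -> dict:
    """Return SHA-1 and MD5 hex digests of a file, reading it in chunks.

    Hypothetical helper for illustration; not UPMEX's own implementation.
    """
    sha1 = hashlib.sha1()
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large packages never sit fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            sha1.update(chunk)
            md5.update(chunk)
    return {"sha1": sha1.hexdigest(), "md5": md5.hexdigest()}
```

Both digests are updated in a single pass, so the file is read only once.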

Supported Ecosystems

  • Python: wheel (.whl), sdist (.tar.gz, .zip)
  • NPM/Node.js: .tgz, .tar.gz packages
  • Java/Maven: .jar, .war, .ear with POM support
  • Gradle: build.gradle, build.gradle.kts files
  • CocoaPods: .podspec, .podspec.json files
  • Conda: .conda (zip), .tar.bz2 packages
  • Perl/CPAN: .tar.gz, .zip with META.json/yml
  • Conan C/C++: conanfile.py, conanfile.txt, .tgz packages
  • Ruby Gems: .gem packages
  • Rust Crates: .crate packages
  • Go Modules: .zip archives, go.mod files
  • NuGet/.NET: .nupkg packages
  • Linux: (Planned) Debian .deb, RPM .rpm
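The format list above suggests a simple first pass for package type detection: look at the file name. The sketch below is an assumption about how such a check could work, not UPMEX's actual detection logic. The function name guess_ecosystem and the ecosystem labels are hypothetical. Ambiguous suffixes such as .tar.gz, .tgz, and .zip are deliberately not mapped, because several ecosystems share them and the archive contents would have to be inspected.

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping derived from the ecosystem list above.
SUFFIX_TO_ECOSYSTEM = {
    ".whl": "python",
    ".jar": "maven",
    ".war": "maven",
    ".ear": "maven",
    ".gem": "ruby",
    ".crate": "rust",
    ".nupkg": "nuget",
    ".conda": "conda",
    ".podspec": "cocoapods",
}

# Build and manifest files that are recognized by their full name.
FILENAME_TO_ECOSYSTEM = {
    "build.gradle": "gradle",
    "build.gradle.kts": "gradle",
    "conanfile.py": "conan",
    "conanfile.txt": "conan",
    "go.mod": "go",
}


def guess_ecosystem(path: str) -> Optional[str]:
    """Guess the ecosystem from the file name alone; None means ambiguous."""
    p = Path(path)
    name = p.name.lower()
    if name in FILENAME_TO_ECOSYSTEM:
        return FILENAME_TO_ECOSYSTEM[name]
    # Compound suffixes must be checked before the single-suffix lookup.
    if name.endswith(".podspec.json"):
        return "cocoapods"
    if name.endswith(".tar.bz2"):
        return "conda"
    return SUFFIX_TO_ECOSYSTEM.get(p.suffix.lower())
```

A None result signals that the name is not enough and the contents (for example PKG-INFO, package.json, or META.json inside the archive) need to be examined.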

Advanced License & Copyright Detection

  • Powered by OSLiLI: Integration with oslili for accurate license and copyright detection
  • Multi-Method Detection:
    • Tag-based detection for short license identifiers (MIT, Apache-2.0, etc.)
    • SPDX-License-Identifier exact matching
    • Fuzzy hash (TLSH) matching against normalized license texts
    • Regex-based pattern matching with comprehensive SPDX support
    • Confidence scoring (0.0-1.0) with detection method tracking
  • Copyright Extraction: Automatic extraction of copyright statements from source files
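Of the detection methods listed above, SPDX-License-Identifier matching is the easiest to show in isolation. The following is a minimal sketch using a regular expression. It does not reproduce how oslili implements the check, and the function name find_spdx_expressions is hypothetical.

```python
import re
from typing import List

# Capture everything after the tag up to the end of the line.
SPDX_TAG = re.compile(r"SPDX-License-Identifier:[ \t]*([^\r\n]+)")
# Trailing comment closers (C-style and HTML/XML) are not part of the expression.
COMMENT_CLOSER = re.compile(r"\s*(\*/|-->)\s*$")


def find_spdx_expressions(text: str) -> List[str]:
    """Return the unique SPDX expressions declared in text, in order of appearance."""
    found: List[str] = []
    for match in SPDX_TAG.finditer(text):
        expression = COMMENT_CLOSER.sub("", match.group(1)).strip()
        if expression and expression not in found:
            found.append(expression)
    return found
```

The function returns the raw expression (for example "Apache-2.0 OR MIT") without validating it against the SPDX license list. A real detector would add that validation and attach a confidence score.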

API Integrations

  • ClearlyDefined: License and compliance data enrichment
  • Ecosyste.ms: Package registry metadata and dependencies
  • Maven Central: Parent POM resolution and inheritance
  • Offline-First: Extraction runs without internet connectivity by default; the integrations above are used only in online mode (--online)
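To make the ClearlyDefined integration more concrete, the sketch below builds a definitions URL from package coordinates. It assumes the public ClearlyDefined coordinate scheme of type/provider/namespace/name/revision, with "-" standing in for an empty namespace. The function name clearlydefined_url is hypothetical, and this README does not describe how UPMEX itself calls the service.

```python
from typing import Optional
from urllib.parse import quote

# Assumed base URL of the public ClearlyDefined definitions endpoint.
CLEARLYDEFINED_API = "https://api.clearlydefined.io/definitions"


def clearlydefined_url(
    pkg_type: str,
    provider: str,
    name: str,
    revision: str,
    namespace: Optional[str] = None,
) -> str:
    """Build a definitions URL; '-' is used when the package has no namespace."""
    parts = [pkg_type, provider, namespace or "-", name, revision]
    # Percent-encode each segment so characters like '@' or '/' stay inside it.
    return CLEARLYDEFINED_API + "/" + "/".join(quote(p, safe="") for p in parts)
```

For example, clearlydefined_url("npm", "npmjs", "lodash", "4.17.21") yields a URL ending in /npm/npmjs/-/lodash/4.17.21. The function only builds the string; no request is sent.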

Installation

# Install from source
git clone https://github.com/oscarvalenzuelab/semantic-copycat-upmex.git
cd semantic-copycat-upmex
pip install -e .

# Install with all features (includes oslili for license detection)
pip install -e ".[all]"

# Install for development
pip install -e ".[dev]"

Quick Start

import json

from upmex import PackageExtractor

# Create the extractor
extractor = PackageExtractor()

# Extract metadata from a package
metadata = extractor.extract("path/to/package.whl")

# Access individual fields; the license list may be empty, so fall back to 'Unknown'
license_id = metadata.licenses[0].spdx_id if metadata.licenses else "Unknown"
print(f"Package: {metadata.name} v{metadata.version}")
print(f"Type: {metadata.package_type.value}")
print(f"License: {license_id}")

# Convert to JSON
print(json.dumps(metadata.to_dict(), indent=2))
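For more than one package, the same two calls can be wrapped in a small loop. The sketch below is a hypothetical helper, not part of UPMEX. It relies only on the extract() and to_dict() methods shown in the Quick Start, and takes the extractor as an argument so that it can be tried with any object offering those methods. Which exceptions extract() raises is not documented here, so the broad except is an assumption.

```python
from typing import Any, Dict, Iterable


def extract_many(extractor: Any, paths: Iterable[str]) -> Dict[str, dict]:
    """Extract metadata for each path; failures are recorded, not raised."""
    results: Dict[str, dict] = {}
    for path in paths:
        key = str(path)
        try:
            results[key] = extractor.extract(key).to_dict()
        except Exception as exc:  # assumption: the error types are not documented
            results[key] = {"error": str(exc)}
    return results
```

With UPMEX installed, a call might look like extract_many(PackageExtractor(), ["a.whl", "b.jar"]). The result maps each path either to its metadata dictionary or to an error entry.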

CLI Usage

# Basic extraction (offline mode - default)
upmex extract package.whl

# Online mode - fetches parent POMs and queries APIs
upmex extract --online package.jar

# With pretty JSON output
upmex extract --pretty package.whl

# Output to file
upmex extract package.whl -o metadata.json

# Text format output
upmex extract --format text package.tar.gz

# Detect package type
upmex detect package.jar

# Extract license information with confidence scores
upmex license package.tgz --confidence
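The CLI can also be driven from a script. The sketch below assembles an extract command from the flags shown above and parses the result. It assumes that upmex is on the PATH, that the flags can be combined, and that JSON is printed to standard output when -o is not given. None of these points is confirmed by this README, and both function names are hypothetical.

```python
import json
import subprocess
from typing import List, Optional


def build_extract_command(
    package: str,
    online: bool = False,
    pretty: bool = False,
    output: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for 'upmex extract' from the documented flags."""
    cmd = ["upmex", "extract"]
    if online:
        cmd.append("--online")
    if pretty:
        cmd.append("--pretty")
    cmd.append(package)
    if output:
        cmd += ["-o", output]
    return cmd


def run_extract(package: str, online: bool = False) -> dict:
    """Run the CLI and parse its standard output as JSON (assumed default format)."""
    result = subprocess.run(
        build_extract_command(package, online=online),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(result.stdout)
```

Passing the command as a list rather than a shell string avoids quoting problems with package paths that contain spaces.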

Configuration

UPMEX can be configured through a JSON file or through environment variables:

Environment Variables

# API Keys
export PME_CLEARLYDEFINED_API_KEY=your-api-key
export PME_ECOSYSTEMS_API_KEY=your-api-key

# Settings
export PME_LOG_LEVEL=DEBUG
export PME_CACHE_DIR=/path/to/cache
export PME_OUTPUT_FORMAT=json
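All of the variables above share the PME_ prefix. As a sketch of how a script could collect them, the hypothetical helper below strips the prefix and lowercases the rest of the name. How UPMEX reads these variables internally is not described here.

```python
import os
from typing import Dict, Mapping, Optional

PREFIX = "PME_"


def read_pme_settings(environ: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Collect PME_* variables into a dict keyed by the lowercased suffix."""
    env = os.environ if environ is None else environ
    return {
        key[len(PREFIX):].lower(): value
        for key, value in env.items()
        if key.startswith(PREFIX)
    }
```

With the exports above in place, read_pme_settings() would include entries such as log_level and cache_dir. Passing a plain dict instead of relying on os.environ makes the function easy to test.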

Configuration File

Create a config.json:

{
  "api": {
    "clearlydefined": {
      "enabled": true,
      "api_key": null
    }
  },
  "output": {
    "format": "json",
    "pretty_print": true
  }
}
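A configuration file like this is usually layered over built-in defaults. The sketch below shows one way to do that with a recursive merge. The DEFAULTS values and the function names are hypothetical; UPMEX's real defaults and loading code are not shown in this README.

```python
import json
from copy import deepcopy

# Hypothetical defaults that mirror the structure of the example file.
DEFAULTS = {
    "api": {"clearlydefined": {"enabled": True, "api_key": None}},
    "output": {"format": "json", "pretty_print": False},
}


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict in which values from override replace those in base."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections instead of replacing them wholesale.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def load_config(path: str) -> dict:
    """Read a JSON config file and layer it over DEFAULTS."""
    with open(path, encoding="utf-8") as fh:
        return deep_merge(DEFAULTS, json.load(fh))
```

Because the merge is recursive, a file that sets only output.pretty_print keeps the default output.format.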

Supported Package Types

Ecosystem    Formats                    Online Mode
Python       .whl, .tar.gz, .zip        API enrichment
NPM          .tgz, .tar.gz              API enrichment
Java         .jar, .war, .ear           Parent POM fetch
Maven        .jar with POM              Parent POM fetch
Gradle       build.gradle(.kts)         API enrichment
CocoaPods    .podspec(.json)            API enrichment
Conda        .conda, .tar.bz2           API enrichment
Perl/CPAN    .tar.gz, .zip              API enrichment
Conan        conanfile.py/.txt          -
Ruby         .gem                       API enrichment
Rust         .crate                     API enrichment
Go           .zip, .mod, go.mod         API enrichment
NuGet        .nupkg                     API enrichment

Changelog

See CHANGELOG.md for a detailed history of changes.

License

MIT License - see LICENSE file for details.
