Universal Package Metadata Extractor - Extract metadata from various package formats
This project has been archived.
The maintainers of this project have marked this project as archived. No new releases are expected.
UPMEX - Universal Package Metadata Extractor
Extract metadata and license information from various package formats with a single tool.
Features
- Multi-Ecosystem Support: Python (wheel, sdist), NPM, Java (JAR, Maven), Gradle, CocoaPods, Conda, Ruby Gems, Rust Crates, Go Modules, NuGet
- License Detection:
  - Regex-based detection for 24+ SPDX identifiers
  - Dice-Sørensen coefficient for fuzzy matching
  - Confidence scoring and multi-license support
- Offline/Online Modes: Default offline mode with optional online enrichment
- NO-ASSERTION Handling: Clear indication when data cannot be determined
- Parent POM Fetching: Automatic retrieval of Maven parent metadata in online mode
- API Integration: ClearlyDefined and Ecosyste.ms support in online mode
- Standardized Output: Consistent JSON structure across all package types
- Native Extraction: No dependency on package managers
- Comprehensive Testing: 95+ tests with full coverage
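The Dice-Sørensen coefficient named above scores two token sets A and B as 2|A∩B| / (|A| + |B|). The following is a minimal sketch of that formula, not UPMEX's own code: the function names and the choice of character bigrams as tokens are assumptions made for illustration.

```python
def _normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def bigrams(text: str) -> set[str]:
    """Return the set of character bigrams of the normalized text."""
    normalized = _normalize(text)
    return {normalized[i:i + 2] for i in range(len(normalized) - 1)}


def dice_coefficient(a: str, b: str) -> float:
    """Dice-Sørensen similarity in [0, 1] between two strings."""
    set_a, set_b = bigrams(a), bigrams(b)
    if not set_a or not set_b:
        # Too short to form bigrams: fall back to an exact comparison.
        return 1.0 if _normalize(a) == _normalize(b) else 0.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))
```

A detector built on this idea would compare a package's license text against reference texts and keep the best score above some threshold.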
Installation
# Install from source
git clone https://github.com/oscarvalenzuelab/semantic-copycat-upmex.git
cd semantic-copycat-upmex
pip install -e .
# Install with all features
pip install -e ".[all]"
# Install for development
pip install -e ".[dev]"
Quick Start
from upmex import PackageExtractor
# Create extractor
extractor = PackageExtractor()
# Extract metadata from a package
metadata = extractor.extract("path/to/package.whl")
# Access metadata
print(f"Package: {metadata.name} v{metadata.version}")
print(f"Type: {metadata.package_type.value}")
print(f"License: {metadata.licenses[0].spdx_id if metadata.licenses else 'Unknown'}")
# Convert to JSON
import json
print(json.dumps(metadata.to_dict(), indent=2))
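The Quick Start reads only the first license and prints 'Unknown' when there is none. A small helper can collect every detected SPDX identifier instead. This is a sketch under assumptions: `license_ids` is a hypothetical helper, it relies only on the `licenses` and `spdx_id` attributes shown above, and the literal `NO-ASSERTION` fallback string is assumed from the feature list rather than confirmed. Stand-in objects are used so the snippet runs without UPMEX installed.

```python
from types import SimpleNamespace


def license_ids(metadata) -> list[str]:
    """Collect SPDX ids from metadata.licenses, or ['NO-ASSERTION'] if none are set."""
    licenses = getattr(metadata, "licenses", None) or []
    ids = [lic.spdx_id for lic in licenses if getattr(lic, "spdx_id", None)]
    return ids or ["NO-ASSERTION"]


# Stand-ins for the object returned by extractor.extract(...).
dual = SimpleNamespace(licenses=[SimpleNamespace(spdx_id="MIT"),
                                 SimpleNamespace(spdx_id="Apache-2.0")])
empty = SimpleNamespace(licenses=[])

print(license_ids(dual))   # ['MIT', 'Apache-2.0']
print(license_ids(empty))  # ['NO-ASSERTION']
```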
CLI Usage
# Basic extraction (offline mode - default)
upmex extract package.whl
# Online mode - fetches parent POMs and queries APIs
upmex extract --online package.jar
# With pretty JSON output
upmex extract --pretty package.whl
# Output to file
upmex extract package.whl -o metadata.json
# Text format output
upmex extract --format text package.tar.gz
# Detect package type
upmex detect package.jar
# Extract license information with confidence scores
upmex license package.tgz --confidence
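The CLI can also be driven from a script. The sketch below wraps `upmex extract` with `subprocess`, using only the `--online` and `--pretty` flags listed above. The helper names are hypothetical, and parsing stdout as JSON assumes the default output format is JSON, which the configuration section suggests but this page does not state outright.

```python
import json
import subprocess


def build_extract_command(package_path: str, online: bool = False,
                          pretty: bool = False) -> list[str]:
    """Assemble the argv list for `upmex extract`."""
    cmd = ["upmex", "extract"]
    if online:
        cmd.append("--online")
    if pretty:
        cmd.append("--pretty")
    cmd.append(package_path)
    return cmd


def run_extract(package_path: str, **options) -> dict:
    """Run the CLI and parse its stdout as JSON (requires upmex on PATH)."""
    result = subprocess.run(
        build_extract_command(package_path, **options),
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)
```

Keeping command construction separate from execution makes the argument handling easy to check without the tool installed.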
Configuration
Configuration can be done via JSON files or environment variables:
Environment Variables
# API Keys
export PME_CLEARLYDEFINED_API_KEY=your-api-key
export PME_ECOSYSTEMS_API_KEY=your-api-key
# Settings
export PME_LOG_LEVEL=DEBUG
export PME_CACHE_DIR=/path/to/cache
export PME_LICENSE_METHODS=regex,dice_sorensen
export PME_OUTPUT_FORMAT=json
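To show how such variables might be consumed, here is a minimal sketch that reads the `PME_*` settings into a dictionary. How UPMEX itself parses them is not documented here, so the function name, the returned keys, and the comma-splitting of `PME_LICENSE_METHODS` are assumptions.

```python
import os
from typing import Mapping


def read_pme_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read the PME_* settings from an environment mapping."""
    methods = env.get("PME_LICENSE_METHODS", "")
    return {
        "log_level": env.get("PME_LOG_LEVEL"),
        "cache_dir": env.get("PME_CACHE_DIR"),
        # Assumed format: comma-separated method names.
        "license_methods": [m.strip() for m in methods.split(",") if m.strip()],
        "output_format": env.get("PME_OUTPUT_FORMAT"),
    }
```

Passing a plain dict instead of `os.environ` keeps the function easy to test.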
Configuration File
Create a config.json:
{
  "api": {
    "clearlydefined": {
      "enabled": true,
      "api_key": null
    }
  },
  "license_detection": {
    "methods": ["regex", "dice_sorensen"],
    "confidence_threshold": 0.85
  },
  "output": {
    "format": "json",
    "pretty_print": true
  }
}
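A file like this is typically layered over built-in defaults. The sketch below shows one way to do that with a recursive merge. It is not UPMEX's loader: `DEFAULTS`, `deep_merge`, and `load_config` are hypothetical names, and the default values are placeholders.

```python
import json
from pathlib import Path

# Placeholder defaults for illustration only.
DEFAULTS = {
    "license_detection": {"methods": ["regex"], "confidence_threshold": 0.85},
    "output": {"format": "json", "pretty_print": False},
}


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where values in override win, merging nested dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config(path: str) -> dict:
    """Load a JSON config file if it exists and merge it over DEFAULTS."""
    file_path = Path(path)
    user = json.loads(file_path.read_text(encoding="utf-8")) if file_path.exists() else {}
    return deep_merge(DEFAULTS, user)
```

With this approach a config file only needs to list the keys it changes.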
Supported Package Types
| Ecosystem | Formats | Detection | Metadata | Online Mode | Tested |
|---|---|---|---|---|---|
| Python | .whl, .tar.gz, .zip | ✓ | ✓ | API enrichment | ✓ |
| NPM | .tgz, .tar.gz | ✓ | ✓ | API enrichment | ✓ |
| Java | .jar, .war, .ear | ✓ | ✓ | Parent POM fetch | ✓ |
| Maven | .jar with POM | ✓ | ✓ | Parent POM fetch | ✓ |
| Gradle | build.gradle(.kts) | ✓ | ✓ | API enrichment | ✓ |
| CocoaPods | .podspec(.json) | ✓ | ✓ | API enrichment | ✓ |
| Conda | .conda, .tar.bz2 | ✓ | ✓ | API enrichment | ✓ |
| Ruby | .gem | ✓ | ✓ | API enrichment | ✓ |
| Rust | .crate | ✓ | ✓ | API enrichment | ✓ |
| Go | .zip, .mod, go.mod | ✓ | ✓ | API enrichment | ✓ |
| NuGet | .nupkg | ✓ | ✓ | API enrichment | ✓ |
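Several formats in the table share an extension (`.tar.gz` for Python and NPM, `.zip` for Python and Go), so a file name alone can only narrow the candidates. The sketch below maps the suffixes from the table to candidate ecosystems. It is illustrative, not UPMEX's detector: the names are hypothetical, Maven is folded into "java", and a real detector would need to inspect archive contents to resolve ambiguous cases.

```python
SUFFIX_CANDIDATES = {
    ".whl": ["python"],
    ".tar.gz": ["python", "npm"],
    ".zip": ["python", "go"],
    ".tgz": ["npm"],
    ".jar": ["java"], ".war": ["java"], ".ear": ["java"],
    ".podspec": ["cocoapods"], ".podspec.json": ["cocoapods"],
    ".conda": ["conda"], ".tar.bz2": ["conda"],
    ".gem": ["ruby"],
    ".crate": ["rust"],
    ".mod": ["go"],
    ".nupkg": ["nuget"],
}
FILENAME_CANDIDATES = {
    "build.gradle": ["gradle"],
    "build.gradle.kts": ["gradle"],
    "go.mod": ["go"],
}


def candidate_ecosystems(path: str) -> list[str]:
    """Return the ecosystems whose formats match the file name, or [] if none."""
    name = path.replace("\\", "/").rsplit("/", 1)[-1].lower()
    if name in FILENAME_CANDIDATES:
        return FILENAME_CANDIDATES[name]
    # Check longer suffixes first so ".podspec.json" is matched as a whole.
    for suffix in sorted(SUFFIX_CANDIDATES, key=len, reverse=True):
        if name.endswith(suffix):
            return SUFFIX_CANDIDATES[suffix]
    return []
```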
Performance
- Small packages (< 1MB): < 500ms
- Medium packages (1-50MB): < 2 seconds
- Large packages (50-500MB): < 10 seconds
- Memory usage: < 100MB for packages under 100MB
Development
# Install development dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/
# Run with coverage
pytest tests/ --cov=upmex
# Format code
black src/ tests/
# Lint code
ruff check src/ tests/
# Type checking
mypy src/
Project Structure
semantic-copycat-upmex/
├── src/upmex/
│ ├── core/ # Core models and orchestrator
│ ├── extractors/ # Package-specific extractors
│ ├── detectors/ # License detection engines
│ ├── api/ # External API integrations
│ └── utils/ # Utility functions
├── tests/ # Test suite
├── templates/ # Configuration templates
└── config/ # Default configurations
Current Status
UPMEX v0.2.0 is feature-complete with advanced license detection and comprehensive testing.
Implemented Features
- Package type detection for all supported formats
- License Detection System:
  - ✅ Regex-based detection for 24+ SPDX identifiers (Issue #1)
  - ✅ Dice-Sørensen coefficient for fuzzy matching (Issue #2)
  - Confidence scoring and detection method tracking
  - Multi-license detection support
- Offline extraction mode (default) with NO-ASSERTION for missing data
- Online mode with:
  - Maven parent POM fetching from Maven Central
  - ClearlyDefined API integration for license data (Issue #6)
  - Ecosyste.ms API integration for metadata enrichment (Issue #7)
- POM header comment parsing for license/author info
- Standardized output across all package types
- CLI interface with JSON and text output formats
- Configuration system with environment variables and JSON files
- Comprehensive test suite with 95+ tests (Issue #9)
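As an illustration of regex-based detection with method tracking, the sketch below matches a few well-known license phrases and reports which method found each identifier. The three patterns and the result shape are assumptions chosen for brevity; they are not UPMEX's 24+ pattern set or its output schema.

```python
import re

# A deliberately small, illustrative pattern table.
LICENSE_PATTERNS = {
    "MIT": re.compile(
        r"\bMIT\s+License\b|\bPermission is hereby granted, free of charge\b", re.I),
    "Apache-2.0": re.compile(
        r"\bApache\s+License,?\s+(?:Version\s+)?2\.0\b", re.I),
    "BSD-3-Clause": re.compile(r"\bBSD[\s-]3[\s-]Clause\b", re.I),
}


def detect_licenses(text: str) -> list[dict]:
    """Return one record per SPDX id whose pattern occurs in the text."""
    return [
        {"spdx_id": spdx_id, "method": "regex"}
        for spdx_id, pattern in LICENSE_PATTERNS.items()
        if pattern.search(text)
    ]
```

Because every matching pattern yields a record, dual-licensed text produces multiple results rather than only the first hit.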
Tested Packages
- Python: requests-2.32.4 (wheel format) - full metadata extraction
- NPM: express-5.1.0 (tgz format) - complete package.json parsing
- Maven: guava-33.4.0-jre (JAR format) - POM extraction with parent fetching
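Parent fetching starts from the `<parent>` coordinates in a POM. The sketch below extracts them and builds the corresponding Maven Central URL using the standard repository layout (group id dots become path separators). It is not taken from UPMEX: the function name is hypothetical, the example coordinates are made up, and POMs that omit the default namespace are not handled.

```python
import xml.etree.ElementTree as ET
from typing import Optional

POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}
MAVEN_CENTRAL = "https://repo1.maven.org/maven2"


def parent_pom_url(pom_xml: str) -> Optional[str]:
    """Return the Maven Central URL of the parent POM, or None if no parent is declared."""
    root = ET.fromstring(pom_xml)
    parent = root.find("m:parent", POM_NS)
    if parent is None:
        return None
    group = (parent.findtext("m:groupId", default="", namespaces=POM_NS) or "").strip()
    artifact = (parent.findtext("m:artifactId", default="", namespaces=POM_NS) or "").strip()
    version = (parent.findtext("m:version", default="", namespaces=POM_NS) or "").strip()
    if not (group and artifact and version):
        return None
    group_path = group.replace(".", "/")
    return f"{MAVEN_CENTRAL}/{group_path}/{artifact}/{version}/{artifact}-{version}.pom"


# Made-up coordinates for illustration.
SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <parent>
    <groupId>com.example.libs</groupId>
    <artifactId>libs-parent</artifactId>
    <version>1.2.3</version>
  </parent>
  <artifactId>child</artifactId>
</project>"""

print(parent_pom_url(SAMPLE_POM))
```

The returned URL can then be fetched in online mode and parsed the same way to walk further up the parent chain.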
Completed Issues
- ✅ Issue #1: Regex-based license detection
- ✅ Issue #2: Dice-Sørensen coefficient
- ✅ Issue #6: ClearlyDefined API integration
- ✅ Issue #7: Ecosyste.ms API integration
- ✅ Issue #9: Comprehensive test suite
Planned
- Fuzzy hash license detection (Issue #3)
- ML-based license classification (Issue #4)
- Performance optimizations for large packages
- GitHub Actions CI/CD pipeline
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
For major changes, please open an issue first to discuss what you would like to change. Please make sure to update tests as appropriate.
Changelog
See CHANGELOG.md for a detailed history of changes.
License
MIT License - see LICENSE file for details.