Detect metadata pitfalls in software repositories

Research Software MetaCheck (a Pitfall/Warning Detection Tool)

This project provides an automated tool for detecting common metadata quality issues (pitfalls and warnings) in software repositories. The tool analyzes SoMEF (Software Metadata Extraction Framework) output files to identify problems in repository metadata files such as codemeta.json, package.json, setup.py, DESCRIPTION, and others.

Overview

MetaCheck identifies 29 different types of metadata quality issues across multiple programming languages (Python, Java, C++, C, R, Rust). These pitfalls range from version mismatches and license template placeholders to broken URLs and improperly formatted metadata fields.

Visit our catalog for details on what each pitfall is, where it is usually detected, and how to fix it.

Supported Pitfall Types

The tool detects the following categories of issues:

  • Version-related pitfalls: Version mismatches between metadata files and releases
  • License-related pitfalls: Template placeholders, copyright-only licenses, missing version specifications
  • URL validation pitfalls: Broken links for CI, software requirements, download URLs
  • Metadata format pitfalls: Improper field formatting, multiple authors in a single field, and similar issues
  • Identifier pitfalls: Invalid or missing unique identifiers, bare DOIs
  • Repository reference pitfalls: Mismatched code repositories, Git shorthand usage
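To make the identifier category concrete, here is a minimal sketch of how a "bare DOI" could be told apart from a resolvable DOI URL. This is an illustration only, not MetaCheck's actual detection logic: the function name `classify_identifier`, the regular expression, and the sample DOI are all assumptions made for this example.

```python
import re

# Assumed pattern: "10.", a 4-9 digit registrant code, "/", then a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# URL prefixes that make a DOI directly resolvable.
RESOLVER_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
)


def classify_identifier(value: str) -> str:
    """Return 'resolvable', 'bare_doi', or 'other' for an identifier string."""
    value = value.strip()
    if value.lower().startswith(RESOLVER_PREFIXES):
        return "resolvable"
    if DOI_PATTERN.match(value):
        return "bare_doi"
    return "other"


# Placeholder identifiers, used only to exercise the three branches.
for candidate in [
    "10.5281/zenodo.1234567",
    "https://doi.org/10.5281/zenodo.1234567",
    "v1.0.0",
]:
    print(candidate, "->", classify_identifier(candidate))
```

A bare DOI is still a valid identifier, but it cannot be followed as a link, which is why it is worth flagging separately from a missing or invalid one.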

Requirements

  • Python 3.11
  • Required Python packages:
    • requests (for URL validation)
    • pathlib (built-in)
    • json (built-in)
    • re (built-in)
    • somef (for extracting metadata from the repositories)

Installation

Using Poetry (Recommended)

  1. Clone the repository:

    git clone https://github.com/SoftwareUnderstanding/RsMetaCheck.git
    cd RsMetaCheck
    
  2. Install with Poetry:

    poetry install
    
  3. Configure SoMEF (optional but recommended): The installation process initially runs somef configure -a to set SoMEF up automatically and install the necessary packages, but the resulting API rate limit is low. If you need a higher limit, reconfigure SoMEF with the following command:

    poetry run somef configure
    

    Then add your GitHub authentication token to avoid API rate limits when analyzing repositories in batches.

Using pip

Alternatively, you can install directly from GitHub:

pip install git+https://github.com/SoftwareUnderstanding/RsMetaCheck.git

Usage

GitHub Action

RsMetaCheck can be integrated into your CI/CD pipelines as a GitHub Action. The action lives in the rs-metacheck-action repository and is published on the GitHub Marketplace as rsmetacheck actions.

The action will generate all_pitfalls_results.json, along with the pitfalls/ and somef_outputs/ directories directly in your workflow workspace.

Run the Detection Tool locally

Analyze a Single Repository

poetry run rsmetacheck --input https://github.com/tidyverse/tidyverse

Analyze a Specific Branch

You can analyze a specific branch of a repository by using the --branch or -b flag:

poetry run rsmetacheck --input https://github.com/tidyverse/tidyverse --branch develop

Analyze Multiple Repositories from a JSON File

poetry run rsmetacheck --input repositories.json

The repositories.json file should be structured as follows:

{
  "repositories": [
    "https://gitlab.com/example/example_repo_1",
    "https://gitlab.com/example/example_repo_2",
    "https://github.com/example/example_repo_3"
  ]
}
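If the list of repositories comes from another script or a spreadsheet export, the file can be generated rather than written by hand. The sketch below assumes only the structure shown above (a single "repositories" key holding a list of URLs); the helper name `write_repository_list` is hypothetical and not part of RsMetaCheck.

```python
import json
from pathlib import Path


def write_repository_list(urls, path="repositories.json"):
    """Write URLs in the {"repositories": [...]} layout and return the payload."""
    # Strip whitespace, drop empty entries, and remove duplicates while keeping order.
    unique_urls = list(dict.fromkeys(u.strip() for u in urls if u.strip()))
    payload = {"repositories": unique_urls}
    Path(path).write_text(json.dumps(payload, indent=2) + "\n", encoding="utf-8")
    return payload


if __name__ == "__main__":
    # Placeholder URLs taken from the example above.
    write_repository_list(
        [
            "https://gitlab.com/example/example_repo_1",
            "https://gitlab.com/example/example_repo_2",
            "https://github.com/example/example_repo_3",
        ]
    )
```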

Customize Output Paths

poetry run rsmetacheck --input repositories.json \
  --somef-output ./results/somef \
  --pitfalls-output ./results/pitfalls \
  --analysis-output ./results/summary.json
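When the tool is driven from a Python script, for example to keep each run in its own results folder, the same command can be assembled as an argument list. This is a sketch under assumptions: the helper `build_rsmetacheck_command` is hypothetical, and it uses only the flags shown above.

```python
import shlex
from pathlib import Path


def build_rsmetacheck_command(input_path, results_dir):
    """Return the argument list for one run with all outputs under results_dir."""
    results = Path(results_dir)
    return [
        "poetry", "run", "rsmetacheck",
        "--input", str(input_path),
        "--somef-output", str(results / "somef"),
        "--pitfalls-output", str(results / "pitfalls"),
        "--analysis-output", str(results / "summary.json"),
    ]


command = build_rsmetacheck_command("repositories.json", "results")
# Print the command in shell-quoted form for inspection before running it.
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would execute it; building a list instead of a single string avoids shell quoting problems with paths that contain spaces.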

Skip SoMEF and Analyze Existing Outputs

If you've already run SoMEF separately:

poetry run rsmetacheck --skip-somef --input somef_outputs/*.json

Or for multiple paths:

poetry run rsmetacheck --skip-somef --input my_somef_outputs_1/*.json my_somef_outputs_2/*.json

Verbose Output for Passed Checks

By default, the JSON-LD files generated by RsMetaCheck will only contain information about pitfalls and warnings that were actually detected. If you want to include all tests in the final JSON-LD, even tests that the repository successfully passed, use the --verbose flag:

poetry run rsmetacheck --input https://github.com/tidyverse/tidyverse --verbose

Output

The tool will:

  • Process all JSON files in the SoMEF output directory (by default somef_outputs, created by the tool)
  • Display progress messages showing detected pitfalls
  • Generate JSON-LD files detailing the detected pitfalls and warnings, named output_1_pitfalls.jsonld, output_2_pitfalls.jsonld, and so on, in the pitfalls directory (created by the tool by default)
  • Generate a comprehensive report in all_pitfalls_results.json

The output file contains:

  • EVERSE standardized JSON-LD output of each repository
  • Summary statistics of analyzed repositories
  • Count and percentage for each pitfall type
  • Language-specific breakdown for repositories with target languages
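A quick way to check what a run produced is to load the report and list the per-repository files. The sketch below relies only on the default file and directory names mentioned above; it makes no assumption about the keys inside the report and simply lists whatever top-level keys it finds. The helper name `summarize_outputs` is hypothetical.

```python
import json
from pathlib import Path


def summarize_outputs(report_path="all_pitfalls_results.json", pitfalls_dir="pitfalls"):
    """Return the report's top-level keys and the names of the JSON-LD files."""
    report = json.loads(Path(report_path).read_text(encoding="utf-8"))
    # The report layout is not assumed here; only list keys if it is a JSON object.
    top_level_keys = sorted(report) if isinstance(report, dict) else []
    jsonld_files = sorted(p.name for p in Path(pitfalls_dir).glob("*.jsonld"))
    return {"top_level_keys": top_level_keys, "jsonld_files": jsonld_files}
```

After a run with default paths, `print(summarize_outputs())` shows which sections the report contains and how many repositories produced a JSON-LD file.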

Troubleshooting

Common Issues

  1. "There is no valid repository URL" error: Ensure that the JSON file listing the repositories has a valid structure and that you are passing the correct path to it
  2. Network timeouts: Some pitfall checks validate URLs and may time out; this is normal behavior
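To see why a timeout is an expected outcome rather than a failure of the tool, consider how a URL check with a time limit behaves. The sketch below uses Python's standard urllib instead of the requests package the tool depends on, and it is not MetaCheck's implementation; the function name `check_url`, the status labels, and the 10-second default are assumptions for illustration.

```python
import socket
import urllib.error
import urllib.request


def check_url(url: str, timeout: float = 10.0) -> str:
    """Classify a URL as 'ok', 'broken', 'timeout', 'unreachable', or 'invalid'."""
    try:
        # HEAD avoids downloading the body; some servers reject it, which this sketch ignores.
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout):
            return "ok"
    except ValueError:
        # Raised for strings that are not URLs at all (no scheme).
        return "invalid"
    except urllib.error.HTTPError:
        # The server answered with a 4xx or 5xx status.
        return "broken"
    except urllib.error.URLError as error:
        # Connection-level failure; separate slow servers from DNS or refusal errors.
        if isinstance(error.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return "unreachable"
    except (socket.timeout, TimeoutError):
        # A timeout while reading the response surfaces without the URLError wrapper.
        return "timeout"
```

A slow server and a dead link are different outcomes, so a result of "timeout" for a given URL is worth rechecking before treating the link as broken.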

Performance Notes

  • URL validation pitfalls may take longer due to network requests
  • Large datasets may require several minutes to complete analysis
  • Progress is displayed in real-time showing which pitfalls are found

Contributing

The system is designed with modularity in mind. Each pitfall detector is implemented as a separate module in the scripts/ directory, making it easy to add new pitfall types or modify existing detection logic.
