
Detect metadata pitfalls in software repositories

Research Software MetaCheck (a Pitfall/Warning Detection Tool)

This project provides an automated tool for detecting common metadata quality issues (pitfalls and warnings) in software repositories. The tool analyzes SoMEF (Software Metadata Extraction Framework) output files to identify problems in repository metadata files such as codemeta.json, package.json, setup.py, DESCRIPTION, and others.

Overview

MetaCheck identifies 29 different types of metadata quality issues across multiple programming languages (Python, Java, C++, C, R, Rust). These pitfalls range from version mismatches and license template placeholders to broken URLs and improperly formatted metadata fields.

Supported Pitfall Types

The tool detects the following categories of issues:

  • Version-related pitfalls: Version mismatches between metadata files and releases
  • License-related pitfalls: Template placeholders, copyright-only licenses, missing version specifications
  • URL validation pitfalls: Broken links for CI, software requirements, download URLs
  • Metadata format pitfalls: Improper field formatting, multiple authors in a single field, and similar issues
  • Identifier pitfalls: Invalid or missing unique identifiers, bare DOIs
  • Repository reference pitfalls: Mismatched code repositories, Git shorthand usage
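
To make two of these categories concrete, here is a minimal illustrative sketch. It is not RsMetaCheck's actual detection logic. It assumes that a "bare DOI" means a DOI written without the https://doi.org/ resolver prefix, and that license template placeholders look like the bracketed tokens found in common license templates. The function names and both patterns are hypothetical.

```python
import re

# Assumption: a bare DOI looks like "10.<registrant>/<suffix>" with no URL prefix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Assumption: placeholders resemble tokens left in common license templates.
PLACEHOLDER_PATTERN = re.compile(
    r"\[(?:year|yyyy|fullname|name of copyright owner)\]"
    r"|<(?:year|owner|copyright holders?)>",
    re.IGNORECASE,
)


def is_bare_doi(identifier: str) -> bool:
    """Return True when the identifier is a DOI without a resolver URL."""
    return bool(DOI_PATTERN.match(identifier.strip()))


def has_license_placeholder(license_text: str) -> bool:
    """Return True when the license text still contains a template token."""
    return bool(PLACEHOLDER_PATTERN.search(license_text))
```

With these definitions, is_bare_doi("10.5281/zenodo.1234567") is true, while the same DOI written as a full https://doi.org/ URL is not flagged.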

Requirements

  • Python 3.11
  • Required Python packages:
    • requests (for URL validation)
    • pathlib (built-in)
    • json (built-in)
    • re (built-in)
    • somef (for extracting metadata from repositories)

Installation

Using Poetry (Recommended)

  1. Clone the repository:

    git clone https://github.com/SoftwareUnderstanding/RsMetaCheck.git
    cd RsMetaCheck
    
  2. Install with Poetry:

    poetry install
    
  3. Configure SoMEF (optional but recommended): The installation process initially runs somef configure -a to set up SoMEF automatically and install the necessary packages, but the resulting API rate limit is low. If you need a higher limit, reconfigure SoMEF with the following command:

    poetry run somef configure
    

    Then add your GitHub authentication token to avoid API rate limits when analyzing repositories in batches.

Using pip

Alternatively, you can install directly from GitHub:

pip install git+https://github.com/SoftwareUnderstanding/RsMetaCheck.git

Usage

GitHub Action

RsMetaCheck can be easily integrated into your CI/CD pipelines as a GitHub Action.

name: RsMetaCheck

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  check-metadata:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Run RsMetaCheck
        uses: SoftwareUnderstanding/RsMetaCheck@v0.2.1 # Update to the latest version tag
        with:
          # Optional: Include passed checks in output (defaults to false)
          verbose: "false"
        env:
          # Optional: Provide token for SoMEF API rate limits
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

The action will generate all_pitfalls_results.json, along with the pitfalls/ and somef_outputs/ directories directly in your workflow workspace.

Run the Detection Tool locally

Analyze a Single Repository

poetry run rsmetacheck --input https://github.com/tidyverse/tidyverse

Analyze a Specific Branch

You can analyze a specific branch of a repository by using the --branch or -b flag:

poetry run rsmetacheck --input https://github.com/tidyverse/tidyverse --branch develop

Analyze Multiple Repositories from a JSON File

poetry run rsmetacheck --input repositories.json

The repositories.json file should be structured as follows:

{
  "repositories": [
    "https://gitlab.com/example/example_repo_1",
    "https://gitlab.com/example/example_repo_2",
    "https://github.com/example/example_repo_3"
  ]
}
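
If the list of repositories comes from another source, the file can be generated rather than written by hand. The following is a minimal sketch, assuming only the {"repositories": [...]} layout shown above. The helper name write_repository_list is hypothetical and not part of RsMetaCheck.

```python
import json
from pathlib import Path


def write_repository_list(urls, path="repositories.json"):
    """Write repository URLs in the {"repositories": [...]} layout."""
    # Strip whitespace, drop empty entries, and remove duplicates
    # while keeping the original order.
    cleaned = list(dict.fromkeys(u.strip() for u in urls if u.strip()))
    payload = {"repositories": cleaned}
    Path(path).write_text(json.dumps(payload, indent=2) + "\n", encoding="utf-8")
    return payload
```

The resulting file can then be passed to the tool with --input repositories.json, as in the command above.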

Customize Output Paths

poetry run rsmetacheck --input repositories.json \
  --somef-output ./results/somef \
  --pitfalls-output ./results/pitfalls \
  --analysis-output ./results/summary.json

Skip SoMEF and Analyze Existing Outputs

If you've already run SoMEF separately:

poetry run rsmetacheck --skip-somef --input somef_outputs/*.json

Or for multiple paths:

poetry run rsmetacheck --skip-somef --input my_somef_outputs_1/*.json my_somef_outputs_2/*.json
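
The wildcard patterns above are expanded by the shell. When calling the tool from a script, or in an environment where the shell does not expand wildcards, the file list can be built explicitly. This is a minimal sketch that uses only the --skip-somef and --input flags shown above. The helper name is hypothetical, and it assumes the rsmetacheck command is on the PATH (for a Poetry setup, prefix the command with "poetry", "run").

```python
import glob


def build_skip_somef_command(patterns):
    """Expand glob patterns and build the rsmetacheck argument list."""
    files = []
    for pattern in patterns:
        # Sort each expansion so the command is reproducible.
        files.extend(sorted(glob.glob(pattern)))
    if not files:
        raise FileNotFoundError(f"no SoMEF output files match {patterns!r}")
    return ["rsmetacheck", "--skip-somef", "--input", *files]
```

The returned list can be passed to subprocess.run(command, check=True) to run the analysis.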

Verbose Output for Passed Checks

By default, the JSON-LD files generated by RsMetaCheck will only contain information about pitfalls and warnings that were actually detected. If you want to include all tests in the final JSON-LD, even tests that the repository successfully passed, use the --verbose flag:

poetry run rsmetacheck --input https://github.com/tidyverse/tidyverse --verbose

Output

The tool will:

  • Process all JSON files in the SoMEF output directory (by default, somef_outputs, which the tool creates)
  • Display progress messages showing detected pitfalls
  • Generate JSON-LD files describing the detected pitfalls and warnings (output_1_pitfalls.jsonld, output_2_pitfalls.jsonld, and so on) in the pitfalls directory (created by the tool by default)
  • Generate a comprehensive report in all_pitfalls_results.json

The output file contains:

  • EVERSE standardized JSON-LD output of each repository
  • Summary statistics of analyzed repositories
  • Count and percentage for each pitfall type
  • Language-specific breakdown for repositories with target languages
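
The generated files can be inspected with a few lines of Python. The sketch below assumes only the default file and directory names mentioned above (pitfalls/ and all_pitfalls_results.json). It makes no assumption about the internal schema of the reports, so it simply loads each file and lists the top-level keys of the summary. Both function names are hypothetical.

```python
import json
from pathlib import Path


def load_pitfall_reports(pitfalls_dir="pitfalls"):
    """Return {filename: parsed JSON-LD} for every *.jsonld file found."""
    reports = {}
    for path in sorted(Path(pitfalls_dir).glob("*.jsonld")):
        with path.open(encoding="utf-8") as handle:
            reports[path.name] = json.load(handle)
    return reports


def top_level_keys(summary_path="all_pitfalls_results.json"):
    """List the summary's top-level keys without assuming its schema."""
    with open(summary_path, encoding="utf-8") as handle:
        data = json.load(handle)
    # A non-object summary has no keys to report.
    return sorted(data) if isinstance(data, dict) else []
```

Printing top_level_keys() is a quick way to see how the summary report is organized before writing code that depends on specific fields.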

Troubleshooting

Common Issues

  1. "There is no valid repository URL" error: Ensure that the JSON file containing the repositories has a valid structure and that you are passing the correct path
  2. Network timeouts: Some pitfalls validate URLs and may time out. This is normal behavior

Performance Notes

  • URL validation pitfalls may take longer due to network requests
  • Large datasets may require several minutes to complete analysis
  • Progress is displayed in real-time showing which pitfalls are found
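
To illustrate why URL validation can be slow and why a timeout is not necessarily a broken link, here is a minimal sketch of a URL check with an explicit timeout. It is not the tool's implementation: RsMetaCheck lists requests as its dependency for URL validation, while this sketch uses the standard library's urllib to stay dependency-free. The function name, the 10-second default, and the three result labels are assumptions.

```python
import urllib.error
import urllib.request


def check_url(url, timeout=10.0, opener=urllib.request.urlopen):
    """Classify a URL as "ok", "broken", or "timeout"."""
    try:
        # A HEAD request avoids downloading the response body.
        request = urllib.request.Request(url, method="HEAD")
        with opener(request, timeout=timeout) as response:
            return "ok" if response.status < 400 else "broken"
    except TimeoutError:
        return "timeout"
    except urllib.error.HTTPError:
        # The server answered with an error status such as 404.
        return "broken"
    except urllib.error.URLError as error:
        # URLError wraps lower-level failures, including timeouts.
        return "timeout" if isinstance(error.reason, TimeoutError) else "broken"
    except ValueError:
        # Raised for strings that are not valid URLs.
        return "broken"
```

The opener parameter exists only so the function can be exercised without network access. Note that some servers reject HEAD requests, so a production check might retry with GET.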

Contributing

The system is designed with modularity in mind. Each pitfall detector is implemented as a separate module in the scripts/ directory, making it easy to add new pitfall types or modify existing detection logic.
