
CLI and library for extracting maSMP/CODEMETA metadata (and sources) from code repositories


comet-rs

CLI and Python library for extracting maSMP / CODEMETA metadata (plus per‑property sources and confidence) from GitHub and GitLab repositories.

Given a repository URL, comet-rs:

  • Calls the platform API (GitHub / GitLab)
  • Parses files like CITATION.cff, LICENSE, and README.md
  • Optionally enriches with external services (OpenAlex, archives)
  • Builds a maSMP or CODEMETA JSON‑LD document
  • Tracks, for each property, which source set it and with what confidence
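The per-property provenance idea in the last step can be illustrated with a small sketch. This is not comet-rs's internal code: the `Candidate` class, the `pick_best` helper, the `platform_api` source label, and the confidence values are all hypothetical, chosen only to show how several sources might propose a value and the most confident one win.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Candidate:
    """One proposed value for a metadata property (hypothetical structure)."""
    value: Any
    source: str        # which source proposed the value
    confidence: float  # illustrative score between 0 and 1


def pick_best(candidates: list[Candidate]) -> Optional[Candidate]:
    """Return the candidate with the highest confidence, or None if there are none."""
    return max(candidates, key=lambda c: c.confidence, default=None)


# Two sources propose the same property; the labels and scores are made up.
candidates = [
    Candidate("Example Project", source="platform_api", confidence=0.80),
    Candidate("Example Project", source="citation_cff", confidence=0.93),
]

best = pick_best(candidates)
print(best.source, best.confidence)
```

Keeping the source and confidence next to each value is what makes it possible to report, per property, where a value came from.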

Installation

pip install comet-rs

Python 3.10+ is required.


CLI usage

Extract full metadata

comet-rs extract https://github.com/zbmed-semtec/maSMP-metadata-extraction maSMP --with-enrichment

Outputs JSON with:

  • schema: maSMP or CODEMETA
  • code_url: repository URL
  • results: JSON‑LD document
  • enriched_metadata: per‑property source / confidence / category (for maSMP)
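If you want to drive the CLI from a script, one possible approach is to run it as a subprocess and parse its output. The sketch below assumes that the command prints a single JSON document to standard output and nothing else, which the description above suggests but does not guarantee; `build_extract_command` and `run_extract` are hypothetical helper names.

```python
import json
import subprocess


def build_extract_command(repo_url: str, schema: str = "maSMP",
                          with_enrichment: bool = True) -> list[str]:
    """Build the argument list for `comet-rs extract` as shown above."""
    cmd = ["comet-rs", "extract", repo_url, schema]
    if with_enrichment:
        cmd.append("--with-enrichment")
    return cmd


def run_extract(repo_url: str, schema: str = "maSMP",
                with_enrichment: bool = True) -> dict:
    """Run the CLI and parse its output (assumes pure JSON on stdout)."""
    completed = subprocess.run(
        build_extract_command(repo_url, schema, with_enrichment),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)


print(build_extract_command("https://github.com/owner/repo"))
```

Passing the arguments as a list avoids shell quoting issues with repository URLs.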

Extract a single property (value + source)

comet-rs extract_property https://github.com/zbmed-semtec/maSMP-metadata-extraction author

Example output:

{
  "property_name": "author",
  "property_value": [
    {
      "@type": "Person",
      "familyName": "",
      "givenName": "Daniel",
      "@id": "https://orcid.org/0000-0003-0454-7145"
    }
  ],
  "source": "citation_cff",
  "confidence": 0.93
}
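A result like the one above is plain JSON, so it can be post-processed with the standard library. The following sketch parses that example output and pulls out display names and ORCID identifiers; the `display_name` helper is hypothetical, and the field names are taken from the example rather than from a documented schema.

```python
import json

# The example output shown above, embedded as a string.
raw = """
{
  "property_name": "author",
  "property_value": [
    {
      "@type": "Person",
      "familyName": "",
      "givenName": "Daniel",
      "@id": "https://orcid.org/0000-0003-0454-7145"
    }
  ],
  "source": "citation_cff",
  "confidence": 0.93
}
"""

result = json.loads(raw)


def display_name(person: dict) -> str:
    """Join given and family name, skipping parts that are empty or missing."""
    parts = [person.get("givenName", ""), person.get("familyName", "")]
    return " ".join(part for part in parts if part)


names = [display_name(p) for p in result["property_value"]]
orcids = [p["@id"] for p in result["property_value"] if "@id" in p]

print(names, orcids, result["source"], result["confidence"])
```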

By default, extract_property uses the maSMP schema. To use CODEMETA:

comet-rs extract_property https://github.com/owner/repo name --schema CODEMETA

Compute a FAIRness assessment

comet-rs fairness https://github.com/zbmed-semtec/maSMP-metadata-extraction maSMP

Outputs JSON with:

  • schema: maSMP or CODEMETA
  • code_url: repository URL
  • results: JSON‑LD document used for the assessment
  • fairness: full FAIRness report (overall score, per‑principle scores, and indicator details)

Authentication & rate limits

For public repositories you can often run without a token, but GitHub and GitLab apply rate limits. For heavier use or private repos, set:

export GITHUB_TOKEN=ghp_...      # for github.com URLs
export GITLAB_TOKEN=glpat_...    # for gitlab.com URLs

comet-rs automatically picks the right token based on the repository URL, or you can pass --token explicitly:

comet-rs extract https://gitlab.com/owner/repo maSMP --token glpat_...

Tokens only need minimal read scopes (repo / read:org on GitHub, read_api / read_repository on GitLab).
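The URL-based token selection described above can be approximated in your own scripts. The sketch below is not comet-rs's actual implementation: `token_for` and `TOKEN_ENV_BY_HOST` are hypothetical names, and the mapping only covers the two hosts and environment variables mentioned in this section.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlparse

# Host-to-variable mapping, taken from the two export lines above.
TOKEN_ENV_BY_HOST = {
    "github.com": "GITHUB_TOKEN",
    "gitlab.com": "GITLAB_TOKEN",
}


def token_for(repo_url: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the token matching the repository host, or None if there is none."""
    host = (urlparse(repo_url).hostname or "").lower()
    env_name = TOKEN_ENV_BY_HOST.get(host)
    return env.get(env_name) if env_name else None


# An explicit mapping is used here so the example does not depend on your shell.
print(token_for("https://gitlab.com/owner/repo", {"GITLAB_TOKEN": "glpat_example"}))
```

Returning None for an unknown host or an unset variable lets the caller fall back to unauthenticated access for public repositories.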


Python API

You can also call the extractor directly from Python using the comet_rs package.

Full extraction

import os

import comet_rs

jsonld_document, enriched = comet_rs.extract_metadata(
    "https://github.com/zbmed-semtec/maSMP-metadata-extraction",
    schema="maSMP",                              # or "CODEMETA"
    token=os.getenv("GITHUB_TOKEN"),            # or GITLAB_TOKEN for GitLab
    with_enrichment=True,                       # False for JSON‑LD only
)

# jsonld_document: maSMP/CODEMETA JSON‑LD (dict)
# enriched: per‑property source/confidence/category (or None)
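Because jsonld_document is described as a plain dict, it can be written to disk with the standard json module. The sketch below uses a small stand-in dict instead of a real extraction result so that it runs without network access; `save_jsonld` is a hypothetical helper, not part of comet_rs.

```python
import json
import tempfile
from pathlib import Path


def save_jsonld(document: dict, path: Path) -> Path:
    """Write a JSON-LD dict as UTF-8 JSON and return the path."""
    text = json.dumps(document, indent=2, ensure_ascii=False)
    path.write_text(text, encoding="utf-8")
    return path


# Stand-in for the dict returned by extract_metadata (contents are made up).
jsonld_document = {"@type": "SoftwareSourceCode", "name": "example-project"}

with tempfile.TemporaryDirectory() as tmp:
    out = save_jsonld(jsonld_document, Path(tmp) / "metadata.jsonld")
    print(out.read_text(encoding="utf-8"))
```

Setting ensure_ascii=False keeps non-ASCII author names readable in the file.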

Extract a single property in Python

import os

import comet_rs

extracted_at, matches = comet_rs.extract_property(
    "https://github.com/zbmed-semtec/maSMP-metadata-extraction",
    "author",                     # JSON-LD key or entity field name
    schema="maSMP",               # or "CODEMETA"
    token=os.getenv("GITHUB_TOKEN"),
)

for match in matches:
    print("Profile:", match["profile"])
    print("Value:", match["value"])
    print("Source:", match.get("source"))
    print("Confidence:", match.get("confidence"))
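When a property has several matches, you may want to keep only those above a confidence threshold. The sketch below works on sample dicts with the same keys the loop above reads (profile, value, source, confidence). The profile names, values, and the 0.8 threshold are illustrative assumptions, and `confident_matches` is a hypothetical helper.

```python
def confident_matches(matches: list[dict], threshold: float = 0.8) -> list[dict]:
    """Keep matches whose confidence is present and at least `threshold`."""
    return [
        m for m in matches
        if (m.get("confidence") or 0.0) >= threshold  # missing or None counts as 0
    ]


# Sample data shaped like the matches above; the contents are made up.
sample_matches = [
    {"profile": "ProfileA", "value": "Daniel", "source": "citation_cff", "confidence": 0.93},
    {"profile": "ProfileB", "value": "daniel", "source": None, "confidence": None},
    {"profile": "ProfileC", "value": "D.", "confidence": 0.40},
]

kept = confident_matches(sample_matches)
print([m["profile"] for m in kept])
```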

FAIRness assessment in Python

import os

import comet_rs

jsonld_document, fairness_report = comet_rs.assess_fairness(
    "https://github.com/zbmed-semtec/maSMP-metadata-extraction",
    schema="maSMP",               # or "CODEMETA"
    token=os.getenv("GITHUB_TOKEN"),
)

print("Overall score:", fairness_report.overall_score)
print("Findable score:", fairness_report.findable.score)
print("Accessible score:", fairness_report.accessible.score)
print("Interoperable score:", fairness_report.interoperable.score)
print("Reusable score:", fairness_report.reusable.score)
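To see at a glance which FAIR principle needs the most attention, the per-principle scores can be collected into a dict. The sketch below relies only on the attribute names used in the print statements above and uses a SimpleNamespace stand-in instead of a real report; the score values are made up, and `principle_scores` and `weakest_principle` are hypothetical helpers.

```python
from types import SimpleNamespace

PRINCIPLES = ("findable", "accessible", "interoperable", "reusable")


def principle_scores(report) -> dict:
    """Map each FAIR principle name to its score attribute on the report."""
    return {name: getattr(report, name).score for name in PRINCIPLES}


def weakest_principle(report) -> str:
    """Return the name of the principle with the lowest score."""
    scores = principle_scores(report)
    return min(scores, key=scores.get)


# Stand-in object with the same attribute layout as in the example above.
fake_report = SimpleNamespace(
    overall_score=0.7,
    findable=SimpleNamespace(score=0.9),
    accessible=SimpleNamespace(score=0.5),
    interoperable=SimpleNamespace(score=0.6),
    reusable=SimpleNamespace(score=0.8),
)

print(principle_scores(fake_report))
print(weakest_principle(fake_report))
```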

Project links & docs

  • Source code: GitHub / GitLab repository where comet-rs is developed
  • Backend architecture and development docs:
    • README.md in the repo root (architecture & local FastAPI server)
    • docs/DEVELOPER_GUIDE.md
    • docs/ADDING_NEW_PLATFORM.md

Use those documents if you want to contribute, run the FastAPI backend locally, or add support for new code hosting platforms.
