
Software Metadata Extraction Framework (SOMEF)

A command line interface for automatically extracting relevant information from readme files.

Authors: Daniel Garijo, Allen Mao, Haripriya Dharmala, Vedant Diwanji, Jiaying Wang and Aidan Kelley.

Contributors: Miguel Ángel García Delgado.

Features

Given a readme file (or a GitHub repository) SOMEF will extract the following categories (if present):

  • Name: Name identifying a software component
  • Full name: Name + owner (owner/name)
  • Full title: If the repository has a short name, we will attempt to extract the longer version of the repository name
  • Description: A description of what the software does.
  • Citation: Preferred citation (usually in .bib form) as the authors have stated in their readme file.
  • Installation instructions: A set of instructions that indicate how to install a target repository
  • Invocation: Execution command(s) needed to run a scientific software component
  • Usage examples: Assumptions and considerations recorded by the authors when executing a software component, or examples on how to use it.
  • Documentation: Where to find additional documentation about a software component.
  • Requirements: Pre-requisites and dependencies needed to execute a software component.
  • Contributors: Contributors to a software component
  • FAQ: Frequently asked questions about a software component
  • Support: Guidelines and links of where to obtain support for a software component
  • License: License and usage terms of a software component
  • Contact: Contact person responsible for maintaining a software component
  • Download URL: URL where to download the target software (typically the installer, package or a tarball to a stable version)
  • DOI: Digital Object Identifier associated with the software (if any)
  • DockerFile: Build file to create a Docker image for the target software
  • Notebooks: Jupyter notebooks included in a repository
  • Executable notebooks: Jupyter notebooks ready for execution (e.g., through myBinder)
  • Owner: Name of the user or organization in charge of the repository
  • Owner type: Type of the owner, user or organization, of the repository
  • Keywords: Set of terms commonly used to identify a software component
  • Source code: Link to the source code (typically the repository where the readme can be found)
  • Releases: Pointer to the available versions of a software component
  • Changelog: Description of the changes between versions
  • Issue tracker: Link where to open issues for the target repository
  • Programming languages: Languages used in the repository
  • Acknowledgements: People or institutions that the authors would like to acknowledge in their software component
  • Repository Status: Repository status as it is described in repostatus.org
  • arXiv links: Links to arXiv articles
  • Stargazers count: Total number of stargazers of the project
  • Forks count: Number of forks of the project
  • Forks url: Links to forks made of the project
  • Code of Conduct: Link to the code of conduct of the project
  • Script: Snippets of code contained in the readme file

We use supervised classifiers, header analysis, regular expressions and the GitHub API to retrieve these fields. More than one technique may be used for each field.
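To give a feel for two of the techniques named above (header analysis and regular expressions), here is a minimal Python sketch. It is not SOMEF's implementation: the keyword table, both patterns and the function names `classify_headers` and `find_dois` are assumptions made up for this illustration, and it only handles Markdown `#` headers.

```python
import re

# Hypothetical keyword table for illustration; SOMEF's real mapping is not shown here.
HEADER_KEYWORDS = {
    "installation": ("install", "setup"),
    "citation": ("cite", "citation"),
    "requirements": ("requirement", "dependenc", "prerequisite"),
    "license": ("license",),
}

# Markdown ATX header: 1-6 '#' characters, a space, then the title.
HEADER_RE = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t#]*$", re.MULTILINE)
# Rough DOI pattern (an assumption, not an exhaustive DOI grammar).
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>{}]+')


def classify_headers(readme_text):
    """Group header titles under the first category whose keyword they contain."""
    found = {}
    for match in HEADER_RE.finditer(readme_text):
        title = match.group(1)
        lowered = title.lower()
        for category, keywords in HEADER_KEYWORDS.items():
            if any(keyword in lowered for keyword in keywords):
                found.setdefault(category, []).append(title)
                break
    return found


def find_dois(readme_text):
    """Return every substring that looks like a DOI."""
    return DOI_RE.findall(readme_text)
```

A real extractor would combine signals like these with classifier output and repository metadata; the sketch only shows why a single field can be reached by more than one technique.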

Documentation

See full documentation at https://somef.readthedocs.io/en/latest/

Cite SOMEF:

@INPROCEEDINGS{9006447, 
author={A. {Mao} and D. {Garijo} and S. {Fakhraei}}, 
booktitle={2019 IEEE International Conference on Big Data (Big Data)}, 
title={SoMEF: A Framework for Capturing Scientific Software Metadata from its Documentation}, 
year={2019},
doi={10.1109/BigData47090.2019.9006447}, 
url={http://dgarijo.com/papers/SoMEF.pdf},
pages={3032-3037}
} 

Requirements

  • Python 3.9

Install from GitHub

To run SOMEF, follow these steps:

Clone this GitHub repository

git clone https://github.com/KnowledgeCaptureAndDiscovery/somef.git

Install somef (you should be in the folder that you just cloned). Note that for Python 3.7 and 3.8 the Cython module should be installed in advance (with the command: pip install Cython).

cd somef
pip install -e .

Test SOMEF installation

somef --help

If everything goes fine, you should see:

Usage: somef [OPTIONS] COMMAND [ARGS]...

Options:
  -h, --help  Show this message and exit.

Commands:
  configure  Configure credentials
  describe   Running the Command Line Interface
  version    Show somef version.

Installing through Docker

We provide a Docker image with SOMEF already installed. To run through Docker, you may build the Dockerfile provided in the repository by running:

docker build -t somef .

Or just use the Docker image already built in DockerHub:

docker pull kcapd/somef

Then, to run your image just type:

docker run -it kcapd/somef /bin/bash

And you will be ready to use SOMEF (see section below). If you want to have access to the results we recommend mounting a volume. For example, the following command will mount the current directory as the out folder in the Docker image:

docker run -it --rm -v $PWD/:/out kcapd/somef /bin/bash

If you move any files produced by somef into /out, then you will be able to see them in your current directory.

Usage

Configure

Before running SOMEF, you must configure it appropriately. Run

somef configure

You will then be asked for the GitHub credentials and the classifiers file path.

If you want somef to be configured automatically (without a GitHub authentication key and using the default classifiers), just type:

somef configure -a

To show help about the available options, run:

somef configure --help

Which displays:

Usage: somef configure [OPTIONS]

  Configure GitHub credentials and classifiers file path

Options:
  -a, --auto  Automatically configure SOMEF
  -h, --help  Show this message and exit.

Run SOMEF

$ somef describe --help
  SOMEF Command Line Interface
Usage: somef describe [OPTIONS]

  Running the Command Line Interface

Options:
  -t, --threshold FLOAT           Threshold to classify the text  [required]
  Input: [mutually_exclusive, required]
    -r, --repo_url URL            Github Repository URL
    -d, --doc_src PATH            Path to the README file source
    -i, --in_file PATH            A file of newline separated links to GitHub
                                  repositories

  Output: [required_any]
    -o, --output PATH             Path to the output file. If supplied, the
                                  output will be in JSON

    -c, --codemeta_out PATH       Path to an output codemeta file
    -g, --graph_out PATH          Path to the output Knowledge Graph export
                                  file. If supplied, the output will be a
                                  Knowledge Graph, in the format given in the
                                  --format option chosen (turtle, json-ld)

  -f, --graph_format [turtle|json-ld]
                                  If the --graph_out option is given, this is
                                  the format that the graph will be stored in

  -p, --pretty                    Pretty print the JSON output file so that it
                                  is easy to compare to another JSON output
                                  file.

  -m, --missing                   JSON report with the missing metadata fields
                                  SOMEF was not able to find. The report will
                                  be placed in  $PATH_missing.json, where
                                  $PATH is -o, -c or -g.

  -h, --help                      Show this message and exit.
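The constraints in the help text above (exactly one input option, at least one output option, a required threshold) can be checked before calling the tool from a script. The following Python sketch assembles the argument list using only the short flags listed above. The helper name `build_describe_command` and its keyword arguments are hypothetical, not part of SOMEF.

```python
def build_describe_command(threshold, *, repo_url=None, doc_src=None, in_file=None,
                           output=None, codemeta_out=None, graph_out=None,
                           graph_format=None, pretty=False, missing=False):
    """Return an argument list for `somef describe` (hypothetical helper)."""
    inputs = [("-r", repo_url), ("-d", doc_src), ("-i", in_file)]
    outputs = [("-o", output), ("-c", codemeta_out), ("-g", graph_out)]
    given_inputs = [(flag, value) for flag, value in inputs if value is not None]
    given_outputs = [(flag, value) for flag, value in outputs if value is not None]

    # The help text marks the input group as mutually exclusive and required.
    if len(given_inputs) != 1:
        raise ValueError("pass exactly one of repo_url, doc_src or in_file")
    # The help text marks the output group as "required_any".
    if not given_outputs:
        raise ValueError("pass at least one of output, codemeta_out or graph_out")
    if graph_format not in (None, "turtle", "json-ld"):
        raise ValueError("graph_format must be 'turtle' or 'json-ld'")

    command = ["somef", "describe", "-t", str(threshold)]
    for flag, value in given_inputs + given_outputs:
        command += [flag, str(value)]
    if graph_format is not None:
        command += ["-f", graph_format]
    if pretty:
        command.append("-p")
    if missing:
        command.append("-m")
    return command
```

Once SOMEF is installed and configured, the returned list could be passed to `subprocess.run(command, check=True)`.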

Usage example:

The following command extracts all metadata available from https://github.com/dgarijo/Widoco/.

somef describe -r https://github.com/dgarijo/Widoco/ -o test.json -t 0.8
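After the command finishes, the file passed to -o can be inspected from Python. The sketch below only relies on the output being JSON, as stated in the help text. That the top level is an object is an assumption (the layout is not described here), and `list_extracted_keys` is a hypothetical helper name.

```python
import json


def list_extracted_keys(path):
    """Return the sorted top-level keys of a SOMEF JSON output file."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    # Assumption: the top level is a JSON object; fail loudly otherwise.
    if not isinstance(data, dict):
        raise TypeError("expected a JSON object at the top level")
    return sorted(data)
```

For the example above, `list_extracted_keys("test.json")` would list whichever entries SOMEF wrote for the Widoco repository.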

Try SOMEF in Binder with our sample notebook.

Add/Remove a Category:

To run a classifier with an additional category, add a corresponding path entry to config.json and add the category type to the category variable in cli.py. To remove an existing category, remove its path entry and its category type from the same two places.

Contribute:

If you want to contribute with a pull request, please do so by submitting it to the dev branch.

Next features:

To see upcoming features, please have a look at our open issues and milestones.

