Package for extracting software repository metadata

Scraper

Scraper is a tool for scraping and visualizing open source data from various code hosting platforms, such as GitHub.com, GitHub Enterprise, GitLab.com, hosted GitLab, and Bitbucket Server.

Getting Started: Code.gov

Code.gov is a newly launched website of the US Federal Government that allows the People to access metadata about the government's custom-developed software. The site requires metadata to function, and this Python library can help generate it!

To get started, you will need a GitHub personal access token to make requests to the GitHub API. Set it under the name GITHUB_API_TOKEN, either in your current shell environment or in your shell rc file:

    $ export GITHUB_API_TOKEN=XYZ

    $ echo "export GITHUB_API_TOKEN=XYZ" >> ~/.bashrc

Additionally, to perform the labor hours estimation, you will need to install cloc into your environment. This is typically done with a package manager such as npm or Homebrew.
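Both prerequisites can be checked before running anything. The following is a minimal sketch, not part of Scraper itself: the function name check_prerequisites is hypothetical, and it only tests that GITHUB_API_TOKEN is non-empty and that a cloc executable is on PATH. It does not verify that the token is valid.

```python
import os
import shutil


def check_prerequisites(environ=None, which=shutil.which):
    """Return a list of problems; an empty list means both prerequisites were found.

    `environ` and `which` are injectable only so the check is easy to test.
    """
    environ = os.environ if environ is None else environ
    problems = []
    # The README asks for the token under this exact variable name.
    if not environ.get("GITHUB_API_TOKEN"):
        problems.append("GITHUB_API_TOKEN is not set")
    # cloc is needed for the labor hours estimation.
    if which("cloc") is None:
        problems.append("cloc was not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("missing prerequisite:", problem)
```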

Then, to generate a code.json file for your agency, you will need a config.json file to coordinate the platforms you will connect to and scrape data from. An example config file can be found in demo.json. Once you have your config file, you are ready to install and run the scraper!

    # Install Scraper from a local copy of this repository
    $ pip install -e .
    # OR
    # Install Scraper from PyPI
    $ pip install llnl-scraper

    # Run Scraper with your config file, config.json
    $ scraper --config config.json

A full example of the resulting code.json file can be found here.
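Once a run has finished, a quick look at the output can confirm that something was inventoried. The sketch below is illustrative only. It assumes the output was written to code.json in the current directory and that it follows the Code.gov metadata layout with a top-level "releases" list whose entries carry a "name" field. Neither assumption is stated in this README, so adjust the keys to match your actual file. The function name summarize_code_json is hypothetical.

```python
import json


def summarize_code_json(path="code.json"):
    """Summarize a generated metadata file.

    Assumes (not confirmed by this README) a top-level "releases" list
    whose items have a "name" key; missing keys are tolerated.
    """
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    releases = data.get("releases", [])
    return {
        "agency": data.get("agency"),
        "release_count": len(releases),
        "names": [release.get("name") for release in releases],
    }
```

Calling summarize_code_json() with no arguments reads code.json from the current directory. Pass a different path if your output lives elsewhere.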

Config File Options

The configuration file is a JSON file that specifies which repository platforms to pull projects from, as well as some settings that can be used to override incomplete or inaccurate data returned by the scrape.

The basic structure is:

{
    // REQUIRED
    "contact_email": "...",  // Used when the contact email cannot be found otherwise

    // OPTIONAL
    "agency": "...",         // Your agency abbreviation here
    "organization": "...",   // The organization within the agency
    "permissions": { ... },  // Object containing default values for usageType and exemptionText

    // Platform configurations, described in more detail below
    "GitHub": [ ... ],
    "GitLab": [ ... ],
    "Bitbucket": [ ... ],
}

"GitHub": [
    {
        "url": "https://github.com",  // GitHub.com or GitHub Enterprise URL to inventory
        "token": null,                // Private token for accessing this GitHub instance
        "public_only": true,          // Only inventory public repositories

        "connect_timeout": 4,  // The timeout in seconds for connecting to the server
        "read_timeout": 10,    // The timeout in seconds to wait for a response from the server

        "orgs": [ ... ],    // List of organizations to inventory
        "repos": [ ... ],   // List of single repositories to inventory
        "exclude": [ ... ]  // List of organizations / repositories to exclude from inventory
    }
],

"GitLab": [
    {
        "url": "https://gitlab.com",  // GitLab.com or hosted GitLab instance URL to inventory
        "token": null,                // Private token for accessing this GitLab instance
        "fetch_languages": false,     // Make an individual API call per project for language metadata. Very slow, so defaults to false (e.g., for 191 projects on an internal server: 5 seconds when false; 12 minutes, 38 seconds when true)

        "orgs": [ ... ],    // List of organizations to inventory
        "repos": [ ... ],   // List of single repositories to inventory
        "exclude": [ ... ]  // List of groups / repositories to exclude from inventory
    }
]

"Bitbucket": [
    {
        "url": "https://bitbucket.internal",  // Base URL for a Bitbucket Server instance
        "username": "",                       // Username to authenticate with
        "password": "",                       // Password to authenticate with
        "token": "",                          // Token to authenticate with; if supplied, username and password are ignored

        "exclude": [ ... ]  // List of projects / repositories to exclude from inventory
    }
]

"TFS": [
    {
        "url": "https://tfs.internal",  // Base URL for a Team Foundation Server (TFS) or Visual Studio Team Services (VSTS) instance
        "token": null,                  // Private token for accessing this TFS instance

        "exclude": [ ... ]  // List of projects / repositories to exclude from inventory
    }
]

"AzureDevOps": [
    {
        "url": "https://dev.azure.com",  // Base URL for an Azure DevOps Server or Azure DevOps cloud instance
        "token": null,                   // Personal Access Token for accessing this ADO instance
        "apiVersion": "",                // API version

        "exclude": [ ... ]  // List of projects to exclude from inventory
    }
]
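
The // comments and [ ... ] placeholders in the listings above are annotations. Standard JSON parsers, including Python's json module, reject them, so a real config.json should contain neither. One way to avoid hand-editing mistakes is to build the file from a Python dict. This is a minimal sketch using only keys documented above. The helper names build_config and write_config, the email address, and the organization name are hypothetical placeholders.

```python
import json


def build_config(contact_email, github_orgs, agency=None):
    """Assemble a minimal config dict with a single GitHub.com entry."""
    if not contact_email:
        # contact_email is the only key the listing above marks as REQUIRED.
        raise ValueError("contact_email is required")
    config = {
        "contact_email": contact_email,
        "GitHub": [
            {
                "url": "https://github.com",
                "token": None,        # serialized as JSON null
                "public_only": True,  # serialized as JSON true
                "orgs": list(github_orgs),
            }
        ],
    }
    if agency:
        config["agency"] = agency
    return config


def write_config(config, path="config.json"):
    """Write the config as plain JSON, without comments."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=4)
        handle.write("\n")


if __name__ == "__main__":
    # Placeholder values; substitute your own.
    write_config(build_config("opensource@example.gov", ["example-org"], agency="EX"))
```

The resulting file can then be passed to the scraper with --config config.json, as shown earlier.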

License

Scraper is released under an MIT license. For more details see the LICENSE file.

LLNL-CODE-705597

