Repository Scanner - Version Control System - Scraper

Repository Scanner Version Control System Scraper (RESC-VCS-SCRAPER)

Note: This component is part of Repository Scanner - resc

Table of Contents

  1. About the component
  2. Getting started
  3. Testing

About the component

The RESC-VCS-Scraper component collects all projects and repositories from multiple VCS providers. The supported VCS providers are Bitbucket, Azure Repos, and GitHub.

This component includes two main modules: the project collector and the repository collector. The project collector collects all projects and sends them to the projects queue. The repository collector reads projects from the projects queue, fetches the repositories of each project, and sends them to the repositories queue.
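The two-stage flow described above can be sketched with in-memory queues. This is only an illustration of the data flow, not the component's implementation: the real scraper uses RabbitMQ and Celery, and `FAKE_VCS`, the function bodies, and the `repository_name` field are hypothetical stand-ins.

```python
from queue import Queue

# Hypothetical in-memory data standing in for a VCS provider's API.
FAKE_VCS = {
    "kubernetes": ["kubernetes", "minikube"],
    "docker": ["compose"],
}


def collect_projects(project_queue: Queue) -> None:
    """Stage 1: publish every project key to the projects queue."""
    for project_key in FAKE_VCS:
        project_queue.put({"project_key": project_key})


def collect_repositories(project_queue: Queue, repository_queue: Queue) -> None:
    """Stage 2: consume projects and publish the repositories of each one."""
    while not project_queue.empty():
        project = project_queue.get()
        for repository_name in FAKE_VCS[project["project_key"]]:
            repository_queue.put(
                {
                    "project_key": project["project_key"],
                    "repository_name": repository_name,  # assumed field name
                }
            )


def run_pipeline() -> list:
    """Run both stages and return everything left on the repositories queue."""
    project_queue, repository_queue = Queue(), Queue()
    collect_projects(project_queue)
    collect_repositories(project_queue, repository_queue)
    return list(repository_queue.queue)


if __name__ == "__main__":
    for message in run_pipeline():
        print(message)
```

Decoupling the two stages through a queue means the repository collector can be scaled independently of the project collector.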

Getting started

These instructions will help you to get a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

Run locally from source

Prerequisites: RabbitMQ must be up and running locally.
If you have already deployed RESC through Helm in Kubernetes, then RabbitMQ is already running for you.
Clone the repository, open a Git Bash terminal in the /components/resc-vcs-scraper folder, and run the commands below.

1. Create virtual environment:

cd components/resc-vcs-scraper
pip install virtualenv
virtualenv venv
source venv/Scripts/activate   # Git Bash on Windows; on Linux/macOS use: source venv/bin/activate

2. Install resc_vcs_scraper package:

pip install -e .

3. Set below environment variables:

export RESC_RABBITMQ_SERVICE_HOST=127.0.0.1   # The hostname/IP address of the RabbitMQ server
export RESC_RABBITMQ_SERVICE_PORT_AMQP=30902  # The AMQP port of the RabbitMQ server
export RABBITMQ_DEFAULT_VHOST=resc-rabbitmq   # The virtual host name of the RabbitMQ server
export RABBITMQ_QUEUES_USERNAME=queue_user    # The username used to connect to the RabbitMQ projects and repositories topics
export RABBITMQ_QUEUES_PASSWORD=""            # The password used to connect to the RabbitMQ projects and repositories topics; see the queues_password field in the /deployment/kubernetes/example-values.yaml file
export VCS_INSTANCES_FILE_PATH=""             # The absolute path to the vcs_instances_config.json file containing the VCS instance definitions
export GITHUB_PUBLIC_USERNAME=""              # Your GitHub username
export GITHUB_PUBLIC_TOKEN=""                 # Your GitHub personal access token

Replace the empty values of RABBITMQ_QUEUES_PASSWORD, VCS_INSTANCES_FILE_PATH, GITHUB_PUBLIC_USERNAME and GITHUB_PUBLIC_TOKEN with the correct values for your setup.
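Before starting the scraper, it can help to confirm that none of these variables is missing or still empty. The following is a minimal sketch, not part of the package: the `missing_variables` helper is hypothetical, and the list simply mirrors the variables named in step 3.

```python
import os

# Variable names copied from step 3 of this guide.
REQUIRED_VARIABLES = [
    "RESC_RABBITMQ_SERVICE_HOST",
    "RESC_RABBITMQ_SERVICE_PORT_AMQP",
    "RABBITMQ_DEFAULT_VHOST",
    "RABBITMQ_QUEUES_USERNAME",
    "RABBITMQ_QUEUES_PASSWORD",
    "VCS_INSTANCES_FILE_PATH",
    "GITHUB_PUBLIC_USERNAME",
    "GITHUB_PUBLIC_TOKEN",
]


def missing_variables(environ=None) -> list:
    """Return the required variables that are unset or set to an empty string."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Missing or empty variables: " + ", ".join(missing))
    else:
        print("All required variables are set.")
```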

4. Run the collect_projects task:

The collect_projects task collects all projects from a given Version Control System instance, then writes the found projects to a RabbitMQ channel called 'projects'.

This can be done via the command:

collect_projects

Structure of vcs_instances_config.json

The vcs_instances_config.json file must have the following format. Note: You can add multiple VCS instances.

Example:

{
  "vcs_instance_1": {
    "name": "GITHUB_PUBLIC",
    "scope": ["kubernetes"],
    "exceptions": [],
    "provider_type": "GITHUB_PUBLIC",
    "hostname": "github.com",
    "port": "443",
    "scheme": "https",
    "username": "GITHUB_PUBLIC_USERNAME",
    "token": "GITHUB_PUBLIC_TOKEN",
    "organization": ""
  }
}
  • scope: List of GitHub accounts you want to scan. For example, to scan all the repositories of the following GitHub accounts:
    https://github.com/kubernetes
    https://github.com/docker

    set the scope to ["kubernetes", "docker"]. All the repositories from those accounts will be scanned.

  • exceptions (optional): Accounts to exclude from the scan. Defaults to an empty list.
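To illustrate how scope and exceptions relate, here is a small sketch that parses a config of this shape and computes which accounts would remain. The `accounts_to_scan` helper and the rule "scope minus exceptions" are assumptions made for illustration; the scraper's own filtering logic may differ.

```python
import json

# A trimmed-down config in the format described above (illustrative values).
EXAMPLE_CONFIG = """
{
  "vcs_instance_1": {
    "name": "GITHUB_PUBLIC",
    "scope": ["kubernetes", "docker"],
    "exceptions": ["docker"],
    "provider_type": "GITHUB_PUBLIC"
  }
}
"""


def accounts_to_scan(instance: dict) -> list:
    """Assumed rule: every account in scope that is not listed in exceptions."""
    exceptions = set(instance.get("exceptions", []))
    return [account for account in instance.get("scope", []) if account not in exceptions]


instances = json.loads(EXAMPLE_CONFIG)

if __name__ == "__main__":
    for instance in instances.values():
        print(instance["name"], accounts_to_scan(instance))
```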

The output messages of the collect_projects command have the following format:

{
  "project_key": "kubernetes",
  "vcs_instance_name": "GITHUB_PUBLIC"
}
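A consumer of these messages could decode them into a typed object. The sketch below assumes the message body is a JSON string with exactly the two fields shown above; the `ProjectMessage` class and `parse_project_message` function are hypothetical names, not part of the package.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectMessage:
    """Typed view of one message on the 'projects' channel (assumed shape)."""

    project_key: str
    vcs_instance_name: str


def parse_project_message(body: str) -> ProjectMessage:
    """Decode a JSON message body; raises KeyError if a field is missing."""
    payload = json.loads(body)
    return ProjectMessage(
        project_key=payload["project_key"],
        vcs_instance_name=payload["vcs_instance_name"],
    )


if __name__ == "__main__":
    raw = '{"project_key": "kubernetes", "vcs_instance_name": "GITHUB_PUBLIC"}'
    print(parse_project_message(raw))
```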

5. Run the collect all repositories task:

This task collects all repositories from a single VCS project, then writes the found repositories to a RabbitMQ channel called 'repositories'.

This can be done via the command:

celery -A vcs_scraper.repository_collector.common worker --loglevel=INFO -E -Q projects

Run locally using Docker

Run the RESC VCS Scraper Docker image locally by running the following commands:

  • Pull the Docker image from the registry:

docker pull rescabnamro/resc-vcs-scraper:latest

  • Alternatively, build the Docker image locally by running:

docker build -t rescabnamro/resc-vcs-scraper:latest .

  • Run the vcs-scraper using the command below:

docker run \
  -v <path to vcs_instances_config.json in your local system>:/tmp/vcs_instances_config.json \
  -e RESC_RABBITMQ_SERVICE_HOST="host.docker.internal" \
  -e RESC_RABBITMQ_SERVICE_PORT_AMQP=30902 \
  -e RABBITMQ_DEFAULT_VHOST=resc-rabbitmq \
  -e RABBITMQ_QUEUES_USERNAME=queue_user \
  -e RABBITMQ_QUEUES_PASSWORD="<the password of queue_user>" \
  -e VCS_INSTANCES_FILE_PATH="/tmp/vcs_instances_config.json" \
  -e GITHUB_PUBLIC_USERNAME="<your github username>" \
  -e GITHUB_PUBLIC_TOKEN="<your github personal access token>" \
  --name resc-vcs-scraper \
  rescabnamro/resc-vcs-scraper:latest collect_projects

To create the vcs_instances_config.json file, refer to the section "Structure of vcs_instances_config.json".

Testing

Run the commands below to check that the unit tests pass and that the code meets the quality standards:

Note: To run these tests you need to install tox. This works on Linux, and on Windows with Git Bash.

pip install tox      # install tox locally

tox -v -e sort       # Run this command to validate the import sorting
tox -v -e lint       # Run this command to lint the code according to this repository's standard
tox -v -e pytest     # Run this command to run the unit tests
tox -v               # Run this command to run all of the above tests
