
Make ZIM files from DevDocs.io


Devdocs scraper

This scraper downloads devdocs.io documentation databases and packages them as ZIM files, a clean and user-friendly format for storing content for offline use.

License: GPL v3

Installation

There are three main ways to install and use devdocs2zim, listed from most to least recommended:

Install using a pre-built container
  1. Download the image using docker:

    docker pull ghcr.io/openzim/devdocs
    
Build your own container
  1. Clone the repository locally:

    git clone https://github.com/openzim/devdocs.git && cd devdocs
    
  2. Build the image:

    docker build -t ghcr.io/openzim/devdocs .
    
Run the software locally using Hatch
  1. Clone the repository locally:

    git clone https://github.com/openzim/devdocs.git && cd devdocs
    
  2. Install Hatch:

    pip3 install hatch
    
  3. Start a Hatch shell, which installs the software and its dependencies in an isolated virtual environment:

    hatch shell
    
  4. Run the devdocs2zim command:

    devdocs2zim --help
    

Usage

[!WARNING] This project is still a work in progress and isn't ready for use yet. The commands below are examples only.

    # Usage
    docker run -v my_dir:/output ghcr.io/openzim/devdocs devdocs2zim [--all|--slug=SLUG|--first=N]

    # Fetch all documents
    docker run -v my_dir:/output ghcr.io/openzim/devdocs devdocs2zim --all

    # Fetch all documents except Ansible
    docker run -v my_dir:/output ghcr.io/openzim/devdocs devdocs2zim --all --skip-slug-regex "^ansible.*"

    # Fetch Vue related documents
    docker run -v my_dir:/output ghcr.io/openzim/devdocs devdocs2zim --slug vue~3 --slug vue_router~4

    # Fetch the docs for the two most recent versions of each software
    docker run -v my_dir:/output ghcr.io/openzim/devdocs devdocs2zim --first=2

One of the following flags is required:

  • --all: Fetch all Devdocs resources, and produce one ZIM per resource.
  • --slug SLUG: Fetch the provided Devdocs resource. Slugs are the first path entry in the Devdocs URL. For example, the slug for: https://devdocs.io/gcc~12/ is gcc~12. Use --slug several times to add multiple.
  • --first N: Fetch the first N items per slug, in the order shown in the DevDocs UI.
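
As a rough illustration of the slug rules above, the sketch below extracts a slug from a DevDocs URL and keeps the first N entries per software. It is not taken from devdocs2zim: the helper names `slug_from_url` and `first_n_per_software` are hypothetical, the sample slug list is made up for the example, and the grouping assumes that everything before `~` identifies the software and that the input is already in DevDocs UI order.

```python
from urllib.parse import urlparse


def slug_from_url(url):
    """Return the first path entry of a DevDocs URL (hypothetical helper)."""
    path = urlparse(url).path
    # Drop the empty segments produced by leading and trailing slashes.
    segments = [part for part in path.split("/") if part]
    if not segments:
        raise ValueError(f"no slug found in {url!r}")
    return segments[0]


def first_n_per_software(slugs, n):
    """Keep the first n slugs seen for each software, preserving input order.

    Assumption: the text before '~' names the software, and the input list
    is already ordered the way the DevDocs UI shows it.
    """
    seen = {}
    kept = []
    for slug in slugs:
        software = slug.split("~", 1)[0]
        seen[software] = seen.get(software, 0) + 1
        if seen[software] <= n:
            kept.append(slug)
    return kept


print(slug_from_url("https://devdocs.io/gcc~12/"))
# Sample slugs for illustration only.
print(first_n_per_software(["gcc~13", "gcc~12", "gcc~11", "vue~3", "vue~2"], 2))
```

With the sample list, the second call drops only `gcc~11`, since it is the third entry for `gcc`.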

Optional Flags:

  • --skip-slug-regex REGEX: Skips slugs matching the given regular expression.
  • --output OUTPUT_FOLDER: Output folder for ZIMs. Default: /output
  • --creator CREATOR: Name of content creator. Default: 'DevDocs'
  • --publisher PUBLISHER: Custom publisher name. Default: 'openZIM'
  • --name-format FORMAT: Custom name format for individual ZIMs. Default: 'devdocs_{slug_without_version}_{version}'
  • --title-format FORMAT: Custom title format for individual ZIMs. Value will be truncated to 30 chars. Default: '{full_name} Documentation'
  • --description-format FORMAT: Custom description format for individual ZIMs. Value will be truncated to 80 chars. Default: '{full_name} Documentation'
  • --long-description-format FORMAT: Custom long description format for your ZIM. Value will be truncated to 4000 chars. Default: '{full_name} documentation by DevDocs'
  • --tag TAG: Add tag to the ZIM. Use --tag several times to add multiple. Formatting is supported. Default: ['devdocs', '{slug_without_version}']
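
To preview which slugs a --skip-slug-regex value would exclude, a small filter like the one below may help. This is a sketch, not the tool's code: `filter_slugs` is a hypothetical helper, and the use of `re.match` (anchored at the start of the slug) is an assumption about how the pattern is applied. For a pattern such as `^ansible.*`, matching, searching, and full matching all give the same result.

```python
import re


def filter_slugs(slugs, skip_pattern=None):
    """Return the slugs that do not match skip_pattern (hypothetical helper)."""
    if skip_pattern is None:
        return list(slugs)
    pattern = re.compile(skip_pattern)
    # Assumption: the pattern is tested from the start of each slug.
    return [slug for slug in slugs if not pattern.match(slug)]


# Sample slugs for illustration only.
candidates = ["ansible", "ansible~2.11", "vue~3", "vue_router~4"]
print(filter_slugs(candidates, "^ansible.*"))
```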

Formatting Placeholders

The following formatting placeholders are supported:

  • {name}: Human-readable name of the resource, e.g. Python.
  • {full_name}: Name of the resource with its version, if any, e.g. Python 3.12.
  • {slug}: Devdocs slug for the resource, e.g. python~3.12.
  • {clean_slug}: Slug with every character other than letters, digits, and periods replaced with -, e.g. python-3.12.
  • {slug_without_version}: Devdocs slug for the resource without the version, e.g. python.
  • {version}: Shortened version displayed in devdocs, if any, e.g. 3.12.
  • {release}: Specific release of the software the documentation is for, if any, e.g. 3.12.1.
  • {attribution}: License and attribution information about the resource.
  • {home_link}: Link to the project's home page, if any, e.g. https://python.org.
  • {code_link}: Link to the project's source, if any, e.g. https://github.com/python/cpython.
  • {period}: The current date in YYYY-MM format, e.g. 2024-02.
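
The placeholders behave like Python `str.format` fields, so their expansion can be previewed with a short script. The sketch below is illustrative only: the `render` helper is hypothetical, the values come from the examples in the list above, and truncation is modeled as a plain character slice, which may differ from how devdocs2zim shortens values.

```python
import re

# Example values taken from the placeholder list above.
placeholders = {
    "name": "Python",
    "full_name": "Python 3.12",
    "slug": "python~3.12",
    "slug_without_version": "python",
    "version": "3.12",
}
# clean_slug: replace anything that is not a letter, digit, or period with '-'.
placeholders["clean_slug"] = re.sub(r"[^A-Za-z0-9.]", "-", placeholders["slug"])


def render(template, max_chars=None):
    """Expand placeholders, optionally truncating (hypothetical helper)."""
    text = template.format(**placeholders)
    # Assumption: truncation is a simple cut at max_chars characters.
    return text if max_chars is None else text[:max_chars]


print(render("devdocs_{slug_without_version}_{version}"))  # devdocs_python_3.12
print(render("{full_name} Documentation", 30))  # Python 3.12 Documentation
print(placeholders["clean_slug"])  # python-3.12
```

The two templates are the documented defaults for --name-format and --title-format; the default title fits within the 30-character limit for this example, so nothing is cut.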

Developing

Use the commands below to set up the project once:

    # Install hatch if it isn't installed already.
    pip install hatch

    # Local install (in default env) / re-sync packages
    hatch run pip list

    # Set-up pre-commit
    pre-commit install

The following commands can be used to build and test the scraper:

    # Show scripts
    hatch env show

    # linting, testing, coverage, checking
    hatch run lint:all
    hatch run lint:fixall

    # run tests on all matrixed envs
    hatch run test:run

    # run tests in a single matrixed env
    hatch env run -e test -i py=3.12 coverage

    # run static type checks
    hatch env run check:all

    # building packages
    hatch build

Contributing

This project adheres to openZIM's Contribution Guidelines.

This project has implemented openZIM's Python bootstrap, conventions and policies v1.0.3.

