Make ZIM file from Gutenberg books

Gutenberg Offline

This scraper downloads the whole Project Gutenberg library and puts it in a ZIM file, a clean and user-friendly format for storing content for offline use.

Warning: This scraper is known to have a serious flaw. A critical bug (https://github.com/openzim/gutenberg/issues/219) leads to incomplete archives. Fixing it appears to require the complete rewrite of the scraper logic tracked in https://github.com/openzim/gutenberg/issues/97. We currently lack the bandwidth to make these changes. Help is welcome, but be warned that this is a significant project: changing the scraper logic so that the issue can be fixed is estimated at 10 person-days at least, and probably double that, since such estimates tend to be optimistic.

Getting Started

The recommended way to run the Gutenberg scraper is using Docker, as it comes with all required dependencies pre-installed.

Running with Docker

Run the scraper with Docker:

docker run -it --rm -v $(pwd)/output:/output ghcr.io/openzim/gutenberg:latest gutenberg2zim

The -v $(pwd)/output:/output option mounts the output folder in your current directory to the /output folder inside the container (which is the working directory). This ensures that the ZIM file is saved to your local machine.
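The mount described above can be wrapped in a small helper. This is a minimal sketch, not part of the project: the function name run_gutenberg and the OUT_DIR and DRY_RUN variables are hypothetical. It creates the host folder first (on Linux, a folder created by the Docker daemon is typically owned by root) and quotes the path so that directories containing spaces still work.

```shell
# Sketch: prepare the host output folder, then print or run the docker command.
# run_gutenberg, OUT_DIR and DRY_RUN are illustrative names, not gutenberg2zim options.
run_gutenberg() {
    out_dir="${OUT_DIR:-$(pwd)/output}"
    # Create the folder up front so it belongs to the current user.
    mkdir -p "$out_dir" || return 1
    # Rebuild the argument list: docker command first, caller's arguments last.
    set -- docker run -it --rm -v "$out_dir:/output" \
        ghcr.io/openzim/gutenberg:latest gutenberg2zim "$@"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"   # show the command without running it
    else
        "$@"
    fi
}
```

With DRY_RUN=1 set, calling run_gutenberg -l en only prints the command, which is a convenient way to check the mount path before starting a long download.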

To view all available options for gutenberg2zim, run:

docker run ghcr.io/openzim/gutenberg:latest gutenberg2zim --help

Arguments

Customize the content download with the following options. For example, to download books in English or French with IDs 100 to 200 and only in PDF format:

docker run -it --rm -v $(pwd)/output:/output ghcr.io/openzim/gutenberg:latest gutenberg2zim -l en,fr -f pdf --books 100-200 --bookshelves --title-search

This downloads the books with IDs 100 to 200 that are in English or French, restricted to the PDF format by -f pdf, and adds the bookshelves and the title search field to the ZIM. The -it flags allow you to see progress, and the --rm flag removes the container after completion.

You can find the full arguments list below:

-h --help                       Display this help message
-y --wipe-db                    Empty cached book metadata
-F --force                      Redo step even if target already exists

-l --languages=<list>           Comma-separated list of lang codes to filter export to (preferably ISO 639-1, else ISO 639-3)
-f --formats=<list>             Comma-separated list of formats to filter export to (epub, html, pdf, all)

-e --static-folder=<folder>     Use-as/Write-to this folder static HTML
-z --zim-file=<file>            Write ZIM into this file path
-t --zim-title=<title>          Set ZIM title
-n --zim-desc=<description>     Set ZIM description
-L --zim-long-desc=<description> Set ZIM long description
-d --dl-folder=<folder>         Folder to use/write-to downloaded ebooks
-u --rdf-url=<url>              Alternative rdf-files.tar.bz2 URL
-b --books=<ids>                Execute the processes for specific books, separated by commas, or dashes for intervals
-c --concurrency=<nb>           Number of concurrent processes for processing tasks
--dlc=<nb>                      Number of concurrent *download* processes (overrides --concurrency); useful if the server blocks high request rates
-m --one-language-one-zim=<folder> With more than one language, create one ZIM per language (and one with all)
--no-index                      Do NOT create full-text index within ZIM file
--prepare                       Download rdf-files.tar.bz2
--parse                         Parse all RDF files and fill-up the DB
--download                      Download ebooks based on filters
--zim                           Create a ZIM file
--title-search                  Add field to search a book by title and directly jump to it
--bookshelves                   Add bookshelves
--optimization-cache=<url>      URL with credentials to S3 bucket for using as optimization cache
--use-any-optimized-version     Try to use any optimized version found on optimization cache
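To illustrate the notation accepted by --books (IDs separated by commas, with dashes for intervals), the sketch below expands such a specification into one ID per line. It only illustrates the syntax and is not the scraper's own parser: expand_books is a hypothetical helper, and it assumes both bounds of an interval are included.

```shell
# Expand a spec such as "100-102,250" into one ID per line.
# expand_books is a hypothetical helper, not part of gutenberg2zim.
expand_books() {
    old_ifs=$IFS
    IFS=,
    set -- $1            # split the spec on commas
    IFS=$old_ifs
    for part in "$@"; do
        case $part in
            *-*)
                # Interval: count from the lower to the upper bound (assumed inclusive).
                i=${part%-*}
                end=${part#*-}
                while [ "$i" -le "$end" ]; do
                    printf '%s\n' "$i"
                    i=$((i + 1))
                done
                ;;
            *)
                printf '%s\n' "$part"   # single ID
                ;;
        esac
    done
}
```

For example, expand_books "100-102,250" lists the four IDs 100, 101, 102 and 250.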

Contributing Code

Main coding guidelines are from the openZIM Wiki.

Setting Up the Environment

This section sets up everything needed to run the scraper from source on your machine, assuming you want to modify it. If you simply want to run the tool, either install the PyPI package or use the Docker image. The Docker image can also be used for development, but it needs some tweaking to live-reload your code modifications.

Install the dependencies

First, ensure you use the proper Python version, in line with the requirement of pyproject.toml (you might for instance use pyenv to manage multiple Python versions in parallel).
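One way to look up that requirement is sketched below. It assumes pyproject.toml declares the standard requires-python field on a single line. The helper name required_python is hypothetical.

```shell
# Print the value of the "requires-python" field from a pyproject.toml file.
# required_python is a hypothetical helper; it defaults to ./pyproject.toml.
required_python() {
    sed -n 's/^[[:space:]]*requires-python[[:space:]]*=[[:space:]]*"\(.*\)".*/\1/p' \
        "${1:-pyproject.toml}"
}
```

Run required_python from the repository root, then compare its output with the version reported by python3 --version.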

You then need to install the various tools/libraries needed by the scraper.

The setup is divided into two categories: one for simply running the scraper, and another for setting up a development environment to contribute improvements.

For Users Running the Scraper:

GNU/Linux (Debian/Ubuntu)

sudo apt update && sudo apt install -y python3-pip zim-tools

Fedora

sudo dnf install -y python3-pip zim-tools

Arch Linux

sudo pacman -S python-pip zim-tools

macOS

brew install zim-tools

For Developers Contributing & Modifying:

GNU/Linux (Debian/Ubuntu)

sudo apt update && sudo apt install -y python3-pip zim-tools

Fedora

sudo dnf install -y python3-pip zim-tools

Arch Linux

sudo pacman -S python-pip zim-tools

macOS

brew install zim-tools

Set up the package

First, clone this repository.

git clone git@github.com:openzim/gutenberg.git
cd gutenberg

If you do not already have it on your system, install hatch to build the software and manage virtual environments (you might be interested in our detailed Developer Setup as well).

pip3 install hatch

Start a hatch shell: this will install software including dependencies in an isolated virtual environment.

hatch shell

That's it. You can now run gutenberg2zim from your terminal.

License

GPLv3 or later, see LICENSE for more details.

Release history

2.2.0 (this version): Jun 6, 2025
2.1.1: Jan 17, 2024
2.1.0: Aug 18, 2023
2.0.0: Feb 22, 2023
1.1.9: Mar 11, 2022
1.1.8: Aug 2, 2021
1.1.7: Jul 28, 2021
1.1.6: Jun 10, 2021
1.1.5: Jul 13, 2020
1.1.4: Sep 19, 2019
1.1.3.0: Sep 9, 2019
1.1.2: Oct 14, 2018
1.1.2b0 (pre-release): Oct 14, 2018
1.1.1: Jul 6, 2018
1.1.0: Jun 22, 2018
1.0.7: Apr 8, 2017
1.0.6: Apr 8, 2017

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gutenberg2zim-2.2.0.tar.gz (2.0 MB)

Uploaded Jun 6, 2025 Source

Built Distribution

gutenberg2zim-2.2.0-py3-none-any.whl (1.9 MB)

Uploaded Jun 6, 2025 Python 3

File details

Details for the file gutenberg2zim-2.2.0.tar.gz.

File metadata

Download URL: gutenberg2zim-2.2.0.tar.gz
Upload date: Jun 6, 2025
Size: 2.0 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for gutenberg2zim-2.2.0.tar.gz
Algorithm	Hash digest
SHA256	`cc08c10497b25fe84be6ff275b0c34206b14fc91119f7600b54fe90b3ff9ede2`
MD5	`8b68354662ad6edd1556b334943c2575`
BLAKE2b-256	`e342e40078e26df6ae1f9d19d7ef894a5380e2784ddf5fafd88bff258945b51e`

See more details on using hashes here.
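A downloaded file can be checked against the SHA256 digest listed above. The following is a minimal sketch: verify_sha256 is a hypothetical helper, and it assumes either sha256sum (GNU coreutils) or shasum (commonly available on macOS) is installed.

```shell
# verify_sha256 FILE EXPECTED_HEX: succeed only if the SHA256 digest of FILE matches.
# verify_sha256 is a hypothetical helper, not part of gutenberg2zim or pip.
verify_sha256() {
    if command -v sha256sum >/dev/null 2>&1; then
        actual=$(sha256sum "$1" | cut -d' ' -f1)
    else
        actual=$(shasum -a 256 "$1" | cut -d' ' -f1)   # fallback, e.g. on macOS
    fi
    [ "$actual" = "$2" ]
}
```

For the source distribution, this would be called as verify_sha256 gutenberg2zim-2.2.0.tar.gz cc08c10497b25fe84be6ff275b0c34206b14fc91119f7600b54fe90b3ff9ede2, and the exit status tells whether the file matches.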

Provenance

The following attestation bundles were made for gutenberg2zim-2.2.0.tar.gz:

Publisher: Publish.yaml on openzim/gutenberg

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: gutenberg2zim-2.2.0.tar.gz
- Subject digest: cc08c10497b25fe84be6ff275b0c34206b14fc91119f7600b54fe90b3ff9ede2
- Sigstore transparency entry: 231069752
- Sigstore integration time: Jun 6, 2025
Source repository:
- Permalink: openzim/gutenberg@9c7f2ef61b6e31087b13d0c3e3fd9ea447c7977a
- Branch / Tag: refs/tags/v2.2.0
- Owner: https://github.com/openzim
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: Publish.yaml@9c7f2ef61b6e31087b13d0c3e3fd9ea447c7977a
- Trigger Event: release

File details

Details for the file gutenberg2zim-2.2.0-py3-none-any.whl.

File metadata

Download URL: gutenberg2zim-2.2.0-py3-none-any.whl
Upload date: Jun 6, 2025
Size: 1.9 MB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for gutenberg2zim-2.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d6673f8c3d9f3b1304487debc1ee88c3ee102875cf737b6b98b3f7515fd76ad4`
MD5	`6f58392825f70ed9cbb2bff8abccbfac`
BLAKE2b-256	`8a83125a04841c519ed4e3baca3a22945c7ce833c98f4f3bea5990fd46881aa7`

See more details on using hashes here.

Provenance

The following attestation bundles were made for gutenberg2zim-2.2.0-py3-none-any.whl:

Publisher: Publish.yaml on openzim/gutenberg

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: gutenberg2zim-2.2.0-py3-none-any.whl
- Subject digest: d6673f8c3d9f3b1304487debc1ee88c3ee102875cf737b6b98b3f7515fd76ad4
- Sigstore transparency entry: 231069759
- Sigstore integration time: Jun 6, 2025
Source repository:
- Permalink: openzim/gutenberg@9c7f2ef61b6e31087b13d0c3e3fd9ea447c7977a
- Branch / Tag: refs/tags/v2.2.0
- Owner: https://github.com/openzim
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: Publish.yaml@9c7f2ef61b6e31087b13d0c3e3fd9ea447c7977a
- Trigger Event: release
