A web crawler to fetch web novel chapters and generate a PDF.

PageWeaver

PageWeaver is a CLI tool that crawls web novels from FreeWebNovel and compiles the chapters into a single PDF document. It uses the Python libraries requests, BeautifulSoup, and pylatex to fetch, process, and typeset the novel content.

Features

  • Fetches novel chapters from FreeWebNovel.
  • Cleans the text by removing non-UTF-8 characters.
  • Generates a PDF document with a title page, table of contents, and chapters.
  • Supports multi-threaded crawling for faster processing.
  • Optionally allows non-English characters in the novel title and author name.

Requirements

  • Python 3.9+
  • requests
  • beautifulsoup4
  • pylatex
  • argparse (part of the Python standard library; no separate install needed)

Installation

Via pip

pip install pageweaver

Via source

git clone https://github.com/KTS-o7/pageweaver.git
cd pageweaver
pip install -r requirements.txt
pip install .

Usage

pageweaver <novel_url> <start_chapter_number> <end_chapter_number> [--output_dir <output_dir>] [--num-workers <num_workers>] [--allow-non-english]

Arguments

  • novel_url: The FreeWebNovel URL of the novel to crawl.
  • start_chapter_number: The number of the first chapter to fetch.
  • end_chapter_number: The number of the last chapter to fetch.
  • --output_dir: (Optional) The destination directory for the generated PDF. Defaults to the current working directory.
  • --num-workers: (Optional) The number of workers to use for crawling. Defaults to 10.
  • --allow-non-english: (Optional) Allow non-English characters in the novel title and author name.
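
The arguments above map naturally onto an argparse parser. The following is a minimal sketch of such a parser, not PageWeaver's actual source: the function name build_parser and the help strings are assumptions, and only the argument names and defaults come from the list above.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented CLI arguments."""
    parser = argparse.ArgumentParser(
        prog="pageweaver",
        description="Crawl web novel chapters and generate a PDF.",
    )
    # Three required positional arguments, in the documented order.
    parser.add_argument("novel_url", help="FreeWebNovel URL of the novel")
    parser.add_argument("start_chapter", type=int, help="first chapter number")
    parser.add_argument("end_chapter", type=int, help="last chapter number")
    # Optional flags with the documented defaults.
    parser.add_argument("--output_dir", default=os.getcwd(),
                        help="destination directory for the PDF")
    parser.add_argument("--num-workers", type=int, default=10,
                        help="number of crawler workers")
    parser.add_argument("--allow-non-english", action="store_true",
                        help="keep non-English characters in title and author")
    return parser


# Parse an explicit argument list instead of sys.argv for illustration.
args = build_parser().parse_args(
    ["https://freewebnovel.com/global-fog-survival.html", "1", "15",
     "--num-workers", "5"]
)
print(args.start_chapter, args.end_chapter, args.num_workers)
```

Note that argparse turns dashes in option names into underscores on the parsed namespace, so --num-workers is read back as args.num_workers.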

Example Usage

pageweaver https://freewebnovel.com/global-fog-survival.html 1 15 --num-workers 5
pageweaver https://freewebnovel.com/global-fog-survival.html 1 30 --output_dir /path/to/output --allow-non-english
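
To illustrate what --num-workers controls, here is a rough sketch of fetching a chapter range with a thread pool. It is an assumption about the general technique, not PageWeaver's code: fetch_chapters and the fetch_one callback are hypothetical names, and a stand-in function replaces the real HTTP request so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def fetch_chapters(
    fetch_one: Callable[[int], str],
    start: int,
    end: int,
    num_workers: int = 10,
) -> Dict[int, str]:
    """Fetch chapters start..end (inclusive) using a pool of worker threads."""
    chapter_numbers = range(start, end + 1)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map yields results in input order, even if the
        # underlying calls finish out of order.
        texts = pool.map(fetch_one, chapter_numbers)
        return dict(zip(chapter_numbers, texts))


def fake_fetch(chapter: int) -> str:
    """Stand-in for a real download; a crawler would issue an HTTP request here."""
    return f"Text of chapter {chapter}"


chapters = fetch_chapters(fake_fetch, 1, 15, num_workers=5)
print(len(chapters))
```

Because downloading is I/O-bound, threads can overlap the time spent waiting on the network; a lower worker count is gentler on the remote site.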

How It Works

  • WebCrawler: Fetches the HTML content of the novel chapters and extracts the text.
  • TextProcessor: Cleans the text by removing non-UTF8 characters and escaping LaTeX special characters.
  • DocumentGenerator: Uses pylatex to create a PDF document with the novel content.
  • NovelCrawlerService: Manages the crawling process, coordinates the fetching and processing of chapters, and generates the final PDF.
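
The TextProcessor step can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the function names and the escape table are hypothetical, and "removing non-UTF8 characters" is interpreted here as dropping byte sequences that do not decode as UTF-8.

```python
# Hypothetical escape table covering the ten LaTeX special characters.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def drop_invalid_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, silently discarding undecodable sequences."""
    return raw.decode("utf-8", errors="ignore")


def escape_latex(text: str) -> str:
    """Escape LaTeX special characters in a single pass.

    Mapping character by character avoids double-escaping, which chained
    str.replace calls can cause (for example, re-escaping the braces that
    an earlier backslash replacement just introduced).
    """
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


raw_chapter = b"He paid 50% of $10 \xff & left_early"
clean = escape_latex(drop_invalid_utf8(raw_chapter))
print(clean)
```

Escaping matters because an unescaped % or & in chapter text would otherwise be treated as LaTeX syntax and break compilation of the PDF.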

Example

To crawl chapters 1 through 2 of the novel "Global Fog Survival" and generate a PDF, run:

pageweaver https://freewebnovel.com/global-fog-survival.html 1 2 --num-workers 10

This will create a PDF document in the current working directory with the title and author extracted from the novel's metadata.

License

This project is licensed under the MIT License.

Contributing

Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.

Contact

For any questions or support, please open an issue on the GitHub repository.

Disclaimer

This tool is not intended to promote piracy. It should be used for educational or personal reading purposes only. Please respect the copyrights of the original authors and publishers.
