A web crawler to fetch web novel chapters and generate a PDF.
PageWeaver
This project is a CLI tool that crawls web novels from FreeWebNovel and generates a PDF document containing the chapters. It uses the Python libraries requests, BeautifulSoup, and pylatex to fetch, process, and compile the novel content into a well-formatted PDF.
Features
- Fetches novel chapters from FreeWebNovel.
- Processes and cleans the text to remove non-UTF8 characters.
- Generates a PDF document with a title page, table of contents, and chapters.
- Supports multi-threaded crawling for faster processing.
- Option to allow non-English characters in the novel title and author name.
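The multi-threaded crawling feature can be pictured with a small sketch built on Python's standard `concurrent.futures` module. This is only an illustration, not the project's actual code: `fetch_chapter` and `crawl_chapters` are hypothetical names, and the fetch step is a placeholder that returns dummy text instead of making a network request.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_chapter(chapter_number: int) -> str:
    """Hypothetical stand-in for a network fetch; returns placeholder text."""
    return f"Text of chapter {chapter_number}"


def crawl_chapters(start: int, end: int, num_workers: int = 10) -> dict:
    """Fetch chapters start..end (inclusive) concurrently, keyed by chapter number."""
    numbers = range(start, end + 1)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map yields results in input order, so chapters stay in sequence
        # even though the individual fetches may finish out of order.
        texts = list(pool.map(fetch_chapter, numbers))
    return dict(zip(numbers, texts))


if __name__ == "__main__":
    chapters = crawl_chapters(1, 3, num_workers=2)
    for number, text in chapters.items():
        print(number, text)
```

Threads suit this workload because chapter downloads are I/O-bound; the worker count plays the same role as the `--num-workers` option.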
Requirements
- Python 3.9+
- requests
- beautifulsoup4
- pylatex
- argparse (included in the Python standard library)
Installation
Via pip
pip install pageweaver
Via source
git clone https://github.com/KTS-o7/pageweaver.git
cd pageweaver
pip install -r requirements.txt
python setup.py install
Usage
pageweaver <novel_url> <start_chapter_number> <end_chapter_number> [--output_dir <output_dir>] [--num-workers <num_workers>] [--allow-non-english]
Arguments
- novel_url: The FreeWebNovel URL of the novel to crawl.
- start_chapter: The starting chapter number.
- end_chapter: The ending chapter number.
- --output_dir: (Optional) The destination directory for the generated PDF. Defaults to the current working directory.
- --num-workers: (Optional) The number of workers to use for crawling. Defaults to 10.
- --allow-non-english: (Optional) Allow non-English characters in the novel title and author name.
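For readers who want to see how such an interface maps onto `argparse`, here is a minimal sketch reconstructed from the usage line and argument descriptions above. It is an assumption about how the options could be declared, not a copy of the tool's own parser; the function name `build_parser` is hypothetical.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented arguments (illustrative only)."""
    parser = argparse.ArgumentParser(prog="pageweaver")
    parser.add_argument("novel_url", help="FreeWebNovel URL of the novel to crawl")
    parser.add_argument("start_chapter", type=int, help="starting chapter number")
    parser.add_argument("end_chapter", type=int, help="ending chapter number")
    # Documented default: the current working directory.
    parser.add_argument("--output_dir", default=os.getcwd(),
                        help="destination directory for the generated PDF")
    # Documented default: 10 workers.
    parser.add_argument("--num-workers", type=int, default=10,
                        help="number of workers to use for crawling")
    parser.add_argument("--allow-non-english", action="store_true",
                        help="allow non-English characters in title and author")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Note that `argparse` converts dashes in option names to underscores, so `--num-workers` is read back as `args.num_workers` and `--allow-non-english` as `args.allow_non_english`.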
Example Usage
pageweaver https://freewebnovel.com/global-fog-survival.html 1 15 --num-workers 5
pageweaver https://freewebnovel.com/global-fog-survival.html 1 30 --output_dir /path/to/output --allow-non-english
How It Works
- WebCrawler: Fetches the HTML content of the novel chapters and extracts the text.
- TextProcessor: Cleans the text by removing non-UTF8 characters and escaping LaTeX special characters.
- DocumentGenerator: Uses pylatex to create a PDF document with the novel content.
- NovelCrawlerService: Manages the crawling process, coordinates the fetching and processing of chapters, and generates the final PDF.
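The TextProcessor step described above can be sketched as follows. This is a hedged illustration using only the standard library: the function name `clean_text` and the exact escape table are assumptions, and the project's real implementation may handle more cases or rely on a library helper instead.

```python
import re

# LaTeX special characters and one common way to escape each of them.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}
_SPECIALS_PATTERN = re.compile("|".join(re.escape(ch) for ch in LATEX_SPECIALS))


def clean_text(text: str) -> str:
    """Drop characters that cannot be encoded as UTF-8, then escape LaTeX specials."""
    # Lone surrogates and similar code points fail UTF-8 encoding; ignore them.
    utf8_safe = text.encode("utf-8", errors="ignore").decode("utf-8")
    # A single regex pass avoids re-escaping the braces and backslashes
    # that the replacements themselves introduce.
    return _SPECIALS_PATTERN.sub(lambda m: LATEX_SPECIALS[m.group(0)], utf8_safe)


if __name__ == "__main__":
    print(clean_text("50% of $10 & more_text"))
```

Escaping in one pass matters: replacing characters one at a time would turn the `{}` in `\textbackslash{}` into `\{\}` on a later step.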
Example
To crawl the novel "Global Fog Survival" from chapters 1 to 2 and generate a PDF, run:
pageweaver https://freewebnovel.com/global-fog-survival.html 1 2 --num-workers 10
This will create a PDF document in the current working directory with the title and author extracted from the novel's metadata.
License
This project is licensed under the MIT License.
Contributing
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
Contact
For any questions or support, please open an issue on the GitHub repository.
Disclaimer
This tool is not intended to promote piracy. It should be used for educational or personal reading purposes only. Please respect the copyrights of the original authors and publishers.
Authors
Star Graph
File details
Details for the file pageweaver-1.1.0.tar.gz.
File metadata
- Download URL: pageweaver-1.1.0.tar.gz
- Upload date:
- Size: 10.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.9.20
File hashes
Algorithm | Hash digest
---|---
SHA256 | 2bd6d8c6693b93d20c69543483420b6219fdae7ab0684becdde36a0b7c45cfa5
MD5 | a8c0e3f9bf608eb213d67b64c51a527b
BLAKE2b-256 | b45527f97a464cc9e651e048bd6dbc0dd45b81a980e2f3db0a8b85a7b512027a
File details
Details for the file pageweaver-1.1.0-py3-none-any.whl.
File metadata
- Download URL: pageweaver-1.1.0-py3-none-any.whl
- Upload date:
- Size: 10.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.9.20
File hashes
Algorithm | Hash digest
---|---
SHA256 | bca6a4c04b6f2d2448d26384d4f17025e2b5ccf745afb0f6435891862fa18d35
MD5 | 9579af5243d02d7805f1157ab5babe3d
BLAKE2b-256 | 41e51168b1a58e943f1518f632b1e776818caf10420cb2814d2722a79c087d5d