EpubCrawler: a small tool for crawling web pages and building EPUB files

epub-crawler

A small tool for crawling web pages and building EPUB files.

Installation

Via pip (recommended):

pip install EpubCrawler

From source:

pip install git+https://github.com/apachecn/epub-crawler

Usage

crawl-epub [CONFIG]

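CONFIG: a configuration file in JSON format; defaults to config.json in the current working directory.

The configuration file contains the following properties: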

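name: String

The book title stored in the metadata; also used as the name of the file saved in the current working directory.
url: String (either this or list)

URL of the table-of-contents page.
link: String (required if url is non-empty)

Selector for the link (<a>) elements.
list: [String] (either this or url)

List of pages to crawl. If this list is non-empty, these pages are crawled.

⚠ This option overrides url, link, and external ⚠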
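title: String (optional)

Selector for the title on article pages (defaults to title).
content: String (optional)

Selector for the content on article pages; if empty, the content is detected automatically.
remove: String (optional)

Selector for elements to remove from article pages.
credit: Boolean (optional)

Whether to show a link to the original article.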
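headers: {String: String} (optional)

HTTP request headers; defaults to {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36"}
retry: Integer (optional)

Number of retries for HTTP requests; defaults to 10.
wait: Float (optional)

Interval between two requests, in seconds; defaults to 0.
timeout: Integer (optional)

Sets both the connection and read timeouts for HTTP requests, in seconds.

⚠ Overrides connTimeout and readTimeout ⚠
connTimeout: Integer (optional)

Connection timeout for HTTP requests, in seconds; defaults to 1.
readTimeout: Integer (optional)

Read timeout for HTTP requests, in seconds; defaults to 60.
encoding: String (optional)

Page encoding; defaults to UTF-8.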
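optiMode: String (optional)

Image processing mode. 'none' disables processing; for other values, see the modes supported by imgyaso. Defaults to 'quant'.
colors: Integer (optional)

The colors parameter passed to imgyaso; defaults to 8.
imgSrc: [String] (optional)

Attributes that may hold the image source; defaults to ["data-src", "data-original-src", "src"].
proxy: String (optional)

Proxy to use, in the form <protocol>://<host>:<port>.
checkStatus: Boolean (optional)

Whether to check the HTTP status code. If true and the status code is not 2XX, the request is treated as a failure. Defaults to false.
textThreads: Integer (optional)

Number of threads used to crawl text; defaults to 5.
imgThreads: Integer (optional)

Number of threads used to crawl images; defaults to 5.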
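external: String (optional)

Path to an external script. The script may define get_toc and get_article functions to customize how the table of contents and the article body are obtained.

get_toc(html: string, url: string): [string]

Takes the page HTML and URL and returns the table-of-contents list.

get_article(html: string, url: string): {'title': string, 'content': string}

Takes the page HTML and URL and returns a dict whose title key holds the title and whose content key holds the body.

⚠ This option overrides link, title, and content, but not list ⚠
sizeLimit: String (optional)

EPUB size limit, written as a number followed by a letter unit; defaults to 100m.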
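
The external option above only gives the two function signatures, so here is a minimal sketch of what such a script might look like. It uses only the Python standard library. The file name external.py, the helper classes, the choice of the first <h1> as the title, and the decision to resolve relative links into absolute URLs are all assumptions made for illustration; they are not taken from the project itself.

```python
# external.py (hypothetical file name, referenced by the "external" option)
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


class _TitleCollector(HTMLParser):
    """Collects the text inside the first <h1> element."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and not self._done:
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self._in_h1 = False
            self._done = True

    def handle_data(self, data):
        if self._in_h1:
            self.parts.append(data)


def get_toc(html, url):
    """Return the links found on the table-of-contents page."""
    parser = _LinkCollector()
    parser.feed(html)
    parser.close()
    # Assumption: entries are page URLs, resolved against the TOC page URL.
    return [urljoin(url, href) for href in parser.hrefs]


def get_article(html, url):
    """Return a dict with the 'title' and 'content' keys described above."""
    parser = _TitleCollector()
    parser.feed(html)
    parser.close()
    # Fall back to the page URL when no <h1> is present.
    title = "".join(parser.parts).strip() or url
    # Assumption: the whole page is passed through unchanged as the content.
    return {"title": title, "content": html}
```

A real script would usually narrow get_toc to the links inside the actual table of contents and trim get_article's content to the article body; the sketch keeps both functions deliberately naive so the expected input and output shapes stay visible.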

Example configuration for crawling our PyTorch 1.4 documentation:

{
    "name": "PyTorch 1.4 中文文档 & 教程",
    "url": "https://gitee.com/apachecn/pytorch-doc-zh/blob/master/docs/1.4/SUMMARY.md",
    "link": ".markdown-body li a",
    "remove": "a.anchor",
    "headers": {"Referer": "https://gitee.com/"}
}
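
When the configuration is produced by another program, for instance to crawl a fixed list of pages, it can be convenient to generate config.json from Python. The sketch below is one possible helper, not part of EpubCrawler: build_config and write_config are hypothetical names, and the checks simply restate the documented rules (url or list must be given, link is required with url, and list overrides url and link).

```python
import json
from pathlib import Path


def build_config(name, url=None, link=None, pages=None, **options):
    """Assemble a config dict that follows the documented property rules."""
    if pages:
        # A non-empty list overrides url and link, so they are left out.
        config = {"name": name, "list": list(pages)}
    elif url:
        if not link:
            raise ValueError("link is required when url is set")
        config = {"name": name, "url": url, "link": link}
    else:
        raise ValueError("either url or pages must be given")
    # Remaining keyword arguments (remove, headers, wait, ...) are copied as-is.
    config.update(options)
    return config


def write_config(config, path="config.json"):
    """Write the config as UTF-8 JSON, keeping non-ASCII text readable."""
    text = json.dumps(config, ensure_ascii=False, indent=4)
    Path(path).write_text(text, encoding="utf-8")
```

Calling build_config with the name, url, link, remove, and headers values from the example above yields the same JSON object; after write_config saves it, crawl-epub can be run in that directory without arguments, since CONFIG defaults to config.json.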

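License

This project is released under the SATA license.

You are obliged to star this open-source project and to consider giving the author an appropriate additional reward.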
