All-in-one toolbox for asynchronous Tieba crawling and HTML/LLM data processing
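Tieba All-in-One Toolkit (tieba-toolkit)
This project packages an efficient asynchronous Tieba crawler (craw3.4.8.py) and an Apple-style HTML viewer generator (conv_v2.1.py) into an easy-to-install Python library and command-line tool.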
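🚀 1. Core Features
| Module | Description |
|---|---|
| Async crawler | Built on aiotieba; crawls post content and images asynchronously with high concurrency. |
| Resumable crawling | Records a checkpoint automatically, so an interrupted crawl can resume without losing progress. |
| Image download | Downloads every image in a post asynchronously using multiple workers, with retry support. |
| HTML conversion | Converts the crawled raw JSON data into a single-file, Apple-style HTML viewer with asynchronous loading and pagination for offline local browsing. |
| CLI | Provides the tieba-cli command, which manages crawling, conversion, and listing in one place. |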
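The checkpoint and retry behavior described in the table can be illustrated with a minimal sketch. This is not the code in craw3.4.8.py: the names `fetch_with_retry`, `load_checkpoint`, `save_checkpoint`, and `download_all`, the JSON checkpoint format, and the injected `fetch` coroutine are all assumptions made for illustration. Only the standard library is used.

```python
import asyncio
import json
import os


async def fetch_with_retry(fetch, url, retries=3, base_delay=0.01):
    """Await the hypothetical coroutine function fetch(url), retrying on failure."""
    for attempt in range(1, retries + 1):
        try:
            return await fetch(url)
        except Exception:
            if attempt == retries:
                raise  # out of attempts: let the caller decide what to do
            # Exponential backoff between attempts.
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))


def load_checkpoint(path):
    """Return the set of URLs already completed (empty if no checkpoint exists)."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return set(json.load(f))


def save_checkpoint(path, done):
    """Write the checkpoint to a temp file, then swap it in with os.replace."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(sorted(done), f)
    os.replace(tmp, path)


async def download_all(urls, fetch, checkpoint_path, workers=4):
    """Process URLs with several async workers, skipping those in the checkpoint."""
    done = load_checkpoint(checkpoint_path)
    queue = asyncio.Queue()
    for url in urls:
        if url not in done:
            queue.put_nowait(url)

    async def worker():
        while True:
            try:
                url = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # nothing left to do
            try:
                await fetch_with_retry(fetch, url)
            except Exception:
                continue  # give up on this URL; it stays out of the checkpoint
            done.add(url)
            save_checkpoint(checkpoint_path, done)

    await asyncio.gather(*(worker() for _ in range(workers)))
    return done
```

Because the checkpoint is saved after every successful item, a second call to `download_all` with the same checkpoint path only processes URLs that have not yet succeeded.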
⚙️ 2. Installation and Requirements
2.1 Requirements
- Python 3.8+
- Tieba login credential: BDUSS (must be supplied through an environment variable)
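As a rough illustration of the environment-variable requirement, the sketch below reads the credential and fails early with a clear message when it is missing. The variable name `BDUSS` comes from the requirement above. The helper `load_bduss` is hypothetical, and how the toolkit itself reads the value is an assumption.

```python
import os


def load_bduss(env=os.environ, var_name="BDUSS"):
    """Return the BDUSS credential from the environment, or raise if it is unset."""
    value = env.get(var_name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {var_name} is not set; "
            "export it before running the crawler."
        )
    return value
```

Reading the credential from the environment keeps it out of source files and shell history. Set it in the shell session before running the crawler, for example with `export BDUSS=...` on Unix-like systems.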
2.2 Installation
Install tieba-toolkit from PyPI with pip (assuming the package has already been published):
pip install tieba-toolkit
Download files
File details
Details for the file keaixiaojiycw_tieba_post_crawler-0.6.0.tar.gz.
File metadata
- Download URL: keaixiaojiycw_tieba_post_crawler-0.6.0.tar.gz
- Upload date:
- Size: 13.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ef9ae313b4e3acb893ee03e077c45d6d6586936909831f42d79b7db9ebd1ce65 |
| MD5 | 58c2f64857a9bce098d9e5e01266d017 |
| BLAKE2b-256 | b142e4a5e73cb1b8fa843d4dd0b0c5b8577b3c1dc03c6cde08af1663eaf13166 |
File details
Details for the file keaixiaojiycw_tieba_post_crawler-0.6.0-py3-none-any.whl.
File metadata
- Download URL: keaixiaojiycw_tieba_post_crawler-0.6.0-py3-none-any.whl
- Upload date:
- Size: 31.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 20907b5515df0066aad2eaf2612e238527651e727a06923ab20ff6c3e10b31af |
| MD5 | acf867d0ad28273078db9503b4f55af7 |
| BLAKE2b-256 | e64f6e525040e71444f8c3716289c9202d1da4bb25d1ae3667248c8215f8bb0d |
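A downloaded archive can be checked against the published SHA256 digests with a short script. This is a generic sketch built on the standard `hashlib` module. The helper names `sha256_of_file` and `matches_published_hash` are illustrative and not part of the package.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Return True if the file's SHA256 digest equals the published value."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

To use it, pass the local path of the downloaded file and the SHA256 value listed for that file. A result of `False` means the file differs from the one that was uploaded.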