Chinese scraping utilities — date parsing, city extraction, SHA256 ID, UA pool, rate limiter, DeepSeek client, web search, hot topic scrapers, LLM extraction pipeline
chinese-scraper-utils
Shared Python utilities for Chinese-language web scraping — date parsing, city extraction, stable ID generation, UA rotation, rate limiting, and a DeepSeek API client. Extracted from ComiRadar and weekly-cli, where these functions had diverged across two codebases.
从 ComiRadar 与 weekly-cli 中提取的共享 Python 工具集:中文日期解析、城市提取、稳定 ID 生成、UA 池、速率限制、DeepSeek 客户端。
Installation / 安装
```bash
pip install chinese-scraper-utils
```
Usage / 使用示例
Stable ID / 稳定 ID
```python
from chinese_scraper_utils import stable_id

uid = stable_id("北京国际动漫展", "北京", "2026-05-04")
# => "3a8f1c9e2d4b6a05" (SHA256 hex prefix, deterministic across restarts)
```
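To illustrate why such an ID stays the same across restarts, here is a minimal sketch of one way to derive a short ID from a SHA256 digest. The function name `stable_id_sketch`, the separator character, and the 16-character length are assumptions made for this example; the library's actual normalization and output may differ.

```python
import hashlib


def stable_id_sketch(*parts: str, length: int = 16) -> str:
    """Hypothetical stand-in for stable_id: hash the joined parts, keep a hex prefix."""
    # Strip surrounding whitespace so incidental padding does not change the ID.
    normalized = [part.strip() for part in parts]
    # Join with a control character so ("ab", "c") and ("a", "bc") hash differently.
    joined = "\x1f".join(normalized)
    digest = hashlib.sha256(joined.encode("utf-8")).hexdigest()
    return digest[:length]


uid = stable_id_sketch("北京国际动漫展", "北京", "2026-05-04")
```

Because the digest depends only on the input strings, the same event name, city, and date always map to the same ID, unlike Python's built-in `hash()`, which is salted per process for strings.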
Date Parsing / 日期解析
```python
from chinese_scraper_utils import parse_date, extract_date

# Structured date parsing / 结构化日期解析
parse_date("2026-05-04")           # => "2026-05-04"
parse_date("2026/05/04 14:30:00")  # => "2026-05-04"

# Chinese text date extraction / 中文文本日期提取
extract_date("5月4日上海有漫展")    # => "2026-05-04"
extract_date("2026年5月4日-6日")   # => "2026-05-04"
```
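The extraction examples above suggest that a missing year is filled in and that only the start of a date range is returned. A rough sketch of that behavior with a regular expression follows. The name `extract_date_sketch` and the `default_year` parameter are hypothetical, and falling back to the current year is an assumption, not a documented rule of the library.

```python
import re
from datetime import date
from typing import Optional

# Optional "YYYY年" prefix, then "M月D", optionally followed by "日" or "号".
_CN_DATE = re.compile(r"(?:(\d{4})年)?(\d{1,2})月(\d{1,2})[日号]?")


def extract_date_sketch(text: str, default_year: Optional[int] = None) -> Optional[str]:
    """Hypothetical sketch: return the first Chinese-style date in text as YYYY-MM-DD."""
    match = _CN_DATE.search(text)
    if match is None:
        return None
    if match.group(1):
        year = int(match.group(1))
    else:
        # Assumption: fall back to the caller's year, or the current year.
        year = default_year if default_year is not None else date.today().year
    try:
        parsed = date(year, int(match.group(2)), int(match.group(3)))
    except ValueError:
        # Matched digits that do not form a real calendar date, e.g. "13月40日".
        return None
    return parsed.isoformat()
```

Passing `default_year` explicitly keeps results reproducible in tests, since a fallback to today's year changes over time.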
City Extraction & Normalization / 城市提取与规范化
```python
from chinese_scraper_utils import extract_city, normalize_city, CITIES

extract_city("活动在上海举办")  # => "上海"
extract_city("广州天河区")      # => "广州"
normalize_city("上海市")        # => "上海"
normalize_city(" 深圳市 ")      # => "深圳"
```
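As an illustration of the two operations shown above, the following sketch normalizes by trimming whitespace and a trailing "市", and extracts by substring matching against a city list. `SAMPLE_CITIES` is a tiny hypothetical subset and both function names are made up for this example; the library's matching rules are not documented here and may be more involved.

```python
from typing import Optional

# Illustrative subset only; the library's CITIES list is described as having 52 entries.
SAMPLE_CITIES = ["北京", "上海", "广州", "深圳", "成都", "杭州"]


def normalize_city_sketch(city: str) -> str:
    """Hypothetical sketch: trim whitespace and a trailing "市" suffix."""
    city = city.strip()
    if city.endswith("市") and len(city) > 1:
        city = city[:-1]
    return city


def extract_city_sketch(text: str) -> Optional[str]:
    """Hypothetical sketch: return the known city that appears earliest in text."""
    hits = [(text.find(name), name) for name in SAMPLE_CITIES if name in text]
    return min(hits)[1] if hits else None
```

Choosing the earliest match rather than the first list entry keeps the result independent of list order when a text mentions more than one city.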
Category Guessing / 类别猜测
```python
from chinese_scraper_utils import guess_category

guess_category("五一漫展嘉年华")  # => "漫展"
guess_category("初音未来演唱会")  # => "演唱会"
guess_category("清明上河图展览")  # => "展览"
```
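One plausible way to get results like these is an ordered keyword lookup over the title. The table `SAMPLE_KEYWORDS`, the function name, and the `None` fallback below are assumptions for illustration; they are not the contents of `CATEGORY_ALIASES` or the library's actual logic.

```python
from typing import List, Optional, Tuple

# Hypothetical (keyword, category) pairs, checked in order; earlier entries win.
SAMPLE_KEYWORDS: List[Tuple[str, str]] = [
    ("演唱会", "演唱会"),
    ("漫展", "漫展"),
    ("同人展", "漫展"),   # alias mapped onto a canonical category
    ("展览", "展览"),
    ("画展", "展览"),
]


def guess_category_sketch(title: str) -> Optional[str]:
    """Hypothetical sketch: return the category of the first keyword found in title."""
    for keyword, category in SAMPLE_KEYWORDS:
        if keyword in title:
            return category
    return None
```

Ordering matters when keywords overlap: more specific terms should come before generic ones so that a title is not claimed by a broad category first.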
Random User-Agent / 随机 UA
```python
from chinese_scraper_utils import random_ua, UA_POOL

random_ua()  # => "Mozilla/5.0 (Windows NT 10.0; ..."
```
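For context, a randomly chosen User-Agent is typically placed into the request headers. The sketch below shows that pattern with a small hypothetical pool; `SAMPLE_UA_POOL`, `random_headers_sketch`, and the `Accept-Language` value are assumptions for this example and are not taken from the library.

```python
import random
from typing import Dict, Optional

# Hypothetical pool for illustration; the library ships its own UA_POOL.
SAMPLE_UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]


def random_headers_sketch(rng: Optional[random.Random] = None) -> Dict[str, str]:
    """Hypothetical sketch: build request headers with a randomly chosen User-Agent."""
    # Accept an explicit Random instance so tests can seed the choice.
    chooser = rng if rng is not None else random
    return {
        "User-Agent": chooser.choice(SAMPLE_UA_POOL),
        "Accept-Language": "zh-CN,zh;q=0.9",
    }
```

The resulting dictionary can be passed as the `headers` argument of an HTTP client request.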
Async Rate Limiter / 异步速率限制
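```python
import asyncio

import httpx
from chinese_scraper_utils import RateLimiter

# Allow at most one request per second.
limiter = RateLimiter(min_interval=1.0)

async def fetch():
    async with httpx.AsyncClient() as client:
        # The lambda defers the request so the limiter can pace and retry it.
        resp = await limiter.fetch_with_retry(
            lambda: client.get("https://example.com")
        )
        return resp.text

asyncio.run(fetch())
```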
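To make the idea of a minimum-interval limiter with retries concrete, here is a self-contained sketch. It is not the library's `RateLimiter`: the class name `MinIntervalLimiter`, the `max_retries` and `backoff` parameters, and the exponential backoff policy are all assumptions chosen for illustration.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Optional


class MinIntervalLimiter:
    """Hypothetical sketch: space calls at least min_interval apart and retry failures."""

    def __init__(self, min_interval: float = 1.0, max_retries: int = 3, backoff: float = 0.5):
        if max_retries < 1:
            raise ValueError("max_retries must be at least 1")
        self.min_interval = min_interval
        self.max_retries = max_retries
        self.backoff = backoff
        self._lock = asyncio.Lock()
        self._last_call: Optional[float] = None

    async def _wait_turn(self) -> None:
        # The lock serializes callers so that concurrent tasks are spaced out too.
        async with self._lock:
            if self._last_call is not None:
                delay = self._last_call + self.min_interval - time.monotonic()
                if delay > 0:
                    await asyncio.sleep(delay)
            self._last_call = time.monotonic()

    async def fetch_with_retry(self, make_request: Callable[[], Awaitable[Any]]) -> Any:
        last_error: Optional[BaseException] = None
        for attempt in range(self.max_retries):
            await self._wait_turn()
            try:
                # make_request is called anew on each attempt to get a fresh awaitable.
                return await make_request()
            except Exception as exc:
                last_error = exc
                if attempt < self.max_retries - 1:
                    # Exponential backoff: backoff, 2 * backoff, 4 * backoff, ...
                    await asyncio.sleep(self.backoff * 2 ** attempt)
        assert last_error is not None
        raise last_error
```

Taking a zero-argument callable rather than a ready-made coroutine matters here, because a coroutine object can only be awaited once and a retry needs a new one. Creating the limiter inside a running coroutine avoids event-loop binding issues with `asyncio.Lock` on older Python versions.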
DeepSeek AI Client / DeepSeek AI 客户端
```python
from chinese_scraper_utils import DeepSeekClient

client = DeepSeekClient(api_key="sk-xxx")
result = client.chat_json([
    {"role": "user", "content": "提取活动信息:北京五一漫展"}
])
# => {"name": "...", "date": "...", "city": "..."}
```
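A method like `chat_json` has to turn the model's text reply into a Python object. Chat models sometimes wrap JSON in a Markdown code fence, so a tolerant parser is a common precaution. The helper below is a hypothetical sketch of that step only; it makes no network calls and is not a description of how `DeepSeekClient` is implemented.

```python
import json
from typing import Any

# Built programmatically so this example does not contain a literal fence marker.
FENCE = "`" * 3


def parse_json_reply(reply: str) -> Any:
    """Hypothetical sketch: parse a model reply as JSON, tolerating a code fence."""
    text = reply.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line, which may carry a language tag such as "json".
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # Drop the closing fence if one is present.
        stripped = text.rstrip()
        if stripped.endswith(FENCE):
            text = stripped[: -len(FENCE)]
    # json.loads raises json.JSONDecodeError if the remainder is not valid JSON.
    return json.loads(text)
```

Callers would normally catch `json.JSONDecodeError` and decide whether to retry the request or skip the item.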
API Reference / API 参考
| Export / 导出项 | Type / 类型 | Description / 描述 |
|---|---|---|
| `stable_id(*parts)` | `str → str` | Deterministic SHA256 short ID / 确定性 SHA256 短 ID |
| `parse_date(s)` | `str → str` | Structured date parsing / 结构化日期解析 |
| `extract_date(text)` | `str → str` | Chinese text date extraction / 中文文本日期提取 |
| `CITIES` | `list[str]` | 52 major Chinese cities / 52 个主要中国城市 |
| `extract_city(text)` | `str → str` | Chinese city name extraction / 城市名提取 |
| `normalize_city(city)` | `str → str` | City name normalization (strip suffix) / 城市名规范化 |
| `CATEGORY_ALIASES` | `dict[str, str]` | Category alias mapping / 类别别名映射 |
| `guess_category(title)` | `str → str` | Category guessing from title / 根据标题猜测类别 |
| `UA_POOL` | `list[str]` | User-Agent pool / User-Agent 池 |
| `random_ua()` | `→ str` | Random UA selection / 随机返回 UA |
| `RateLimiter` | `class` | Async rate limiter with retry / 异步速率限制器 |
| `DeepSeekClient` | `class` | DeepSeek API wrapper / DeepSeek API 封装客户端 |
Related / 相关项目
- ComiRadar — Anime event scraper using this library
- weekly-hotspot — Weekly hot topics analysis using this library
License / 许可证
Download files
File details
Details for the file chinese_scraper_utils-0.2.0.tar.gz.
File metadata
- Download URL: chinese_scraper_utils-0.2.0.tar.gz
- Upload date:
- Size: 30.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `736da65a1c9599539e67ea963dac99f87c478c1d9110285e4965767c865cea79` |
| MD5 | `bc23eb1872729d16dcc55c13a3a37b48` |
| BLAKE2b-256 | `585d1917e8f1293f82bd5f5b3a594a3b0b7dc5a82b502b7d896050af76d57ca5` |
File details
Details for the file chinese_scraper_utils-0.2.0-py3-none-any.whl.
File metadata
- Download URL: chinese_scraper_utils-0.2.0-py3-none-any.whl
- Upload date:
- Size: 26.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `d655dbec5399d579a523c6c04aeb661064d5591c92f678b0dda44415cc23c20a` |
| MD5 | `738188e6877c6c12b886ae0f78e977e2` |
| BLAKE2b-256 | `dcdfac7cf00fb154431d0d4cce3707d0847fd334dda4977d85d39db57c23c62d` |