Chinese scraping utilities — date parsing, city extraction, SHA256 ID, UA pool, rate limiter, DeepSeek client, web search, hot topic scrapers, LLM extraction pipeline
chinese-scraper-utils
Shared Python utilities for Chinese-language web scraping — date parsing, city extraction, stable ID generation, UA rotation, rate limiting, web search, hot topic scrapers, LLM-powered event extraction, and a DeepSeek API client.
Extracted from ComiRadar and weekly-hotspot.
Installation / 安装
# Core (zero extra dependencies beyond httpx + openai)
pip install chinese-scraper-utils
# With web search support
pip install "chinese-scraper-utils[search]"
Quick Start / 快速开始
from chinese_scraper_utils import (
    # Core utilities
    parse_date, extract_date, extract_city, guess_category, stable_id, random_ua,
    # Web search
    search_web,
    # Hot topic scrapers
    scrape_weibo_hot, scrape_zhihu_hot,
    # LLM extraction
    DeepSeekClient, EventExtractor,
)
# Parse Chinese dates
parse_date("2026/05/20") # "2026-05-20"
extract_date("5月4日上海有漫展") # "2026-05-04"
# Extract cities (false-positive protected)
extract_city("活动在上海举办") # "上海"
extract_city("西安路有个活动") # "" (not a city!)
# Scrape hot topics
weibo = scrape_weibo_hot() # list[HotTopic]
zhihu = scrape_zhihu_hot() # list[HotTopic]
# LLM-powered event extraction
client = DeepSeekClient(api_key="sk-xxx")
extractor = EventExtractor(
    client=client,
    event_types=["漫展", "同人展", "演唱会"],
    min_confidence=0.5,
)
events = extractor.extract(["五一北京漫展嘉年华在国家会议中心..."])
# CLI usage
# python -m chinese_scraper_utils search "五一漫展"
# python -m chinese_scraper_utils scrape-weibo
# python -m chinese_scraper_utils extract posts.json -t "漫展,演唱会"
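The `西安路` case in the Quick Start suggests that a city name directly followed by a street suffix is treated as part of an address rather than a location. The sketch below shows one way such a guard could work. The city subset, the suffix list, and the name `extract_city_sketch` are assumptions made for illustration; this is not the library's implementation.

```python
# Illustrative sketch only; not the code used by chinese-scraper-utils.
SAMPLE_CITIES = ["上海", "北京", "西安", "广州"]  # assumed subset of a city list
ROAD_SUFFIXES = ("路", "街", "大道", "巷")  # assumed street-name suffixes


def extract_city_sketch(text: str) -> str:
    """Return the earliest city in text that is not part of a street name."""
    best_pos, best_city = len(text), ""
    for city in SAMPLE_CITIES:
        start = text.find(city)
        while start != -1:
            tail = text[start + len(city):]
            # Accept the match only if no street suffix follows it.
            if not tail.startswith(ROAD_SUFFIXES):
                if start < best_pos:
                    best_pos, best_city = start, city
                break
            # Otherwise look for a later occurrence of the same city.
            start = text.find(city, start + 1)
    return best_city
```

With this guard, `extract_city_sketch("西安路有个活动")` returns an empty string, while a later genuine mention in the same text would still be found.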
API Reference / API 参考
| Export / 导出项 | Type | Description |
|---|---|---|
| `parse_date(s)` | `str → str` | Structured date parsing |
| `try_parse_date(s)` | `str → str \| None` | Same, returns `None` on failure |
| `extract_date(text)` | `str → str` | Chinese text date extraction |
| `CITIES` | `list[str]` | 50 major Chinese cities |
| `extract_city(text, extra_cities=None)` | `str → str` | City name extraction (false-positive safe) |
| `normalize_city(city)` | `str → str` | City name normalization |
| `CATEGORY_ALIASES` | `dict[str, str]` | Category alias mapping |
| `guess_category(title)` | `str → str` | Event category guessing (longest-match) |
| `UA_POOL` | `list[str]` | 21 modern User-Agent strings |
| `random_ua()` | `→ str` | Random UA selection |
| `stable_id(*parts)` | `str → str` | Deterministic SHA256 short ID |
| `RateLimiter` | class | Async rate limiter with retry + jitter |
| `DeepSeekClient` | class | DeepSeek API client (sync/async, retry) |
| NEW `SearchResult` | dataclass | Web search result (title/url/snippet) |
| NEW `search_web(query, n)` | `→ list[SearchResult]` | DuckDuckGo web search |
| NEW `HotTopic` | dataclass | Unified hot topic (title/summary/url/source) |
| NEW `scrape_weibo_hot()` | `→ list[HotTopic]` | Weibo hot search |
| NEW `scrape_zhihu_hot()` | `→ list[HotTopic]` | Zhihu hot list |
| NEW `scrape_hackernews_top()` | `→ list[HotTopic]` | HN top stories |
| NEW `ExtractedEvent` | dataclass | Structured event (with source tracing) |
| NEW `EventExtractor` | class | 5-stage LLM extraction pipeline + cache |
| NEW `extract_events(texts, ...)` | `→ list[ExtractedEvent]` | Convenience extractor |
| `ScraperError` / `RateLimitError` / etc. | Exception classes | Typed error hierarchy |
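
To illustrate what a "deterministic SHA256 short ID" can look like, here is a minimal sketch built on the standard-library `hashlib`. The function name `stable_id_sketch`, the 16-character length, the unit-separator join, and the whitespace stripping are all assumptions; the real `stable_id` may differ on every one of these points.

```python
import hashlib


def stable_id_sketch(*parts: str, length: int = 16) -> str:
    """Hash the parts with SHA-256 and keep a short hex prefix (assumed design)."""
    # Join with a control character that is unlikely to appear in the parts,
    # so that ("ab", "c") and ("a", "bc") hash to different values.
    joined = "\x1f".join(part.strip() for part in parts)
    digest = hashlib.sha256(joined.encode("utf-8")).hexdigest()
    return digest[:length]
```

Because the ID depends only on the input strings, re-scraping the same event (for example the same title, date, and city) yields the same ID, which makes deduplication across runs straightforward.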
Full API docs: API_REFERENCE.md
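
The table describes `RateLimiter` only as an "async rate limiter with retry + jitter". As a rough picture of that pattern, the sketch below spaces calls by a minimum interval and retries failures with exponential backoff and full jitter. The class name `RateLimiterSketch`, its constructor parameters, and the `run` method are hypothetical and do not reflect the library's actual interface; see API_REFERENCE.md for that.

```python
import asyncio
import random
import time


class RateLimiterSketch:
    """Space calls at least min_interval seconds apart and retry failures."""

    def __init__(self, min_interval: float = 1.0, max_retries: int = 3,
                 base_delay: float = 0.5) -> None:
        self.min_interval = min_interval
        self.max_retries = max_retries
        self.base_delay = base_delay
        self._lock = asyncio.Lock()
        self._last_call = float("-inf")  # no call has happened yet

    async def _wait_turn(self) -> None:
        # The lock serializes callers so each one observes the latest timestamp.
        async with self._lock:
            wait = self._last_call + self.min_interval - time.monotonic()
            if wait > 0:
                await asyncio.sleep(wait)
            self._last_call = time.monotonic()

    async def run(self, func, *args, **kwargs):
        """Await func(*args, **kwargs), retrying with backoff plus jitter."""
        for attempt in range(self.max_retries + 1):
            await self._wait_turn()
            try:
                return await func(*args, **kwargs)
            except Exception:
                if attempt == self.max_retries:
                    raise
                # Exponential backoff with full jitter: sleep a random
                # duration between 0 and base_delay * 2**attempt.
                delay = self.base_delay * (2 ** attempt)
                await asyncio.sleep(random.uniform(0, delay))
```

Jitter spreads retries out in time so that several scrapers failing together do not all hit the target site again at the same instant. Create the limiter inside a running event loop (for example within the coroutine passed to `asyncio.run`) to stay compatible with older Python versions.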
CLI / 命令行
python -m chinese_scraper_utils search "五一北京漫展" -n 10
python -m chinese_scraper_utils scrape-weibo
python -m chinese_scraper_utils scrape-zhihu
python -m chinese_scraper_utils scrape-hn
python -m chinese_scraper_utils extract posts.json -t "漫展,演唱会" -c 0.5 -v
Related / 相关项目
- ComiRadar — Anime event scraper using this library
- weekly-hotspot — Weekly hot topics analysis
License / 许可证