
Chinese scraping utilities — date parsing, city extraction, SHA256 ID, UA pool, rate limiter, DeepSeek client, web search, hot topic scrapers, LLM extraction pipeline


chinese-scraper-utils

Python 3.11+ | MIT License

Shared Python utilities for Chinese-language web scraping — date parsing, city extraction, stable ID generation, UA rotation, rate limiting, web search, hot topic scrapers, LLM-powered event extraction, and a DeepSeek API client.

Extracted from ComiRadar and weekly-hotspot.


Installation

# Core (zero extra dependencies beyond httpx + openai)
pip install chinese-scraper-utils

# With web search support
pip install chinese-scraper-utils[search]

Quick Start

from chinese_scraper_utils import (
    # Core utilities
    parse_date, extract_date, extract_city, guess_category, stable_id, random_ua,
    # Web search
    search_web,
    # Hot topic scrapers
    scrape_weibo_hot, scrape_zhihu_hot,
    # LLM extraction
    DeepSeekClient, EventExtractor,
)

# Parse Chinese dates
parse_date("2026/05/20")         # "2026-05-20"
extract_date("5月4日上海有漫展")   # "2026-05-04"

# Extract cities (false-positive protected)
extract_city("活动在上海举办")     # "上海"
extract_city("西安路有个活动")     # "" (not a city!)

# Scrape hot topics
weibo = scrape_weibo_hot()       # list[HotTopic]
zhihu = scrape_zhihu_hot()       # list[HotTopic]

# LLM-powered event extraction
client = DeepSeekClient(api_key="sk-xxx")
extractor = EventExtractor(
    client=client,
    event_types=["漫展", "同人展", "演唱会"],
    min_confidence=0.5,
)
events = extractor.extract(["五一北京漫展嘉年华在国家会议中心..."])

# CLI usage
# python -m chinese_scraper_utils search "五一漫展"
# python -m chinese_scraper_utils scrape-weibo
# python -m chinese_scraper_utils extract posts.json -t "漫展,演唱会"
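The `西安路` example above suggests that a plain substring match is not enough for city extraction. One plausible way to get that behavior is to reject a match when the city name is immediately followed by a street-type suffix. The sketch below is an illustration under that assumption, not the library's actual implementation: `SKETCH_CITIES`, `STREET_SUFFIXES`, and `extract_city_sketch` are hypothetical names, and the suffix list is a guess.

```python
# Hypothetical sketch: NOT the library's implementation of extract_city.
SKETCH_CITIES = ["上海", "北京", "西安", "南京"]  # tiny illustrative subset
STREET_SUFFIXES = ("路", "街", "大道", "巷")      # assumed false-positive markers


def extract_city_sketch(text: str) -> str:
    """Return the earliest city in text that is not part of a street name, else ""."""
    best_pos, best_city = len(text), ""
    for city in SKETCH_CITIES:
        start = text.find(city)
        while start != -1:
            rest = text[start + len(city):]
            # "西安路" is a road name, so a match followed by a street suffix is skipped.
            if not rest.startswith(STREET_SUFFIXES):
                if start < best_pos:
                    best_pos, best_city = start, city
                break
            # Look for a later occurrence of the same city name.
            start = text.find(city, start + 1)
    return best_city


print(extract_city_sketch("活动在上海举办"))
print(extract_city_sketch("西安路有个活动"))
```

A real implementation would likely need a longer suffix list and handling for district or venue names, which this sketch leaves out.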

API Reference

| Export | Type | Description |
|---|---|---|
| parse_date(s) | str → str | Structured date parsing |
| try_parse_date(s) | str → str or None | Same, returns None on failure |
| extract_date(text) | str → str | Chinese text date extraction |
| CITIES | list[str] | 50 major Chinese cities |
| extract_city(text, extra_cities=None) | str → str | City name extraction (false-positive safe) |
| normalize_city(city) | str → str | City name normalization |
| CATEGORY_ALIASES | dict[str,str] | Category alias mapping |
| guess_category(title) | str → str | Event category guessing (longest-match) |
| UA_POOL | list[str] | 21 modern User-Agent strings |
| random_ua() | → str | Random UA selection |
| stable_id(*parts) | str → str | Deterministic SHA256 short ID |
| RateLimiter | class | Async rate limiter with retry + jitter |
| DeepSeekClient | class | DeepSeek API client (sync/async, retry) |
| NEW SearchResult | dataclass | Web search result (title/url/snippet) |
| NEW search_web(query, n) | → list[SearchResult] | DuckDuckGo web search |
| NEW HotTopic | dataclass | Unified hot topic (title/summary/url/source) |
| NEW scrape_weibo_hot() | → list[HotTopic] | Weibo hot search |
| NEW scrape_zhihu_hot() | → list[HotTopic] | Zhihu hot list |
| NEW scrape_hackernews_top() | → list[HotTopic] | HN top stories |
| NEW ExtractedEvent | dataclass | Structured event (with source tracing) |
| NEW EventExtractor | class | 5-stage LLM extraction pipeline + cache |
| NEW extract_events(texts, ...) | → list[ExtractedEvent] | Convenience extractor |
| ScraperError / RateLimitError / etc. | Exception classes | Typed error hierarchy |
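Two of the descriptions above name a technique: `stable_id` is a "deterministic SHA256 short ID" and `guess_category` uses "longest-match". The sketch below shows what those two ideas could look like in isolation. It is an illustration, not the package's code: the separator, the 16-character length, the alias table, and the `*_sketch` names are all assumptions.

```python
import hashlib

# Hypothetical alias table; the real CATEGORY_ALIASES contents are not shown in this README.
ALIASES_SKETCH = {"展": "展览", "漫展": "漫展", "同人展": "同人展", "演唱会": "演唱会"}


def stable_id_sketch(*parts: str, length: int = 16) -> str:
    """Hash the parts with SHA-256 and keep a short hex prefix."""
    # The unit-separator character keeps ("ab", "c") distinct from ("a", "bc").
    joined = "\x1f".join(parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:length]


def guess_category_sketch(title: str, default: str = "") -> str:
    """Pick the category of the longest alias found in the title."""
    matches = [alias for alias in ALIASES_SKETCH if alias in title]
    if not matches:
        return default
    # Longest match wins, so "同人展" beats the shorter, more generic "展".
    return ALIASES_SKETCH[max(matches, key=len)]


# The same inputs always produce the same ID, which makes it usable as a dedup key.
print(stable_id_sketch("漫展", "上海", "2026-05-04"))
print(guess_category_sketch("上海同人展"))
```

Truncating the digest trades collision resistance for readability, so the prefix length is worth choosing with the expected number of records in mind.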

Full API docs: API_REFERENCE.md


CLI

python -m chinese_scraper_utils search "五一北京漫展" -n 10
python -m chinese_scraper_utils scrape-weibo
python -m chinese_scraper_utils scrape-zhihu
python -m chinese_scraper_utils scrape-hn
python -m chinese_scraper_utils extract posts.json -t "漫展,演唱会" -c 0.5 -v

Related

License

MIT
