Chinese scraping utilities — date parsing, city extraction, SHA256 ID, UA pool, rate limiter, DeepSeek client, web search, hot topic scrapers, LLM extraction pipeline

chinese-scraper-utils

Python 3.11+ · MIT License

Shared Python utilities for Chinese-language web scraping — date parsing, city extraction, stable ID generation, UA rotation, rate limiting, web search, hot topic scrapers, LLM-powered event extraction, and a DeepSeek API client.

Extracted from ComiRadar and weekly-hotspot.


Installation / 安装

# Core (zero extra dependencies beyond httpx + openai)
pip install chinese-scraper-utils

# With web search support (quoted so shells such as zsh do not treat the brackets as a glob)
pip install "chinese-scraper-utils[search]"

Quick Start / 快速开始

from chinese_scraper_utils import (
    # Core utilities
    parse_date, extract_date, extract_city, guess_category, stable_id, random_ua,
    # Web search
    search_web,
    # Hot topic scrapers
    scrape_weibo_hot, scrape_zhihu_hot,
    # LLM extraction
    DeepSeekClient, EventExtractor,
)

# Parse Chinese dates
parse_date("2026/05/20")         # "2026-05-20"
extract_date("5月4日上海有漫展")   # "2026-05-04" (year is not in the text; the library fills it in)

# Extract cities (guards against false positives)
extract_city("活动在上海举办")     # "上海"
extract_city("西安路有个活动")     # "" ("西安路" is a road name, not the city 西安)

# Scrape hot topics
weibo = scrape_weibo_hot()       # list[HotTopic]
zhihu = scrape_zhihu_hot()       # list[HotTopic]

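# LLM-powered event extraction
client = DeepSeekClient(api_key="sk-xxx")
extractor = EventExtractor(
    client=client,
    event_types=["漫展", "同人展", "演唱会"],  # anime convention, doujin fair, concert
    min_confidence=0.5,
)
# Input is a list of raw post texts (here: a May Day anime carnival in Beijing)
events = extractor.extract(["五一北京漫展嘉年华在国家会议中心..."])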

# CLI usage
# python -m chinese_scraper_utils search "五一漫展"
# python -m chinese_scraper_utils scrape-weibo
# python -m chinese_scraper_utils extract posts.json -t "漫展,演唱会"
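
The false-positive guard shown above for extract_city("西安路有个活动") can be approximated as follows. This is a minimal sketch and not the package's actual implementation: the name extract_city_sketch, the four-city list, and the street-suffix rule are all assumptions made for illustration.

```python
# Hypothetical sketch; the real extract_city may use different rules.
CITIES_SUBSET = ["上海", "北京", "西安", "广州"]  # illustrative subset, not the package's CITIES

# Characters that, directly after a city name, suggest a street or place name
# rather than the city itself (e.g. 西安路 is "Xi'an Road").
STREET_SUFFIXES = ("路", "街", "大道", "巷")


def extract_city_sketch(text: str) -> str:
    """Return the earliest listed city in text that is not part of a street name."""
    best_pos, best_city = len(text), ""
    for city in CITIES_SUBSET:
        start = text.find(city)
        while start != -1:
            rest = text[start + len(city):]
            if not rest.startswith(STREET_SUFFIXES):
                # Genuine mention: keep it if it is the earliest one so far.
                if start < best_pos:
                    best_pos, best_city = start, city
                break
            # Looks like a street name; look for a later occurrence.
            start = text.find(city, start + 1)
    return best_city
```

With this sketch, extract_city_sketch("活动在上海举办") returns "上海" and extract_city_sketch("西安路有个活动") returns "", matching the behaviour shown in the Quick Start.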

API Reference / API 参考

| Export / 导出项 | Type | Description |
|---|---|---|
| parse_date(s) | str → str | Structured date parsing |
| try_parse_date(s) | str → Optional[str] | Same, returns None on failure |
| extract_date(text) | str → str | Chinese text date extraction |
| CITIES | list[str] | 50 major Chinese cities |
| extract_city(text, extra_cities=None) | str → str | City name extraction (false-positive safe) |
| normalize_city(city) | str → str | City name normalization |
| CATEGORY_ALIASES | dict[str,str] | Category alias mapping |
| guess_category(title) | str → str | Event category guessing (longest-match) |
| UA_POOL | list[str] | 21 modern User-Agent strings |
| random_ua() | → str | Random UA selection |
| stable_id(*parts) | str → str | Deterministic SHA256 short ID |
| RateLimiter | class | Async rate limiter with retry + jitter |
| DeepSeekClient | class | DeepSeek API client (sync/async, retry) |
| NEW SearchResult | dataclass | Web search result (title/url/snippet) |
| NEW search_web(query, n) | → list[SearchResult] | DuckDuckGo web search |
| NEW HotTopic | dataclass | Unified hot topic (title/summary/url/source) |
| NEW scrape_weibo_hot() | → list[HotTopic] | Weibo hot search |
| NEW scrape_zhihu_hot() | → list[HotTopic] | Zhihu hot list |
| NEW scrape_hackernews_top() | → list[HotTopic] | HN top stories |
| NEW ExtractedEvent | dataclass | Structured event (with source tracing) |
| NEW EventExtractor | class | 5-stage LLM extraction pipeline + cache |
| NEW extract_events(texts, ...) | → list[ExtractedEvent] | Convenience extractor |
| ScraperError / RateLimitError / etc. | Exception classes | Typed error hierarchy |
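
To illustrate what a "deterministic SHA256 short ID" can look like, here is a minimal sketch built on the standard library. The separator, the 16-character length, and the name stable_id_sketch are assumptions; the package's stable_id may join and truncate differently.

```python
import hashlib


def stable_id_sketch(*parts: str, length: int = 16) -> str:
    """Hash the parts with SHA256 and keep the first `length` hex characters.

    Hypothetical sketch: the separator and default length are assumptions.
    """
    # Join with a control character that is unlikely to occur in scraped text,
    # so ("ab", "c") and ("a", "bc") do not collapse into the same input.
    joined = "\x1f".join(parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:length]
```

The same parts always produce the same ID across runs and machines, which is what makes such an ID usable as a de-duplication key for scraped items.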

Full API docs: API_REFERENCE.md


CLI / 命令行

python -m chinese_scraper_utils search "五一北京漫展" -n 10
python -m chinese_scraper_utils scrape-weibo
python -m chinese_scraper_utils scrape-zhihu
python -m chinese_scraper_utils scrape-hn
python -m chinese_scraper_utils extract posts.json -t "漫展,演唱会" -c 0.5 -v
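
For readers who want to see how a command line of this shape is typically parsed, here is a sketch using argparse. Only the subcommand names and the short flags -n, -t, -c and -v come from the examples above. The meanings assigned to them (result count, comma-separated event types, minimum confidence, verbose), the defaults, and the destination names are assumptions, not the package's actual CLI code.

```python
import argparse


def split_types(value: str) -> list[str]:
    """Split a comma-separated string such as "漫展,演唱会" into a list."""
    return [part.strip() for part in value.split(",") if part.strip()]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the subcommands shown above."""
    parser = argparse.ArgumentParser(prog="chinese_scraper_utils")
    sub = parser.add_subparsers(dest="command", required=True)

    search = sub.add_parser("search")
    search.add_argument("query")
    # Assumed meaning: number of results; the default of 10 is a guess.
    search.add_argument("-n", dest="n", type=int, default=10)

    # The scrape subcommands take no arguments in the examples above.
    for name in ("scrape-weibo", "scrape-zhihu", "scrape-hn"):
        sub.add_parser(name)

    extract = sub.add_parser("extract")
    extract.add_argument("path")
    # Assumed meanings: event types, minimum confidence, verbose output.
    extract.add_argument("-t", dest="types", type=split_types, default=[])
    extract.add_argument("-c", dest="min_confidence", type=float, default=0.5)
    extract.add_argument("-v", dest="verbose", action="store_true")
    return parser
```

Parsing the last example above with build_parser().parse_args(["extract", "posts.json", "-t", "漫展,演唱会", "-c", "0.5", "-v"]) yields types == ["漫展", "演唱会"], min_confidence == 0.5 and verbose == True.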

License / 许可证

MIT
