Chinese scraping utilities — date parsing, city extraction, SHA256 ID, UA pool, rate limiter, DeepSeek client, web search, hot topic scrapers, LLM extraction pipeline

chinese-scraper-utils

Requires Python 3.11+. MIT License.

Shared Python utilities for Chinese-language web scraping — date parsing, city extraction, stable ID generation, UA rotation, rate limiting, and a DeepSeek API client. Extracted from ComiRadar and weekly-cli, where these functions had diverged across two codebases.


Installation / 安装

pip install chinese-scraper-utils

Usage / 使用示例

Stable ID / 稳定 ID

from chinese_scraper_utils import stable_id

uid = stable_id("北京国际动漫展", "北京", "2026-05-04")
# => "3a8f1c9e2d4b6a05"  (SHA256 hex prefix, deterministic across restarts)
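
The comment above describes the ID as a deterministic SHA256 hex prefix. A minimal sketch of that idea follows; the function name `stable_id_sketch`, the unit-separator joining character, and the 16-character length are assumptions made for illustration, not the library's confirmed behavior.

```python
import hashlib


def stable_id_sketch(*parts: str, length: int = 16) -> str:
    """Hypothetical stand-in for stable_id: hash the joined parts, keep a hex prefix."""
    # Join with a separator that is unlikely to appear in the parts, so that
    # ("ab", "c") and ("a", "bc") do not collapse to the same input.
    joined = "\x1f".join(part.strip() for part in parts)
    digest = hashlib.sha256(joined.encode("utf-8")).hexdigest()
    return digest[:length]


uid = stable_id_sketch("北京国际动漫展", "北京", "2026-05-04")
print(uid)  # 16 lowercase hex characters, identical on every run
```

Because the digest depends only on the input strings, the same event yields the same ID across processes and machines, which is what makes it usable as a deduplication key.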

Date Parsing / 日期解析

from chinese_scraper_utils import parse_date, extract_date

# Structured date parsing / 结构化日期解析
parse_date("2026-05-04")           # => "2026-05-04"
parse_date("2026/05/04 14:30:00")  # => "2026-05-04"

# Chinese text date extraction / 中文文本日期提取
extract_date("5月4日上海有漫展")       # => "2026-05-04"
extract_date("2026年5月4日-6日")      # => "2026-05-04"
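
To illustrate how dates like the ones above can be pulled out of Chinese text, here is a rough regex-based sketch. The name `extract_date_sketch` and the explicit `default_year` parameter are assumptions; how the library itself chooses a year when the text omits one is not stated here.

```python
import re
from datetime import date
from typing import Optional

# Optional "YYYY年", then "M月", then "D日" (or the colloquial "D号").
_CN_DATE = re.compile(r"(?:(\d{4})年)?(\d{1,2})月(\d{1,2})[日号]")


def extract_date_sketch(text: str, default_year: int) -> Optional[str]:
    """Hypothetical helper: return the first Chinese-style date as YYYY-MM-DD."""
    match = _CN_DATE.search(text)
    if match is None:
        return None
    year = int(match.group(1)) if match.group(1) else default_year
    try:
        # date() rejects impossible values such as 13月 or 2月30日.
        return date(year, int(match.group(2)), int(match.group(3))).isoformat()
    except ValueError:
        return None


print(extract_date_sketch("5月4日上海有漫展", default_year=2026))   # 2026-05-04
print(extract_date_sketch("2026年5月4日-6日", default_year=2026))  # 2026-05-04
```

For a range such as "5月4日-6日" the sketch returns only the start date, matching the second example above.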

City Extraction & Normalization / 城市提取与规范化

from chinese_scraper_utils import extract_city, normalize_city, CITIES

extract_city("活动在上海举办")      # => "上海"
extract_city("广州天河区")         # => "广州"

normalize_city("上海市")           # => "上海"
normalize_city("  深圳市  ")       # => "深圳"
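
The behavior shown above can be approximated with a list scan plus suffix stripping. This is only a sketch: `CITIES_SKETCH` is a four-entry stand-in for the library's `CITIES` list, and both function names are hypothetical.

```python
from typing import Optional

# Illustrative subset; the real CITIES list is described as having 52 entries.
CITIES_SKETCH = ["北京", "上海", "广州", "深圳"]


def normalize_city_sketch(city: str) -> str:
    """Trim whitespace and drop a trailing 市 ("city") suffix."""
    name = city.strip()
    if len(name) > 1 and name.endswith("市"):
        name = name[:-1]
    return name


def extract_city_sketch(text: str) -> Optional[str]:
    """Return the known city that appears earliest in the text, if any."""
    hits = [(text.find(city), city) for city in CITIES_SKETCH if city in text]
    return min(hits)[1] if hits else None


print(extract_city_sketch("活动在上海举办"))   # 上海
print(normalize_city_sketch("  深圳市  "))    # 深圳
```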

Category Guessing / 类别猜测

from chinese_scraper_utils import guess_category

guess_category("五一漫展嘉年华")   # => "漫展"
guess_category("初音未来演唱会")   # => "演唱会"
guess_category("清明上河图展览")   # => "展览"
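
One plausible way to get results like these is a keyword-to-category alias table checked against the title. The sketch below assumes that approach; the `ALIASES_SKETCH` entries, the fallback value "其他", and the function name are illustrative and not taken from the library's `CATEGORY_ALIASES`.

```python
# Hypothetical alias table: keyword found in the title -> canonical category.
ALIASES_SKETCH = {
    "漫展": "漫展",
    "同人展": "漫展",
    "演唱会": "演唱会",
    "音乐会": "演唱会",
    "展览": "展览",
}


def guess_category_sketch(title: str, default: str = "其他") -> str:
    """Return the category of the longest keyword contained in the title."""
    # Longer keywords are tried first so a specific term wins over a shorter one.
    for keyword in sorted(ALIASES_SKETCH, key=len, reverse=True):
        if keyword in title:
            return ALIASES_SKETCH[keyword]
    return default


print(guess_category_sketch("五一漫展嘉年华"))  # 漫展
print(guess_category_sketch("初音未来演唱会"))  # 演唱会
print(guess_category_sketch("清明上河图展览"))  # 展览
```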

Random User-Agent / 随机 UA

from chinese_scraper_utils import random_ua, UA_POOL

random_ua()  # => "Mozilla/5.0 (Windows NT 10.0; ..."

Async Rate Limiter / 异步速率限制

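import asyncio

import httpx  # third-party HTTP client used in this example
from chinese_scraper_utils import RateLimiter

# min_interval: minimum spacing between requests (presumably in seconds)
limiter = RateLimiter(min_interval=1.0)

async def fetch():
    async with httpx.AsyncClient() as client:
        # Pass a zero-argument callable so the request is created at call time
        resp = await limiter.fetch_with_retry(
            lambda: client.get("https://example.com")
        )
        return resp.text

asyncio.run(fetch())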
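
For readers curious what a minimum-interval limiter with retry might look like internally, here is a small self-contained sketch. `MinIntervalLimiter`, its `retries` and `backoff` parameters, and the exponential backoff policy are assumptions for illustration; the actual `RateLimiter` may differ.

```python
import asyncio
import time
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class MinIntervalLimiter:
    """Hypothetical stand-in for RateLimiter: spaces calls at least min_interval apart."""

    def __init__(self, min_interval: float = 1.0) -> None:
        self.min_interval = min_interval
        self._lock = asyncio.Lock()
        self._last_call = float("-inf")  # so the first call never waits

    async def _wait_turn(self) -> None:
        # The lock serializes callers, so concurrent tasks queue up in order.
        async with self._lock:
            delay = self._last_call + self.min_interval - time.monotonic()
            if delay > 0:
                await asyncio.sleep(delay)
            self._last_call = time.monotonic()

    async def fetch_with_retry(
        self,
        make_request: Callable[[], Awaitable[T]],
        retries: int = 3,
        backoff: float = 0.5,
    ) -> T:
        for attempt in range(retries):
            await self._wait_turn()
            try:
                return await make_request()
            except Exception:
                if attempt == retries - 1:
                    raise  # out of attempts: surface the last error
                await asyncio.sleep(backoff * 2 ** attempt)
        raise ValueError("retries must be at least 1")
```

Taking a callable rather than an already-created coroutine matters here: a coroutine can only be awaited once, so each retry needs a fresh request.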

DeepSeek AI Client / DeepSeek AI 客户端

from chinese_scraper_utils import DeepSeekClient

client = DeepSeekClient(api_key="sk-xxx")
result = client.chat_json([
    {"role": "user", "content": "提取活动信息:北京五一漫展"}
])
# => {"name": "...", "date": "...", "city": "..."}
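
A method like `chat_json` has to turn the model's text reply into a Python dict, and chat models sometimes wrap JSON in a Markdown code fence. The sketch below shows one way to handle that step; `parse_json_reply` is a hypothetical helper, and nothing here is confirmed to be how `DeepSeekClient` works.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_json_reply(reply: str) -> dict:
    """Hypothetical helper: decode a JSON object from a raw model reply."""
    lines = reply.strip().splitlines()
    # Drop a surrounding code fence (with or without a language tag) if present.
    if len(lines) >= 2 and lines[0].startswith(FENCE) and lines[-1].strip() == FENCE:
        lines = lines[1:-1]
    data = json.loads("\n".join(lines))
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data


reply = FENCE + 'json\n{"name": "北京五一漫展", "city": "北京"}\n' + FENCE
print(parse_json_reply(reply)["city"])  # 北京
```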

API Reference / API 参考

| Export / 导出项 | Type / 类型 | Description / 描述 |
| --- | --- | --- |
| stable_id(*parts) | str -> str | Deterministic SHA256 short ID / 确定性 SHA256 短 ID |
| parse_date(s) | str -> str | Structured date parsing / 结构化日期解析 |
| extract_date(text) | str -> str | Chinese text date extraction / 中文文本日期提取 |
| CITIES | list[str] | 52 major Chinese cities / 52 个主要中国城市 |
| extract_city(text) | str -> str | Chinese city name extraction / 城市名提取 |
| normalize_city(city) | str -> str | City name normalization (strip suffix) / 城市名规范化 |
| CATEGORY_ALIASES | dict[str, str] | Category alias mapping / 类别别名映射 |
| guess_category(title) | str -> str | Category guessing from title / 根据标题猜测类别 |
| UA_POOL | list[str] | User-Agent pool / User-Agent 池 |
| random_ua() | str | Random UA selection / 随机返回 UA |
| RateLimiter | class | Async rate limiter with retry / 异步速率限制器 |
| DeepSeekClient | class | DeepSeek API wrapper / DeepSeek API 封装客户端 |

Related / 相关项目

  • ComiRadar — Anime event scraper using this library
  • weekly-hotspot — Weekly hot topics analysis using this library

License / 许可证

MIT

