Chinese scraping utilities — date parsing, city extraction, SHA256 ID, UA pool, rate limiter, DeepSeek client

chinese-scraper-utils

Requires Python 3.11+. Released under the MIT License.

Shared Python utilities for Chinese-language web scraping: date parsing, city extraction, stable ID generation, User-Agent rotation, rate limiting, and a DeepSeek API client. The code was extracted from ComiRadar and weekly-cli, where separate copies of these functions had diverged.


Installation

pip install chinese-scraper-utils

Usage

Stable ID

from chinese_scraper_utils import stable_id

uid = stable_id("北京国际动漫展", "北京", "2026-05-04")
# => "3a8f1c9e2d4b6a05"  (SHA256 hex prefix, deterministic across restarts)
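
For readers curious how such an ID can be derived, here is a minimal sketch. It is not the package's actual implementation: `stable_id_sketch` is a hypothetical name, and both the `|` separator and the 16-character prefix (chosen to match the length of the example above) are assumptions.

```python
import hashlib


def stable_id_sketch(*parts: str, length: int = 16) -> str:
    """Hypothetical re-implementation: hash the joined parts, keep a hex prefix."""
    # Assumption: parts are stripped and joined with "|" before hashing.
    joined = "|".join(part.strip() for part in parts)
    digest = hashlib.sha256(joined.encode("utf-8")).hexdigest()
    return digest[:length]


uid = stable_id_sketch("北京国际动漫展", "北京", "2026-05-04")
print(uid, len(uid))
```

SHA256 depends only on its input, so the result is the same in every process. Python's built-in `hash()` on strings, by contrast, is randomized per process by default and is unsuitable for persistent IDs.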

Date Parsing

from chinese_scraper_utils import parse_date, extract_date

# Structured date parsing
parse_date("2026-05-04")           # => "2026-05-04"
parse_date("2026/05/04 14:30:00")  # => "2026-05-04"

# Chinese text date extraction
extract_date("5月4日上海有漫展")       # => "2026-05-04"
extract_date("2026年5月4日-6日")      # => "2026-05-04"

City Extraction & Normalization

from chinese_scraper_utils import extract_city, normalize_city, CITIES

extract_city("活动在上海举办")      # => "上海"
extract_city("广州天河区")         # => "广州"

normalize_city("上海市")           # => "上海"
normalize_city("  深圳市  ")       # => "深圳"

Category Guessing

from chinese_scraper_utils import guess_category

guess_category("五一漫展嘉年华")   # => "漫展"
guess_category("初音未来演唱会")   # => "演唱会"
guess_category("清明上河图展览")   # => "展览"
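
Keyword lookup against an alias table is enough to reproduce the three examples above. This is a sketch only: the table contents, the `"其他"` fallback, and the names `CATEGORY_ALIASES_SKETCH` and `guess_category_sketch` are assumptions, not the package's real data.

```python
# Assumed alias table: keyword -> canonical category. Order matters because
# the first keyword found in the title wins.
CATEGORY_ALIASES_SKETCH = {
    "漫展": "漫展",
    "同人展": "漫展",
    "演唱会": "演唱会",
    "巡演": "演唱会",
    "展览": "展览",
}


def guess_category_sketch(title: str, default: str = "其他") -> str:
    """Return the category of the first alias keyword contained in title."""
    for keyword, category in CATEGORY_ALIASES_SKETCH.items():
        if keyword in title:
            return category
    return default  # assumption: unknown titles fall back to a catch-all


for title in ("五一漫展嘉年华", "初音未来演唱会", "清明上河图展览"):
    print(title, guess_category_sketch(title))
```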

Random User-Agent

from chinese_scraper_utils import random_ua, UA_POOL

random_ua()  # => "Mozilla/5.0 (Windows NT 10.0; ..."
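
A typical use is to put the chosen value into the request headers. The sketch below uses only the standard library and a two-entry stand-in pool (`UA_POOL_SKETCH`, a hypothetical name). It builds a request object but does not send anything.

```python
import random
import urllib.request

# Stand-in pool for illustration; the package ships its own UA_POOL.
UA_POOL_SKETCH = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]


def build_request(url: str) -> urllib.request.Request:
    """Attach a randomly chosen User-Agent to a request (nothing is sent here)."""
    headers = {"User-Agent": random.choice(UA_POOL_SKETCH)}
    return urllib.request.Request(url, headers=headers)


req = build_request("https://example.com")
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```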

Async Rate Limiter

import asyncio

import httpx  # used only for the HTTP call in this example
from chinese_scraper_utils import RateLimiter

limiter = RateLimiter(min_interval=1.0)

async def fetch():
    async with httpx.AsyncClient() as client:
        resp = await limiter.fetch_with_retry(
            lambda: client.get("https://example.com")
        )
        return resp.text

print(asyncio.run(fetch()))
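
To make the idea concrete, here is a minimal sketch of a minimum-interval limiter with retries. It is a hypothetical class (`MinIntervalLimiter`), not the package's `RateLimiter`. The `retries` and `backoff` parameters and the exponential backoff policy are assumptions made for illustration.

```python
import asyncio
import time
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


class MinIntervalLimiter:
    """Hypothetical limiter: spaces calls at least min_interval seconds apart."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._lock = asyncio.Lock()
        self._last_call = float("-inf")

    async def wait(self) -> None:
        async with self._lock:  # serialize callers so the spacing is respected
            delay = self._last_call + self.min_interval - time.monotonic()
            if delay > 0:
                await asyncio.sleep(delay)
            self._last_call = time.monotonic()

    async def fetch_with_retry(
        self, call: Callable[[], Awaitable[T]], retries: int = 3, backoff: float = 0.5
    ) -> T:
        for attempt in range(retries):
            await self.wait()
            try:
                return await call()
            except Exception:
                if attempt == retries - 1:
                    raise  # out of attempts: surface the last error
                await asyncio.sleep(backoff * 2**attempt)  # exponential backoff
        raise RuntimeError("retries must be at least 1")
```

Passing a zero-argument callable, as in the `lambda` above, lets the limiter create a fresh coroutine for every attempt. An already-created coroutine object can only be awaited once.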

DeepSeek AI Client

from chinese_scraper_utils import DeepSeekClient

client = DeepSeekClient(api_key="sk-xxx")
result = client.chat_json([
    {"role": "user", "content": "提取活动信息:北京五一漫展"}
])
# => {"name": "...", "date": "...", "city": "..."}
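
A JSON-returning chat helper usually comes down to two steps: build a request body that asks for JSON output, then decode the assistant message. The sketch below shows both steps offline with a hand-written response. The endpoint URL, the `deepseek-chat` model name, and the `response_format` field reflect DeepSeek's OpenAI-compatible API as assumed here. `build_payload` and `parse_json_reply` are hypothetical helpers, not part of this package.

```python
import json

# Assumed endpoint of DeepSeek's OpenAI-compatible chat API (not called here).
API_URL = "https://api.deepseek.com/chat/completions"


def build_payload(messages: list[dict], model: str = "deepseek-chat") -> dict:
    """Request body asking the model to reply with a single JSON object."""
    return {
        "model": model,
        "messages": messages,
        "response_format": {"type": "json_object"},
    }


def parse_json_reply(response: dict) -> dict:
    """Pull the assistant message out of a chat-completion response and decode it."""
    content = response["choices"][0]["message"]["content"]
    return json.loads(content)


payload = build_payload([{"role": "user", "content": "提取活动信息:北京五一漫展"}])

# Hand-written stand-in for a server response, used only to exercise the parser.
fake_response = {
    "choices": [{"message": {"role": "assistant", "content": '{"city": "北京"}'}}]
}
print(parse_json_reply(fake_response))
```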

API Reference

| Export | Type | Description |
|---|---|---|
| stable_id(*parts) | str -> str | Deterministic SHA256 short ID |
| parse_date(s) | str -> str | Structured date parsing |
| extract_date(text) | str -> str | Chinese text date extraction |
| CITIES | list[str] | 52 major Chinese cities |
| extract_city(text) | str -> str | Chinese city name extraction |
| normalize_city(city) | str -> str | City name normalization (strip suffix) |
| CATEGORY_ALIASES | dict[str, str] | Category alias mapping |
| guess_category(title) | str -> str | Category guessing from title |
| UA_POOL | list[str] | User-Agent pool |
| random_ua() | -> str | Random UA selection |
| RateLimiter | class | Async rate limiter with retry |
| DeepSeekClient | class | DeepSeek API wrapper |

License

MIT

Download files

Source distribution: chinese_scraper_utils-0.1.0.tar.gz (9.8 kB)

Built distribution: chinese_scraper_utils-0.1.0-py3-none-any.whl (10.5 kB, Python 3)
