A Python toolkit for web scraping and data processing, covering encryption and decryption, data storage, asynchronous downloads, and font parsing

SpiderKit

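A Python toolkit for web scraping and data processing, covering encryption and decryption, data storage, asynchronous downloads, anti-scraping font parsing, and common hash utilities.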

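Features

  • Encryption and decryption: RSA (including chunked encryption of long text), AES/DES/3DES, with multiple modes and output formats
  • Data storage: save as CSV, JSON, or JSONL, with support for appending to existing files
  • Asynchronous downloads: high-performance concurrent downloads, with merging of M3U8 video segments
  • Font parsing: parse anti-scraping font files and build character maps
  • Hash utilities: common digest algorithms with multiple output formats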

Installation

pip install spiderkit

Install from source:

pip install -e .

Requirements

  • Python 3.11+
  • Optional dependency: ffmpeg (required for M3U8 video merging and transcoding)

Modules and core API

  • spiderkit.crypto
    • generate_rsa_keypair
    • rsa_encrypt / rsa_encrypt_long / rsa_decrypt / rsa_algorithm
    • aes_encrypt / aes_decrypt / des_encrypt / des_decrypt / des3_encrypt / des3_decrypt
  • spiderkit.downloader
    • Downloader / M3U8Downloader
  • spiderkit.storage
    • save_data_to_file
  • spiderkit.utils
    • parse_font / decrypt_text_with_font_maps / FontParseConfig
    • md5 / sha1 / sha224 / sha256 / sha384 / sha512 / sha3_256
    • blake2b / blake2s
  • spiderkit.config
    • SpiderKitConfig / get_config / set_config

Quick start

Encryption and decryption

import os
from spiderkit.crypto import (
    generate_rsa_keypair,
    rsa_encrypt,
    rsa_decrypt,
    aes_encrypt,
    aes_decrypt,
)

plaintext = "Hello SpiderKit!"

# RSA encryption and decryption
public_key, private_key = generate_rsa_keypair()
rsa_encrypted = rsa_encrypt(plaintext, public_key, "OAEP")
print(rsa_encrypted)
rsa_decrypted = rsa_decrypt(rsa_encrypted, private_key, "OAEP")
print(rsa_decrypted)

# AES encryption and decryption (32-byte key, 16-byte IV)
aes_key = os.urandom(32)
aes_iv = os.urandom(16)
aes_encrypted = aes_encrypt(plaintext, aes_key, "CBC", iv=aes_iv)
print(aes_encrypted)
aes_decrypted = aes_decrypt(aes_encrypted, aes_key, "CBC", iv=aes_iv)
print(aes_decrypted)
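
The API list includes rsa_encrypt_long for long text. The reason such a helper is needed is that a single RSA-OAEP operation can only take k - 2*hLen - 2 bytes of plaintext, where k is the key size in bytes and hLen the hash output size. The sketch below only illustrates that limit and the splitting step with the standard library; the function names are hypothetical, and the 2048-bit key and SHA-256 hash are assumptions, not SpiderKit's documented defaults.

```python
def oaep_max_chunk(key_bits: int = 2048, hash_len: int = 32) -> int:
    """Largest plaintext (in bytes) that one RSA-OAEP operation accepts."""
    key_bytes = key_bits // 8
    return key_bytes - 2 * hash_len - 2


def split_for_rsa(data: bytes, chunk_size: int) -> list[bytes]:
    """Split data into consecutive pieces of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


# 680 bytes of input, too long for a single OAEP operation.
message = ("Hello SpiderKit! " * 40).encode("utf-8")
limit = oaep_max_chunk()  # assumed 2048-bit key with SHA-256
chunks = split_for_rsa(message, limit)
print(limit, [len(c) for c in chunks])
```

Each chunk would then be encrypted separately and the ciphertexts concatenated; how rsa_encrypt_long actually frames its output is not stated in this README.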

Asynchronous downloads

from spiderkit.downloader import Downloader, M3U8Downloader

# Optional request headers (some sites use hotlink protection and require a Referer)
headers = {
    "Referer": "https://www.example.com/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.6261.98 Safari/537.36"
}

# Regular file downloads (mapping of local path -> URL)
downloader = Downloader(headers=headers)
file_mapping = {
    "images/image1.jpg": "https://example.com/image1.jpg",
    "images/image2.jpg": "https://example.com/image2.jpg"
}
downloader.download_files(file_mapping)

# M3U8 video download (requires ffmpeg)
m3u8_downloader = M3U8Downloader(headers=headers)
m3u8_downloader.download_video("https://example.com/video.m3u8", "output_video.mp4")
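
For readers unfamiliar with M3U8, the playlist is a plain-text file whose non-comment lines name the video segments to fetch and merge. The following is a minimal standard-library sketch of that idea, not the code M3U8Downloader uses; list_segments and the sample playlist are hypothetical, and it ignores master playlists, encryption keys, and byte ranges.

```python
from urllib.parse import urljoin


def list_segments(playlist_text: str, playlist_url: str) -> list[str]:
    """Return absolute segment URLs from a simple media playlist."""
    segments = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Lines starting with '#' are tags or comments, not segment URIs.
        if not line or line.startswith("#"):
            continue
        # Relative URIs are resolved against the playlist's own URL.
        segments.append(urljoin(playlist_url, line))
    return segments


sample = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXTINF:9.6,
seg0.ts
#EXTINF:9.6,
seg1.ts
#EXT-X-ENDLIST
"""

urls = list_segments(sample, "https://example.com/video/index.m3u8")
print(urls)
```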

Font parsing

from spiderkit.utils import parse_font, decrypt_text_with_font_maps

# Parse a font from a local file path or a URL
# font_maps = parse_font("fonts/font.woff")
font_maps = parse_font("https://example.com/font.woff")

# Decode the obfuscated text
encrypted_text = "加密的文本"  # placeholder for the obfuscated text
decrypted_text = decrypt_text_with_font_maps(encrypted_text, font_maps)
print(decrypted_text)
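
Conceptually, decoding font-obfuscated text means replacing each obfuscated code point with the character its glyph really draws. The sketch below shows that substitution with str.translate. The structure returned by parse_font is not documented here, so glyph_map is a hypothetical flat mapping and apply_glyph_map is an illustrative helper, not part of SpiderKit.

```python
# Hypothetical mapping: obfuscated (private-use) code point -> real character.
glyph_map = {"\ue001": "1", "\ue002": "2", "\ue003": "3"}


def apply_glyph_map(text: str, mapping: dict[str, str]) -> str:
    """Replace every mapped character in text; leave the rest untouched."""
    table = str.maketrans(mapping)
    return text.translate(table)


decoded = apply_glyph_map("price: \ue003\ue001\ue002", glyph_map)
print(decoded)
```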

Hash computation

from spiderkit.utils import md5, sha1, sha256, sha512, sha3_256, blake2b

text = "Hello SpiderKit!"

# Output is hex by default
print(md5(text))
print(sha1(text))
print(sha256(text))
print(sha512(text))

# Other algorithms
print(sha3_256(text))
print(blake2b(text))

# Other output formats: binary / base64
print(md5(text, "binary"))
print(md5(text, "base64"))
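
To cross-check results, the three output formats can be reproduced with the standard library. This sketch assumes the helpers encode text as UTF-8, that "binary" means the raw digest bytes, and that "base64" means standard Base64 of those bytes; none of this is confirmed by the README.

```python
import base64
import hashlib

text = "Hello SpiderKit!"
digest = hashlib.md5(text.encode("utf-8"))

hex_out = digest.hexdigest()                          # 32 hex characters
raw_out = digest.digest()                             # 16 raw bytes
b64_out = base64.b64encode(raw_out).decode("ascii")   # Base64 of the raw bytes

print(hex_out)
print(raw_out)
print(b64_out)
```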

Data storage

from spiderkit.storage import save_data_to_file

data = [
    {"name": "张三", "age": 25},
    {"name": "李四", "age": 30}
]

# Save as CSV
save_data_to_file(data, "users", "csv")

# Save as JSON
save_data_to_file(data, "users", "json")

# Save as JSONL
save_data_to_file(data, "users", "jsonl")
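
JSONL stores one JSON object per line, which is what makes appending cheap. Reading such a file back needs only the standard library; the sketch below writes a small sample to a temporary directory so that it runs on its own. read_jsonl is a hypothetical helper, and the exact file name SpiderKit produces (for example ./data/users.jsonl) is an assumption.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path) -> list[dict]:
    """Load every non-empty line of a JSONL file as one record."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "users.jsonl"
    sample.write_text(
        '{"name": "张三", "age": 25}\n{"name": "李四", "age": 30}\n',
        encoding="utf-8",
    )
    loaded = read_jsonl(sample)

print(loaded)
```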

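Usage notes

  • Downloader.download_files calls asyncio.run internally. If your code is already running inside an event loop, orchestrate the coroutines yourself outside that call.
  • save_data_to_file writes to ./data by default, and its default write mode is a (append).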
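
The asyncio.run restriction applies to any synchronous function that starts its own event loop: calling it from a coroutine raises RuntimeError. One possible workaround, shown here with a stand-in function rather than the real Downloader, is to push the blocking call onto a worker thread with asyncio.to_thread. This is a swapped-in technique for illustration, not something the SpiderKit README prescribes.

```python
import asyncio


def blocking_batch() -> str:
    """Stand-in for a synchronous API that calls asyncio.run internally."""
    async def work() -> str:
        await asyncio.sleep(0)
        return "done"

    return asyncio.run(work())


async def main() -> str:
    # Calling blocking_batch() directly here would raise RuntimeError,
    # because asyncio.run cannot be used inside a running event loop.
    # A worker thread has no running loop, so the call succeeds there.
    return await asyncio.to_thread(blocking_batch)


result = asyncio.run(main())
print(result)
```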

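Configuration

SpiderKit provides a single configuration entry point. Settings can be adjusted at runtime or overridden with environment variables.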

from spiderkit.config import SpiderKitConfig, get_config, set_config

config = get_config()
config.downloader_concurrency = 8
config.storage_default_dir = "./exports"
set_config(config)

Common environment variables:

  • SPIDERKIT_DOWNLOADER_CONCURRENCY
  • SPIDERKIT_DOWNLOADER_TIMEOUT
  • SPIDERKIT_FONT_SIZE
  • SPIDERKIT_FONT_DOWNLOAD_TIMEOUT
  • SPIDERKIT_STORAGE_DEFAULT_DIR
  • SPIDERKIT_STORAGE_DEFAULT_MODE
  • SPIDERKIT_LOG_LEVEL
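
Environment overrides can also be set from Python before the package is loaded, which is convenient in tests. The variable names below come from the list above; the value formats (integers as decimal strings, a standard logging level name) and the assumption that SpiderKit reads them when its configuration is first loaded are guesses, not documented behavior.

```python
import os

# Assumed value formats; check the package source before relying on them.
overrides = {
    "SPIDERKIT_DOWNLOADER_CONCURRENCY": "8",
    "SPIDERKIT_STORAGE_DEFAULT_DIR": "./exports",
    "SPIDERKIT_LOG_LEVEL": "INFO",
}
os.environ.update(overrides)

# Import spiderkit only after this point so the overrides are visible
# when its configuration is first read (assumption).
print({name: os.environ[name] for name in overrides})
```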
