Skip to main content

一个面向爬虫与数据处理场景的 Python 工具包,覆盖加密解密、数据存储、异步下载和字体解析

Project description

SpiderKit

一个面向爬虫与数据处理场景的 Python 工具包,覆盖加密解密、数据存储、异步下载、反爬字体解析与常用哈希工具。

功能概览

  • 加密解密: RSA(含长文本分块)、AES/DES/3DES,多种模式与输出格式
  • 数据存储: CSV、JSON、JSONL 格式保存,支持追加写入
  • 异步下载: 高性能并发下载,支持 M3U8 视频分片合并
  • 字体解析: 解析反爬字体文件并生成字符映射
  • 哈希工具: 常用摘要算法与多种输出格式

安装

pip install spiderkit

从源码安装:

pip install -e .

运行环境

  • Python 3.11+
  • 可选依赖: ffmpeg(M3U8 视频合并与转码需要)

模块与核心 API

  • spiderkit.crypto
    • generate_rsa_keypair
    • rsa_encrypt / rsa_encrypt_long / rsa_decrypt / rsa_algorithm
    • aes_encrypt / aes_decrypt / des_encrypt / des_decrypt / des3_encrypt / des3_decrypt
  • spiderkit.downloader
    • Downloader / M3U8Downloader
  • spiderkit.storage
    • save_data_to_file
  • spiderkit.utils
    • parse_font / decrypt_text_with_font_maps / FontParseConfig
    • md5 / sha1 / sha224 / sha256 / sha384 / sha512 / sha3_256
    • blake2b / blake2s
  • spiderkit.config
    • SpiderKitConfig / get_config / set_config

快速开始

加密解密

import os

from spiderkit.crypto import generate_rsa_keypair, rsa_encrypt, rsa_decrypt, aes_encrypt, aes_decrypt

plaintext = "Hello World!"

# RSA 加密解密
public_key, private_key = generate_rsa_keypair()
rsa_encrypted = rsa_encrypt(plaintext, public_key, "OAEP")
print(rsa_encrypted)
rsa_decrypted = rsa_decrypt(rsa_encrypted, private_key, "OAEP")
print(rsa_decrypted)

# AES 加密解密
aes_key = os.urandom(32)
aes_iv = os.urandom(16)
aes_encrypted = aes_encrypt(plaintext, aes_key, "CBC", iv=aes_iv)
print(aes_encrypted)
aes_decrypted = aes_decrypt(aes_encrypted, aes_key, "CBC", iv=aes_iv)
print(aes_decrypted)

异步下载

from spiderkit.downloader import Downloader, M3U8Downloader

# 可选请求头(部分网站加了防盗链需要 Referer 字段)
headers = {
    "Referer": "https://www.example.com/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.6261.98 Safari/537.36"
}

# 普通文件下载
downloader = Downloader(headers=headers)
file_mapping = {
    "images/image1.jpg": "https://example.com/image1.jpg",
    "images/image2.jpg": "https://example.com/image2.jpg"
}
downloader.download_files(file_mapping)

# M3U8 视频下载(需安装 ffmpeg)
m3u8_downloader = M3U8Downloader(headers=headers)
m3u8_downloader.download_video("https://example.com/video.m3u8", "output_video.mp4")

字体解析

from spiderkit.utils import parse_font, decrypt_text_with_font_maps

# 解析字体文件路径或 URL
# font_maps = parse_font("fonts/font.woff")
font_maps = parse_font("https://example.com/font.woff")

# 解密文本
encrypted_text = "加密的文本"
decrypted_text = decrypt_text_with_font_maps(encrypted_text, font_maps)
print(decrypted_text)

哈希计算

from spiderkit.utils import md5, sha1, sha256, sha512, sha3_256, blake2b

text = "Hello World!"

# 默认输出 hex
print(md5(text))
print(sha1(text))
print(sha256(text))
print(sha512(text))

# 其他算法
print(sha3_256(text))
print(blake2b(text))

# 其他输出格式: binary / base64
print(md5(text, "binary"))
print(md5(text, "base64"))

数据存储

from spiderkit.storage import save_data_to_file

data = [
    {"name": "张三", "age": 25},
    {"name": "李四", "age": 30}
]

# 保存为 CSV
save_data_to_file(data, "users", "csv")

# 保存为 JSON
save_data_to_file(data, "users", "json")

# 保存为 JSONL
save_data_to_file(data, "users", "jsonl")

使用建议

  • Downloader.download_files 内部使用 asyncio.run,若你已处于事件循环中,请在外部自行编排协程。
  • save_data_to_file 默认输出目录为 ./data,写入模式默认 a(追加)。

配置

SpiderKit 提供统一配置入口,可在运行时调整行为或用环境变量覆盖。

from spiderkit.config import get_config, set_config

config = get_config()
config.downloader_concurrency = 8
config.storage_default_dir = "./exports"
set_config(config)

常用环境变量:

  • SPIDERKIT_DOWNLOADER_CONCURRENCY
  • SPIDERKIT_DOWNLOADER_TIMEOUT
  • SPIDERKIT_FONT_SIZE
  • SPIDERKIT_FONT_DOWNLOAD_TIMEOUT
  • SPIDERKIT_STORAGE_DEFAULT_DIR
  • SPIDERKIT_STORAGE_DEFAULT_MODE
  • SPIDERKIT_LOG_LEVEL

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

spiderkit-0.1.8.tar.gz (15.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

spiderkit-0.1.8-py3-none-any.whl (21.0 kB view details)

Uploaded Python 3

File details

Details for the file spiderkit-0.1.8.tar.gz.

File metadata

  • Download URL: spiderkit-0.1.8.tar.gz
  • Upload date:
  • Size: 15.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.9 {"installer":{"name":"uv","version":"0.9.9"},"python":null,"implementation":{"name":null,"version":null},"distro":null,"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for spiderkit-0.1.8.tar.gz
Algorithm Hash digest
SHA256 bc7dfc45af44447cf0a66ee8643fba9b69273452ec4b141dea4007c9e9cb4539
MD5 6a4595da753bdd8cfcf48f06ccbab23d
BLAKE2b-256 fa7d43298dd505308d87042ba61e8d89761a6aab3afb98923c72a7659fdd5bce

See more details on using hashes here.

File details

Details for the file spiderkit-0.1.8-py3-none-any.whl.

File metadata

  • Download URL: spiderkit-0.1.8-py3-none-any.whl
  • Upload date:
  • Size: 21.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.9 {"installer":{"name":"uv","version":"0.9.9"},"python":null,"implementation":{"name":null,"version":null},"distro":null,"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for spiderkit-0.1.8-py3-none-any.whl
Algorithm Hash digest
SHA256 bb7e812342961528364db21703831bb24172dffd22a544c03e9a07f68f442cae
MD5 7096c4eca2bed5211ca95f379012c550
BLAKE2b-256 e7b6a86ba4a0bf415948d403e0883fbf7e3aacde7c286a20cdbbbad9125dc9b2

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page