Crawlo is a high-performance Python crawler framework built on asynchronous I/O, with support for distributed crawling.

Project description

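Crawlo Crawler Framework

Crawlo is a high-performance, extensible Python crawler framework that supports both single-machine and distributed deployment.

Features

  • High-performance asynchronous crawling
  • Multiple downloader backends (aiohttp, httpx, curl-cffi)
  • Built-in data cleaning and validation
  • Distributed crawling support
  • Flexible middleware system
  • Powerful configuration management system
  • Detailed logging and monitoring
  • Compatible with Windows and Linux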
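
To make "asynchronous crawling" concrete, here is a minimal sketch of the general pattern using only the standard library. It is not Crawlo's implementation: `fetch`, `crawl`, and `max_concurrency` are hypothetical names, and the download is simulated with a short sleep instead of a real HTTP request.

```python
import asyncio


async def fetch(url: str) -> str:
    """Stand-in for a real HTTP download; sleeps briefly instead of doing I/O."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def crawl(urls, max_concurrency: int = 5):
    """Fetch all URLs concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url):
        # The semaphore caps how many downloads run at the same time.
        async with semaphore:
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded_fetch(u) for u in urls))


pages = asyncio.run(crawl([f"https://example.com/{i}" for i in range(10)]))
print(len(pages))
```

Because the downloads overlap while they wait, ten simulated requests finish in roughly two sleep intervals rather than ten.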

Installation

pip install crawlo

Or install from source:

git clone https://github.com/your-username/crawlo.git
cd crawlo
pip install -r requirements.txt
pip install .

Quick Start

from crawlo import Spider

class MySpider(Spider):
    name = 'example'
    
    def parse(self, response):
        # Parsing logic goes here
        pass

# Run the spider from the command line:
# crawlo run example

Logging System

Crawlo includes a logging system with a range of configuration options:

Basic Configuration

from crawlo.logging import configure_logging, get_logger

# Configure the logging system
configure_logging(
    LOG_LEVEL='INFO',
    LOG_FILE='logs/app.log',
    LOG_MAX_BYTES=10*1024*1024,  # 10 MB
    LOG_BACKUP_COUNT=5
)

# Get a logger
logger = get_logger('my_module')
logger.info('This is a log message')
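
For readers unfamiliar with size-based rotation, the sketch below shows what a maximum size and a backup count mean, using the standard library's `RotatingFileHandler`. It assumes, without confirmation from the source, that `LOG_MAX_BYTES` and `LOG_BACKUP_COUNT` behave like `maxBytes` and `backupCount`. `build_rotating_logger` is a hypothetical helper, and the tiny 200-byte limit is chosen only to force rotation quickly.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def build_rotating_logger(name, log_file, max_bytes, backup_count, level="INFO"):
    """Hypothetical helper: a logger that writes to a size-rotated file."""
    os.makedirs(os.path.dirname(log_file) or ".", exist_ok=True)
    handler = RotatingFileHandler(
        log_file, maxBytes=max_bytes, backupCount=backup_count, encoding="utf-8"
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.addHandler(handler)
    logger.propagate = False  # keep demo output out of the root logger
    return logger


log_path = os.path.join(tempfile.mkdtemp(), "logs", "app.log")
demo_logger = build_rotating_logger(
    "rotation_demo", log_path, max_bytes=200, backup_count=2
)
for i in range(50):
    demo_logger.info("message number %d", i)

# The active file plus at most backup_count rotated copies remain on disk.
rotated = sorted(os.listdir(os.path.dirname(log_path)))
print(rotated)
```

Older rotated files beyond the backup count are deleted, so disk usage stays bounded at roughly `max_bytes * (backup_count + 1)`.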

Advanced Configuration

# Set separate log levels for the console and the file
configure_logging(
    LOG_LEVEL='INFO',
    LOG_CONSOLE_LEVEL='WARNING',  # Console shows WARNING and above only
    LOG_FILE_LEVEL='DEBUG',       # File records DEBUG and above
    LOG_FILE='logs/app.log',
    LOG_INCLUDE_THREAD_ID=True,   # Include the thread ID
    LOG_INCLUDE_PROCESS_ID=True   # Include the process ID
)

# Module-specific log levels
configure_logging(
    LOG_LEVEL='WARNING',
    LOG_LEVELS={
        'my_module.debug': 'DEBUG',
        'my_module.info': 'INFO'
    }
)
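
The effect of per-destination and per-module levels can be illustrated with plain standard-library logging. This is a sketch of the concept only, not Crawlo's internals: two in-memory streams stand in for the console and the log file, and the logger names `levels_demo` and `levels_demo.quiet` are made up for the example.

```python
import io
import logging

# In-memory streams stand in for the console and the log file.
console_stream = io.StringIO()
file_stream = io.StringIO()

console_handler = logging.StreamHandler(console_stream)
console_handler.setLevel(logging.WARNING)  # console: WARNING and above
file_handler = logging.StreamHandler(file_stream)
file_handler.setLevel(logging.DEBUG)       # file: DEBUG and above

base = logging.getLogger("levels_demo")
base.setLevel(logging.DEBUG)
base.propagate = False
base.addHandler(console_handler)
base.addHandler(file_handler)

# A module-specific level overrides the level inherited from the parent.
quiet = logging.getLogger("levels_demo.quiet")
quiet.setLevel(logging.WARNING)

base.debug("debug detail")    # reaches the file only
base.warning("something odd") # reaches both destinations
quiet.info("suppressed")      # dropped by the module-specific level
quiet.error("kept")           # reaches both destinations

print(console_stream.getvalue())
print(file_stream.getvalue())
```

A record has to pass the logger's level first and then each handler's level, which is why the file can be more verbose than the console.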

Performance Monitoring

from crawlo.logging import get_monitor

# Enable logging performance monitoring
monitor = get_monitor()
monitor.enable_monitoring()

# Get the performance report
report = monitor.get_performance_report()
print(report)
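
The source does not describe what the performance report contains. As a rough idea of what logging performance monitoring can measure, the sketch below wraps a standard-library handler to count records per level and time each emit. `TimedHandler` and the report keys are hypothetical and are not taken from Crawlo.

```python
import logging
import time
from collections import Counter


class TimedHandler(logging.Handler):
    """Wraps another handler, counting records and timing each emit call."""

    def __init__(self, inner):
        super().__init__(inner.level)
        self.inner = inner
        self.counts = Counter()
        self.total_seconds = 0.0

    def emit(self, record):
        start = time.perf_counter()
        try:
            self.inner.handle(record)
        finally:
            self.total_seconds += time.perf_counter() - start
            self.counts[record.levelname] += 1

    def report(self):
        total = sum(self.counts.values())
        return {
            "total_records": total,
            "by_level": dict(self.counts),
            "avg_seconds_per_record": self.total_seconds / total if total else 0.0,
        }


timed = TimedHandler(logging.NullHandler())
demo = logging.getLogger("monitor_demo")
demo.setLevel(logging.DEBUG)
demo.propagate = False
demo.addHandler(timed)

demo.info("one")
demo.info("two")
demo.error("three")

stats = timed.report()
print(stats)
```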

Log Sampling

from crawlo.logging import get_sampler

# Set the sample rate (record only 30% of log messages)
sampler = get_sampler()
sampler.set_sample_rate('my_module', 0.3)

# Set a rate limit (at most 100 log messages per second)
sampler.set_rate_limit('my_module', 100)
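
How sampling and rate limiting interact is easier to see in code. The sketch below implements both as a standard-library `logging.Filter`. It illustrates the idea under stated assumptions and is not Crawlo's sampler: `SamplingFilter`, `ListHandler`, and the fixed one-second window are all hypothetical choices.

```python
import logging
import random
import time


class SamplingFilter(logging.Filter):
    """Keeps a random fraction of records, then caps records per second."""

    def __init__(self, sample_rate=1.0, rate_limit=None, rng=None,
                 clock=time.monotonic):
        super().__init__()
        self.sample_rate = sample_rate
        self.rate_limit = rate_limit
        self.rng = rng or random.Random()
        self.clock = clock
        self.window_start = clock()
        self.window_count = 0

    def filter(self, record):
        # Sampling: drop the record with probability 1 - sample_rate.
        if self.rng.random() >= self.sample_rate:
            return False
        if self.rate_limit is not None:
            now = self.clock()
            if now - self.window_start >= 1.0:
                # A new one-second window begins; reset the counter.
                self.window_start, self.window_count = now, 0
            if self.window_count >= self.rate_limit:
                return False
            self.window_count += 1
        return True


class ListHandler(logging.Handler):
    """Collects formatted messages in memory for inspection."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


sampled = ListHandler()
demo = logging.getLogger("sampling_demo")
demo.setLevel(logging.INFO)
demo.propagate = False
demo.addHandler(sampled)
demo.addFilter(SamplingFilter(sample_rate=0.3, rng=random.Random(0)))
for i in range(1000):
    demo.info("event %d", i)
print(len(sampled.messages))  # roughly 30% of 1000

# Rate limit alone, with a frozen clock so every call falls in one window.
limited = SamplingFilter(rate_limit=100, clock=lambda: 0.0)
record = logging.LogRecord("demo", logging.INFO, "demo.py", 0, "m", None, None)
passed = sum(limited.filter(record) for _ in range(250))
print(passed)
```

Sampling reduces volume proportionally, while the rate limit puts a hard ceiling on bursts; the two can be combined on the same logger.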

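Windows Compatibility Notes

Log rotation on Windows can run into file-locking problems. To avoid them, install the concurrent-log-handler library:

pip install concurrent-log-handler

Crawlo automatically detects this library and uses it to provide better Windows compatibility.

If concurrent-log-handler is not installed, the following error may occur when running on Windows:

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process.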
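
The detect-and-fall-back behaviour described above can be sketched as an optional import. This is an assumption about the general approach, not Crawlo's actual code: `RotatingHandler` and `make_file_handler` are hypothetical names, while `ConcurrentRotatingFileHandler` is the handler class provided by the concurrent-log-handler package.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler

try:
    # Provided by `pip install concurrent-log-handler`; it coordinates
    # rotation through a lock file.
    from concurrent_log_handler import (
        ConcurrentRotatingFileHandler as RotatingHandler,
    )
except ImportError:
    # Fall back to the standard-library handler when the package is absent.
    RotatingHandler = RotatingFileHandler


def make_file_handler(path, max_bytes=10 * 1024 * 1024, backup_count=5):
    """Hypothetical helper: build whichever rotating handler is available."""
    return RotatingHandler(
        path, maxBytes=max_bytes, backupCount=backup_count, encoding="utf-8"
    )


handler = make_file_handler(os.path.join(tempfile.mkdtemp(), "app.log"))
print(type(handler).__name__)
handler.close()
```

Both classes accept `maxBytes` and `backupCount`, so the calling code does not need to know which one was selected.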

Documentation

See the documentation for more information.

License

MIT

Download files

Source Distribution

crawlo-1.4.4.tar.gz (395.6 kB)

Built Distribution

crawlo-1.4.4-py3-none-any.whl (596.1 kB)
