scrapy_huo_utilities is a collection of Scrapy utility code written for the author's own use.

Project description

scrapy_huo_utilities

Introduction

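The package contains a random User-Agent middleware, a leveldb-based DuplicateUrls middleware, and several functions for cleaning HTML. These are a few commonly used features that the author wrote for personal use.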

Installation

pip install scrapy_huo_utilities

Usage

  1. RandomUserAgent_DOWNLOADER_MIDDLEWARES:

    In settings.py, add the following entry to DOWNLOADER_MIDDLEWARES: "scrapy_huo_utilities.MIDDLEWARES.Downloader_Middleware_Utils.RandomUserAgent_DOWNLOADER_MIDDLEWARES": 543,

  2. DuplicateUrls_SPIDER_MIDDLEWARES:

    In settings.py, add the following entry to SPIDER_MIDDLEWARES: "scrapy_huo_utilities.MIDDLEWARES.Spider_Middleware_Utils.DuplicateUrls_SPIDER_MIDDLEWARES": 543,
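    As a minimal sketch, the two settings entries above could look like this in a project's settings.py. The dotted paths and the priority 543 are taken from the text above. The sketch assumes the project has no other middleware entries; in a real project, keep whatever entries these dictionaries already contain.

```python
# settings.py (fragment): enable both middlewares from scrapy_huo_utilities.
# Assumption: no other middlewares are configured; merge with existing entries otherwise.

DOWNLOADER_MIDDLEWARES = {
    # Random User-Agent downloader middleware (path as given in the usage notes).
    "scrapy_huo_utilities.MIDDLEWARES.Downloader_Middleware_Utils.RandomUserAgent_DOWNLOADER_MIDDLEWARES": 543,
}

SPIDER_MIDDLEWARES = {
    # leveldb-based duplicate-URL spider middleware (path as given in the usage notes).
    "scrapy_huo_utilities.MIDDLEWARES.Spider_Middleware_Utils.DuplicateUrls_SPIDER_MIDDLEWARES": 543,
}
```

    In Scrapy, the integer is the middleware's order value: it determines where the middleware sits relative to the others in the chain.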

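  3. html_clean: from scrapy_huo_utilities.PROCESSORS import clean_html_attributes, clean_html_tags, clean_empty_img_tags

    clean_empty_img_tags: removes img tags that have no src attribute or whose src attribute is an empty string.

    clean_empty_a_tags: removes a tags that have no href attribute or whose href attribute is an empty string.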

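    clean_html_tags: removes the HTML tags named in a given list.

    :param html_data: the input HTML
    :param remove_tags: list of tags to remove
    :param reserve_content: whether to keep the text content of the removed tags; kept by default.
    :return: the processed HTML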
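    To illustrate what the reserve_content flag means, here is a minimal standard-library sketch of tag removal. It is not the package's implementation: the name strip_tags_sketch and the helper class are hypothetical, and only the parameter names mirror the description above. The sketch also drops HTML comments, which the real function may or may not do.

```python
from html.parser import HTMLParser

# Elements that have no closing tag, so they never open a "skipped" region.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class _TagStripper(HTMLParser):
    """Hypothetical helper: re-emits HTML while dropping the listed tags."""

    def __init__(self, remove_tags, reserve_content):
        super().__init__(convert_charrefs=False)
        self.remove_tags = {t.lower() for t in remove_tags}
        self.reserve_content = reserve_content
        self.out = []
        self.skip_depth = 0  # > 0 while inside a removed tag whose content is dropped

    def handle_starttag(self, tag, attrs):
        if tag in self.remove_tags:
            if not self.reserve_content and tag not in VOID_TAGS:
                self.skip_depth += 1
            return
        if self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>.
        if tag not in self.remove_tags and self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.remove_tags:
            if not self.reserve_content and tag not in VOID_TAGS and self.skip_depth > 0:
                self.skip_depth -= 1
            return
        if self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&#{name};")


def strip_tags_sketch(html_data, remove_tags, reserve_content=True):
    """Remove the listed tags; keep their inner content unless reserve_content is False."""
    parser = _TagStripper(remove_tags, reserve_content)
    parser.feed(html_data)
    parser.close()
    return "".join(parser.out)


kept = strip_tags_sketch('<p>Hi <span class="x">there</span></p>', ["span"])
dropped = strip_tags_sketch("<p>Hi <span>there</span></p>", ["span"], reserve_content=False)
```

    With reserve_content left at its default, only the span markup disappears and the word inside it stays; with reserve_content=False, the text inside the span is removed as well.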
    

    clean_html_attributes: removes all attributes from HTML tags except those listed in the whitelist.

    :param html_data: the input HTML
    :param whitelist: the whitelist of attributes to keep
    :return: the processed HTML
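    The whitelist behaviour can be sketched with the standard library as follows. This is an illustration only, not the package's code: filter_attributes_sketch and its helper class are hypothetical names, and the sketch assumes the whitelist is a list of attribute names compared case-insensitively.

```python
from html import escape
from html.parser import HTMLParser


class _AttributeFilter(HTMLParser):
    """Hypothetical helper: re-emits HTML, keeping only whitelisted attributes."""

    def __init__(self, whitelist):
        super().__init__(convert_charrefs=False)
        self.whitelist = {name.lower() for name in whitelist}
        self.out = []

    def _render(self, tag, attrs, closing=""):
        parts = [tag]
        for name, value in attrs:
            if name not in self.whitelist:
                continue  # attribute is not whitelisted: drop it
            if value is None:
                parts.append(name)  # boolean attribute such as "disabled"
            else:
                # The parser unescapes attribute values, so escape them again on output.
                parts.append(f'{name}="{escape(value, quote=True)}"')
        return "<" + " ".join(parts) + closing + ">"

    def handle_starttag(self, tag, attrs):
        self.out.append(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._render(tag, attrs, closing=" /"))

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def filter_attributes_sketch(html_data, whitelist):
    """Return html_data with every attribute removed except those in whitelist."""
    parser = _AttributeFilter(whitelist)
    parser.feed(html_data)
    parser.close()
    return "".join(parser.out)


cleaned = filter_attributes_sketch(
    '<a href="/x" style="color:red" onclick="go()">link</a>', ["href"]
)
```

    Here the style and onclick attributes are dropped and only href survives, which is the typical reason for whitelisting attributes when normalising scraped HTML.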
    

Download files

Source Distribution

scrapy_huo_utilities-0.0.4.tar.gz (4.7 kB)

Uploaded Source

Built Distribution

scrapy_huo_utilities-0.0.4-py3-none-any.whl (7.2 kB)

Uploaded Python 3
