scrapy_huo_utilities is a collection of Scrapy utility code written for the author's own use.
scrapy_huo_utilities
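Introduction
The package contains a random User-Agent downloader middleware, a LevelDB-based DuplicateUrls spider middleware, and several HTML-cleaning functions. These are a few commonly used features kept for personal use.
Installation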
pip install scrapy_huo_utilities
Usage
- RandomUserAgent_DOWNLOADER_MIDDLEWARES: in settings.py, add the following entry to DOWNLOADER_MIDDLEWARES: "scrapy_huo_utilities.MIDDLEWARES.Downloader_Middleware_Utils.RandomUserAgent_DOWNLOADER_MIDDLEWARES": 543,
- DuplicateUrls_SPIDER_MIDDLEWARES: in settings.py, add the following entry to SPIDER_MIDDLEWARES: "scrapy_huo_utilities.MIDDLEWARES.Spider_Middleware_Utils.DuplicateUrls_SPIDER_MIDDLEWARES": 543,
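- html_clean:
  from scrapy_huo_utilities.PROCESSORS import clean_html_attributes, clean_html_tags, clean_empty_img_tags
  clean_empty_img_tags: removes img tags that have no src attribute or whose src attribute is an empty string.
  clean_empty_a_tags: removes a tags that have no href attribute or whose href attribute is an empty string.
  clean_html_tags: removes the HTML tags named in a given list.
    :param html_data: the input HTML
    :param remove_tags: list of tags to remove
    :param reserve_content: whether to keep the text content of the removed tags; kept by default
    :return: the processed HTML
  clean_html_attributes: removes every attribute from HTML tags except those on the whitelist.
    :param html_data: the input HTML
    :param whitelist: the attributes to keep
    :return: the processed HTML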
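The two middleware entries described above can be sketched as a settings.py fragment. The dotted paths and the priority 543 are taken from the usage notes. This sketch assumes the project defines no other middlewares; in an existing project the entries would be merged into the dictionaries already present.

```python
# settings.py (fragment)
# Assumption: no other middlewares are configured in this project.

DOWNLOADER_MIDDLEWARES = {
    # Random User-Agent downloader middleware from scrapy_huo_utilities
    "scrapy_huo_utilities.MIDDLEWARES.Downloader_Middleware_Utils.RandomUserAgent_DOWNLOADER_MIDDLEWARES": 543,
}

SPIDER_MIDDLEWARES = {
    # LevelDB-based duplicate-URL spider middleware from scrapy_huo_utilities
    "scrapy_huo_utilities.MIDDLEWARES.Spider_Middleware_Utils.DuplicateUrls_SPIDER_MIDDLEWARES": 543,
}
```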
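To illustrate the kind of whitelist-based attribute cleaning that clean_html_attributes is described as doing, here is a minimal standard-library sketch. It is not the package's implementation: the function name strip_attributes and the helper class are hypothetical, and only the parameter names html_data and whitelist come from the description above.

```python
from html import escape
from html.parser import HTMLParser


class _AttributeFilter(HTMLParser):
    """Rebuilds the HTML it is fed, keeping only whitelisted attributes.

    Hypothetical helper for illustration; comments and doctype
    declarations are dropped because they are not re-emitted.
    """

    def __init__(self, whitelist):
        super().__init__(convert_charrefs=False)
        self.whitelist = set(whitelist)
        self.parts = []

    def _render_tag(self, tag, attrs, closing=""):
        # The parser unescapes attribute values, so escape them again.
        kept = [
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name in self.whitelist
        ]
        return f"<{tag}{''.join(kept)}{closing}>"

    def handle_starttag(self, tag, attrs):
        self.parts.append(self._render_tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self._render_tag(tag, attrs, closing=" /"))

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_attributes(html_data, whitelist):
    """Return html_data with every attribute outside whitelist removed."""
    parser = _AttributeFilter(whitelist)
    parser.feed(html_data)
    parser.close()
    return "".join(parser.parts)


sample = '<p class="x" style="color:red"><a href="/a" onclick="f()">link</a></p>'
cleaned = strip_attributes(sample, ["href"])
print(cleaned)
```

With a whitelist containing only href, the class, style, and onclick attributes are dropped while the link target survives.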
File details
Details for the file scrapy_huo_utilities-0.0.4.tar.gz.
File metadata
- Download URL: scrapy_huo_utilities-0.0.4.tar.gz
- Upload date:
- Size: 4.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.10.9 Windows/10
File hashes
Algorithm | Hash digest
---|---
SHA256 | 4415be9981067217e072d4ad708fdbfa44a1126ef8e7c2e822c9bd2d86876907
MD5 | db6cbed784de667b066bd0c95744f948
BLAKE2b-256 | dfbc2a1c8b782b0d5b008948e646e7c356fd515c296c53fe85213cc1d6326345
File details
Details for the file scrapy_huo_utilities-0.0.4-py3-none-any.whl.
File metadata
- Download URL: scrapy_huo_utilities-0.0.4-py3-none-any.whl
- Upload date:
- Size: 7.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.10.9 Windows/10
File hashes
Algorithm | Hash digest
---|---
SHA256 | 1e4ebd12c853844fce9bec1e190e2db9b9e17b7ea1f80b544b85c74bf7f31ba3
MD5 | b07e2807c54d4ed8ba1f1de450ba92df
BLAKE2b-256 | cfee0e837eba9ed9f69c7d6fcca798e7607021dcbe73efa3f9d54905293a8588