A collection of Crawlers
crawlers
Introduction
A collection of crawlers
What can be fetched
- Model files hosted on Hugging Face
Examples
1. Hugging Face
from pycrawlers import huggingface
urls = ['https://huggingface.co/albert-base-v2/tree/main',
        'https://huggingface.co/dmis-lab/biosyn-sapbert-bc5cdr-disease/tree/main']
paths = ['./model_1/albert-base-v2', './model_2/']
# Instantiate the class
# Use the default base_url (https://huggingface.co)
hg = huggingface()
# Use a custom base_url
# hg = huggingface('https://huggingface.co')
# 1. Fetch a single model
# 1.1 Use the default save location ('./')
hg.get_data(urls[0])
# 1.2 Use a custom save path
# hg.get_data(urls[0], paths[0])
# 2. Fetch several models in a batch
# 2.1 Use the default save location ('./')
hg.get_batch_data(urls)
# 2.2 Use custom save paths
# hg.get_batch_data(urls, paths)
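When many models are fetched in one batch, writing the `paths` list by hand becomes tedious. The sketch below derives one save directory per URL from the repository name. `save_path_for` and the `./models` root are hypothetical names introduced here for illustration and are not part of pycrawlers; the sketch also assumes that `get_batch_data` pairs each URL with the path at the same position, as the example above suggests.

```python
from urllib.parse import urlparse


def save_path_for(url: str, root: str = "./models") -> str:
    """Derive a local directory from a Hugging Face '.../tree/<branch>' URL.

    Hypothetical helper, not part of pycrawlers.
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Keep only the segments before 'tree', i.e. the repository name
    # (with its organisation prefix when there is one).
    repo_parts = parts[: parts.index("tree")] if "tree" in parts else parts
    return "/".join([root.rstrip("/")] + repo_parts)


urls = [
    "https://huggingface.co/albert-base-v2/tree/main",
    "https://huggingface.co/dmis-lab/biosyn-sapbert-bc5cdr-disease/tree/main",
]
# One save path per URL, in the same order as `urls`.
paths = [save_path_for(u) for u in urls]
print(paths)
```

The resulting list could then be passed as `hg.get_batch_data(urls, paths)`.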
2. General-purpose web page crawling
Suitable for websites that do not have strong anti-crawling measures.
- Basic usage
from pycrawlers import website
mongo_host = ''
mongo_port = '27017'
db_name = 'huxiu'
id_collection_name = 'huxiu_id'
collection_name = 'huxiu'
base_url = 'https://www.huxiu.com'
website(mongo_host, mongo_port, db_name, id_collection_name, collection_name, base_url)
- Advanced usage
- A URL filter can be used to skip pages you do not want to crawl, such as videos and images.
from pycrawlers import website
from pycrawlers.Web_Site.url_filters import filter_video_photo
mongo_host = ''
mongo_port = '27017'
db_name = 'huxiu'
id_collection_name = 'huxiu_id'
collection_name = 'huxiu'
base_url = 'https://www.huxiu.com'
# You can also define your own url_filter, for example:
photo_type = ['bmp', 'jpg', 'png', 'tif', 'gif', 'pcx', 'tga', 'exif', 'fpx', 'svg', 'psd',
              'cdr', 'pcd', 'dxf', 'ufo', 'eps', 'ai', 'raw', 'WMF', 'webp', 'avif', 'apng']
video_type = ['wmv', 'asf', 'asx', 'rm', 'rmvb', 'mp4', '3gp', 'mov', 'm4v', 'avi',
              'dat', 'mkv', 'flv', 'vob', 'mpeg']
def filter_video_photo(url: str):
    all_types = photo_type + video_type
    for i in all_types:
        if url.endswith('.' + i):
            return False
    return True
website(mongo_host, mongo_port, db_name, id_collection_name, collection_name, base_url, url_filter=filter_video_photo)
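A custom filter can follow the same contract as `filter_video_photo` above: it receives a URL string and returns `True` when the page should be crawled. The sketch below is one possible variant that compares only the extension of the URL path, so query strings and upper-case extensions do not slip past the check. `filter_by_path_extension` and `SKIPPED_EXTENSIONS` are hypothetical names chosen for this example, and the extension set is only a sample.

```python
from urllib.parse import urlparse

# Sample set of extensions to skip; extend it as needed.
SKIPPED_EXTENSIONS = {"jpg", "png", "gif", "webp", "mp4", "mkv"}


def filter_by_path_extension(url: str) -> bool:
    """Return False for URLs whose path ends in a skipped extension."""
    # Ignore the query string and fragment, and compare in lower case.
    path = urlparse(url).path.lower()
    last_segment = path.rsplit("/", 1)[-1]
    if "." not in last_segment:
        # No extension at all, e.g. a directory-style URL.
        return True
    extension = last_segment.rsplit(".", 1)[-1]
    return extension not in SKIPPED_EXTENSIONS


print(filter_by_path_extension("https://www.huxiu.com/img/a.JPG?w=200"))
print(filter_by_path_extension("https://www.huxiu.com/article/1.html"))
```

Assuming the `url_filter` parameter behaves as in the example above, this function would be passed in the same way, as `url_filter=filter_by_path_extension`.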