
A collection of Crawlers

Project description

crawlers

Introduction

A collection of crawlers.

What can be crawled

  1. Model files hosted on Hugging Face
  2. Websites without strong anti-crawling measures

Examples

1. Hugging Face

from pycrawlers import huggingface

urls = ['https://huggingface.co/albert-base-v2/tree/main',
        'https://huggingface.co/dmis-lab/biosyn-sapbert-bc5cdr-disease/tree/main']

paths = ['./model_1/albert-base-v2', './model_2/']
# Instantiate the class
# Use the default base_url (https://huggingface.co)
hg = huggingface()
# Or pass a custom base_url
# hg = huggingface('https://huggingface.co')
# Provide a Hugging Face token to download models or datasets that require authorization
# token = "xxxxxyyyyywwwwwccccc"
# hg = huggingface(token=token)

# 1. Fetch a single repository
# 1.1 Use the default save location ('./')
hg.get_data(urls[0])

# 1.2 Use a custom save path
# hg.get_data(urls[0], paths[0])

# 2. Batch fetch
# 2.1 Use the default save location ('./')
hg.get_batch_data(urls)

# 2.2 Use custom save paths
# hg.get_batch_data(urls, paths)
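
Once downloaded, the files can be loaded like any local checkpoint. A minimal sketch, assuming the transformers library is installed and the albert-base-v2 repository above was saved to ./model_1/albert-base-v2:

from transformers import AutoModel, AutoTokenizer

# Load from the locally downloaded files instead of fetching from the Hub
tokenizer = AutoTokenizer.from_pretrained('./model_1/albert-base-v2')
model = AutoModel.from_pretrained('./model_1/albert-base-v2')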

2. Generic website crawling

This can crawl websites without strong anti-crawling protections. Note: MongoDB must be installed, as the crawled data is stored in MongoDB.

2.1 Basic usage

from pycrawlers import crawl_website

mongo_host = ''  # MongoDB host, e.g. 'localhost'
mongo_port = '27017'
db_name = 'huxiu'
id_collection_name = 'huxiu_id'
collection_name = 'huxiu'
base_url = 'https://www.huxiu.com'


crawl_website(mongo_host, mongo_port, db_name, id_collection_name, collection_name, base_url)
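
To check that pages were actually stored, you can query the collection directly. A minimal sketch, assuming pymongo is installed and MongoDB is running on localhost:27017 with the database and collection names used above:

from pymongo import MongoClient

# Connect with the same host/port passed to crawl_website
client = MongoClient('localhost', 27017)
collection = client['huxiu']['huxiu']

# Count stored pages and inspect one document
print(collection.count_documents({}))
print(collection.find_one())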

2.2 Advanced usage

A url_filter can be used to skip pages you do not want to crawl, such as videos and images; you can also write your own filter, as shown in the sketch after the code below.

from pycrawlers import crawl_website
from pycrawlers.websites.url_filters import filter_video_photo

mongo_host = ''
mongo_port = '27017'
db_name = 'huxiu'
id_collection_name = 'huxiu_id'
collection_name = 'huxiu'
base_url = 'https://www.huxiu.com'

"""
url_filter 也可以自己定义
"""
# photo_type = ['bmp', 'jpg', 'png', 'tif', 'gif', 'pcx', 'tga', 'exif', 'fpx', 'svg', 'psd',
#               'cdr', 'pcd', 'dxf', 'ufo', 'eps', 'ai', 'raw', 'WMF', 'webp', 'avif', 'apng']
# 
# video_type = ['wmv', 'asf', 'asx', 'rm', 'rmvb', 'mp4', '3gp', 'mov', 'm4v', 'avi',
#               'dat', 'mkv', 'flv', 'vob', 'mpeg']
# 
# 
# def filter_video_photo(url: str):
#     all_types = photo_type + video_type
#     for i in all_types:
#         if url.endswith('.' + i):
#             return False
#     return True
  
crawl_website(mongo_host, mongo_port, db_name, id_collection_name, collection_name, base_url, url_filter=filter_video_photo)
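
A custom url_filter is just a function that takes a URL and returns False for pages that should be skipped. A minimal sketch of a hypothetical filter_documents that also skips common document downloads, following the same shape as filter_video_photo above:

def filter_documents(url: str) -> bool:
    # Return False to skip direct file downloads
    skipped_types = ['pdf', 'zip', 'rar', '7z', 'doc', 'docx', 'xls', 'xlsx']
    for ext in skipped_types:
        if url.lower().endswith('.' + ext):
            return False
    return True


crawl_website(mongo_host, mongo_port, db_name, id_collection_name,
              collection_name, base_url, url_filter=filter_documents)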
