A very easy and tiny crawling framework with multithreading support.

Project description

Overview

tinyCrawl is a tiny crawling framework with the following features:

  • Simple and lightweight, with no third-party dependencies
  • Checkpoint-based resumable crawling
  • Multithreaded crawling
  • Built-in logging
  • Easy to use

Documentation

Visit the official documentation

Installation

pip install tinyCrawl

How to use

tinyCrawl supports two ways of running:

  • By using a function: define a crawling function task(), instantiate BaseCrawl(iter_url, iter_num_range, thread_num), and call run() to execute it
  • By inheritance: subclass BaseCrawl(iter_url, iter_num_range, thread_num) and override crawl() and sink(); crawl() plays the same role as task() above and contains the code for crawling a single page, sink() writes out the results, and finally main() runs the whole program

By using a function

# -*- coding: utf-8 -*-

from tinyCrawl import BaseCrawl, RowContainer

from urllib.request import urlopen
from lxml import etree

# XPath expressions
song_name_xpath = '//div[@class="song-name"]/a/text()'
singer_xpath = '//div[@class="singers"]/a[1]/text()'
album_xpath = '//div[@class="album"]/a[1]/text()'

def task(url):
    """
    Crawl a single page.
    """
    # Containers that hold the data; each container's name becomes a key
    # in the final result dict self.out
    song_name_list = RowContainer("song name")
    singer_list = RowContainer("singer")
    album_list = RowContainer("album")

    page = urlopen(url).read().decode("utf-8", 'ignore')
    parse = etree.HTML(page)
    for _song_name, _singer, _album in zip(parse.xpath(song_name_xpath),
                                           parse.xpath(singer_xpath),
                                           parse.xpath(album_xpath)):
        # Append the data to the corresponding containers
        song_name_list.append(str(_song_name))
        singer_list.append(str(_singer))
        album_list.append(str(_album))


# The first argument is the URL template; use %s as the placeholder for the
# iterated parameter (e.g. the page number)
# The second argument is the iteration range
# The third argument is the number of threads: greater than 1 runs multithreaded,
# 1 runs single-threaded
bc = BaseCrawl("http://example.com/?page=%s", range(1, 5), 3)
# Pass in the task function and start crawling
bc.run(task)
# After the run finishes, read the results from the out attribute
print(bc.out)
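
Since bc.out is a plain dict keyed by the RowContainer names, it can, for example, be loaded straight into pandas once bc.run(task) has finished. A minimal sketch (pandas and the output filename are illustrative, not part of tinyCrawl):

import pandas as pd

# bc.out maps the container names ("song name", "singer", "album") to lists of values
df = pd.DataFrame(bc.out)
df.to_csv("songs.csv", index=False)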

By inheritance

from tinyCrawl import BaseCrawl, RowContainer
from urllib.request import urlopen
from lxml import etree
import pandas as pd

# XPath expressions (same as in the function-style example above)
song_name_xpath = '//div[@class="song-name"]/a/text()'
singer_xpath = '//div[@class="singers"]/a[1]/text()'
album_xpath = '//div[@class="album"]/a[1]/text()'

# Subclass BaseCrawl and override the crawl and sink methods
class Scratch(BaseCrawl):
    def __init__(self, iter_url, iter_num_range, thread_num):
        super().__init__(iter_url, iter_num_range, thread_num)

    # Override the crawl method
    def crawl(self, url):
        # Containers that hold the data; each container's name becomes a key
        # in the final result dict self.out
        song_name_list = RowContainer("song name")
        singer_list = RowContainer("singer")
        album_list = RowContainer("album")

        page = urlopen(url).read().decode("utf-8", 'ignore')
        parse = etree.HTML(page)
        for _song_name, _singer, _album in zip(parse.xpath(song_name_xpath),
                                               parse.xpath(singer_xpath),
                                               parse.xpath(album_xpath)):
            # Append the data to the corresponding containers
            song_name_list.append(str(_song_name))
            singer_list.append(str(_singer))
            album_list.append(str(_album))

    # Override the sink method to write out the crawled results
    def sink(self):
        # self.out is a dict, so it can be fed directly into pandas as a DataFrame
        recent_music = pd.DataFrame(self.out)
        recent_music.to_csv("D:/tmptest.csv", index=False)


if __name__ == '__main__':
    mc = Scratch("http://example.com/?page=%s", range(1, 5), 3)
    # Call main() to run the whole program
    mc.main()

Output:

2021-01-10 16:18:36,944 - base.py - __init__ - [line:30] - INFO: Checkpoint path: D:\breakpoint_page.txt
2021-01-10 16:18:38,539 - base.py - __source - [line:119] - INFO: Now is running on multithread mode, total thread num is `3`
2021-01-10 16:18:38,539 - base.py - __source - [line:126] - INFO: Total iteration num: 4
2021-01-10 16:18:38,541 - base.py - _multi_thread_wrap - [line:59] - INFO: ThreadPoolExecutor-1_0 now is processing: http://example.com/?page=1
2021-01-10 16:18:38,541 - base.py - _multi_thread_wrap - [line:59] - INFO: ThreadPoolExecutor-1_1 now is processing: http://example.com/?page=2
2021-01-10 16:18:38,542 - base.py - _multi_thread_wrap - [line:59] - INFO: ThreadPoolExecutor-1_2 now is processing: http://example.com/?page=3
2021-01-10 16:18:41,544 - base.py - __task_done - [line:115] - INFO: ThreadPoolExecutor-1_1 task finished; (Time took: 3.0009s)
2021-01-10 16:18:41,544 - base.py - __task_done - [line:115] - INFO: ThreadPoolExecutor-1_0 task finished; (Time took: 3.0019s)
2021-01-10 16:18:41,544 - base.py - __task_done - [line:115] - INFO: ThreadPoolExecutor-1_2 task finished; (Time took: 3.0009s)
2021-01-10 16:18:41,545 - base.py - _multi_thread_wrap - [line:59] - INFO: ThreadPoolExecutor-1_1 now is processing: http://example.com/?page=4
2021-01-10 16:18:44,551 - base.py - __task_done - [line:115] - INFO: ThreadPoolExecutor-1_1 task finished; (Time took: 3.0022s)
2021-01-10 16:18:44,551 - base.py - __source - [line:151] - INFO: All done. (Time took: 6.0102s)
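
Because sink() receives the finished self.out dict, it can write the results to any destination, not just CSV. A minimal sketch of an alternative sink() that dumps self.out to JSON (the subclass name and output path here are hypothetical):

import json

class ScratchJSON(Scratch):
    def sink(self):
        # self.out maps the container names to lists of crawled values
        with open("recent_music.json", "w", encoding="utf-8") as f:
            json.dump(self.out, f, ensure_ascii=False, indent=2)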

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tinyCrawl-0.1.2.tar.gz (10.9 MB)

Uploaded Source

Built Distribution

tinyCrawl-0.1.2-py3-none-any.whl (14.9 kB)

Uploaded Python 3

File details

Details for the file tinyCrawl-0.1.2.tar.gz.

File metadata

  • Download URL: tinyCrawl-0.1.2.tar.gz
  • Upload date:
  • Size: 10.9 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.6.1 requests/2.25.1 setuptools/51.1.1 requests-toolbelt/0.9.1 tqdm/4.55.2 CPython/3.7.3

File hashes

Hashes for tinyCrawl-0.1.2.tar.gz
  • SHA256: fa9bafb0920fc429ff473fa0fd6fe7a6fdfd7562c84723d5d5ebaade9c1f8634
  • MD5: e90de4d5ad9e549f3c1c26df8205459c
  • BLAKE2b-256: d515d695fba55a999a710c6a527d70911da3bb4411cd8e5d17e02a9734634afc

See more details on using hashes here.
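
If you download the archive manually, the digest can also be checked locally. A quick sketch using Python's standard hashlib (assuming the file was saved under the name shown above):

import hashlib

# Compute the SHA256 of the downloaded archive and compare it with the value listed above
with open("tinyCrawl-0.1.2.tar.gz", "rb") as f:
    print(hashlib.sha256(f.read()).hexdigest())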

File details

Details for the file tinyCrawl-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: tinyCrawl-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 14.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.6.1 requests/2.25.1 setuptools/51.1.1 requests-toolbelt/0.9.1 tqdm/4.55.2 CPython/3.7.3

File hashes

Hashes for tinyCrawl-0.1.2-py3-none-any.whl
  • SHA256: 894fee57ac9c5509c199f8a3505553ad744d876eba0b39fa963fe41b84d2b8f5
  • MD5: ba5f3a663d6a9959f57ac5861d9c99fe
  • BLAKE2b-256: 3995be2ce841221dc2a55887101ab6abc6af6ed64adc5f0382d74f53cab3643a

See more details on using hashes here.
