my simple crawler
Install
pip3 install simple-crawler
Set the environment variable AUTO_CHARSET=1 to pass the raw response bytes to beautifulsoup4 and let it detect the charset.
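As a rough illustration of what the flag changes, the sketch below chooses between raw bytes and already-decoded text before parsing. The helper name choose_parser_input is hypothetical and not part of simple-crawler. Only the variable name AUTO_CHARSET comes from the text above, and how the library actually reads it is an assumption.

```python
import os


def choose_parser_input(content: bytes, text: str, environ=os.environ):
    """Return what would be handed to the HTML parser.

    Hypothetical helper: with AUTO_CHARSET=1 the raw bytes are passed
    through so the parser can detect the charset itself; otherwise the
    already-decoded text is used.
    """
    if environ.get("AUTO_CHARSET") == "1":
        return content
    return text


raw = b"caf\xe9"  # "cafe" with an accented e, encoded as latin-1
print(type(choose_parser_input(raw, "caf\u00e9", {"AUTO_CHARSET": "1"})).__name__)
print(type(choose_parser_input(raw, "caf\u00e9", {})).__name__)
```

The first call yields bytes and the second yields str, which is the distinction the flag controls.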
Classes
- URL: defines a URL
- URLExt: class to handle a URL
- Page: defines the request result of a URL
  - url: of type URL
  - content, text, json: response content properties from the requests library
  - type: the response body type, an enum that allows BYTES, TEXT, HTML, JSON
  - is_html: checks whether the response is HTML according to the Content-Type response header
  - soup: a BeautifulSoup object containing the HTML if is_html
- Crawler: schedules the crawl by calling handler_page() recursively
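The relationship between type and is_html can be sketched as follows. This is a minimal stand-alone illustration, not the library's code: the names BodyType and body_type_from_headers are hypothetical, and the mapping from Content-Type values to the four body types is an assumption.

```python
from enum import Enum


class BodyType(Enum):
    # The four body kinds named in the class list above.
    BYTES = "bytes"
    TEXT = "text"
    HTML = "html"
    JSON = "json"


def body_type_from_headers(headers: dict) -> BodyType:
    """Guess the body type from a Content-Type header (assumed mapping)."""
    # Header names are case-insensitive, so normalize the keys first.
    lowered = {k.lower(): v for k, v in headers.items()}
    # Drop parameters such as "; charset=utf-8".
    mime = lowered.get("content-type", "").split(";")[0].strip().lower()
    if mime in ("text/html", "application/xhtml+xml"):
        return BodyType.HTML
    if mime == "application/json" or mime.endswith("+json"):
        return BodyType.JSON
    if mime.startswith("text/"):
        return BodyType.TEXT
    return BodyType.BYTES


def is_html(headers: dict) -> bool:
    """True when the headers describe an HTML body."""
    return body_type_from_headers(headers) is BodyType.HTML
```

Under this sketch, a page would only build its soup when is_html returns True, and anything without a recognizable Content-Type falls back to BYTES.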
Example
from simple_crawler import *


class MyCrawler(Crawler):
    name = 'output.txt'  # file that collects the extracted text

    async def custom_handle_page(self, page):
        print(page.url)
        tags = page.soup.select("#container")
        if tags:  # skip pages that have no #container element
            with open(self.name, 'a') as f:
                f.write(tags[0].text)
        # do some async call

    def filter_url(self, url: URL) -> bool:
        # only follow URLs under this prefix
        return url.url.startswith("https://xxx.com/xxx")


loop = get_event_loop(True)
c = MyCrawler("https://xxx.com/xxx", loop, concurrency=10)
schedule_future_in_loop(c.start(), loop=loop)
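A filter_url check can also compare the parsed host and path instead of the raw string, which tolerates differences such as an upper-case host. The helper below is a hypothetical sketch using only the standard library. It is not part of simple-crawler, and it assumes the plain URL string is available (as url.url is in the example).

```python
from urllib.parse import urlsplit


def same_site_prefix(url: str, host: str, path_prefix: str) -> bool:
    """Accept only http(s) URLs on `host` whose path starts with `path_prefix`."""
    parts = urlsplit(url)
    return (
        parts.scheme in ("http", "https")
        and parts.hostname == host  # hostname is lower-cased by urlsplit
        and parts.path.startswith(path_prefix)
    )


print(same_site_prefix("https://xxx.com/xxx/page", "xxx.com", "/xxx"))
print(same_site_prefix("https://other.example/xxx", "xxx.com", "/xxx"))
```

Inside a Crawler subclass, filter_url could return same_site_prefix(url.url, "xxx.com", "/xxx").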
TODO
- Speed up using async or threading
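One way the async option could look is a recursive crawl whose fetches are capped by an asyncio.Semaphore. The sketch below is an assumption-laden illustration rather than the library's implementation: FAKE_SITE, fetch_links, and crawl are hypothetical names, and the network is replaced by an in-memory dictionary so the code runs offline.

```python
import asyncio

# Hypothetical in-memory "site": page -> links found on it.
FAKE_SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": [],
    "/c": ["/"],
}


async def fetch_links(url: str) -> list:
    """Stand-in for a network request; yields control like real I/O would."""
    await asyncio.sleep(0)
    return FAKE_SITE.get(url, [])


async def crawl(start: str, concurrency: int = 10) -> set:
    """Visit every reachable page once, with at most `concurrency` fetches in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    seen = {start}

    async def handle_page(url: str) -> None:
        # Only the fetch itself holds a semaphore slot.
        async with semaphore:
            links = await fetch_links(url)
        # Mark links as seen before scheduling them, so no page is fetched twice.
        new_links = []
        for link in links:
            if link not in seen:
                seen.add(link)
                new_links.append(link)
        # Recurse into the newly discovered pages concurrently.
        await asyncio.gather(*(handle_page(link) for link in new_links))

    await handle_page(start)
    return seen


visited = asyncio.run(crawl("/", concurrency=2))
print(sorted(visited))
```

The semaphore is released before recursing, so a deep link chain cannot exhaust the slots and deadlock the crawl.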
File details
Details for the file simple_crawler-1.2-py3-none-any.whl.
File metadata
- Download URL: simple_crawler-1.2-py3-none-any.whl
- Upload date:
- Size: 4.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.21.0 setuptools/28.8.0 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.6.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f61142598cf572cdc323aa12c1bc6619ad033f21254e9986e7acefde07c035bc |
| MD5 | 0b3e4430c45c054d877cf6fc701a649c |
| BLAKE2b-256 | fd58c4e510ff908d0bd3ee3f457f53960e8059ed37b425a8458bd11e511366aa |