A simpler way to use a web spider
SimpleSpider Instructions
How to install
pip install SimpleSpider
This module makes it easier to write and run a web spider.
Using the command line
The command accepts 9 arguments:
| argument | type | default | description |
|---|---|---|---|
| url | str | None | The URL to fetch. |
| single | bool | True | Set it to False to fetch content from a series of pages; in that case also set the index. |
| re | str | None | The regular expression to apply. Remember to wrap it in double quotes, e.g. --re "ab*c" |
| xpath | str | None | The XPath expression to apply. Remember to wrap it in double quotes, e.g. --xpath "//div[1]/text()" |
| index | str | default | Comma-separated list of indexes, e.g. --index 1,2,3,4,5,6,7 |
| | bool | True | Set it to False if you do not want the result printed to the console. |
| output | str | None | Path to export the result to, e.g. --output "D:/data.xlsx" |
| mode | str | None | One of "img", "xp" or "re": collect image URLs, use XPath, or use a regular expression. |
| indexfile | str | None | Path to a file from which the links are read directly. |
Example 1: get data with a regular expression from a single page.
SimpleSpider --mode re --url https://www.163.com --re "<title>(.*.?)</title>"
output:
网易
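To make the regular-expression mode more concrete, here is a minimal sketch that applies the same pattern to an in-memory HTML string with Python's standard `re` module. The helper name `extract_by_regex` and the sample HTML are assumptions for illustration; this is not SimpleSpider's own implementation.

```python
import re

def extract_by_regex(html, pattern):
    """Return every capture-group match of pattern in html (hypothetical helper)."""
    # re.S lets "." also match newlines, which are common in real HTML
    return re.findall(pattern, html, flags=re.S)

# Sample page, made up for this sketch
sample_html = "<html><title>网易</title></html>"
print(extract_by_regex(sample_html, r"<title>(.*.?)</title>"))
```

Because the pattern has one capture group, `re.findall` returns only the text inside the parentheses, which is why the command prints the title rather than the whole tag.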
Example 2:
get data with XPath from a single page
SimpleSpider --mode xp --url https://www.163.com --xpath "//title/text()"
output:
网易
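As a rough illustration of what the expression `//title/text()` selects, the sketch below uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup. The helper name `title_texts` is hypothetical, and the parser SimpleSpider actually uses is not stated here.

```python
import xml.etree.ElementTree as ET

def title_texts(markup):
    """Return the text of every <title> element (hypothetical helper)."""
    root = ET.fromstring(markup)  # raises ParseError on malformed markup
    # ".//title" is ElementTree's spelling of "//title";
    # reading ".text" stands in for the "/text()" step
    return [node.text for node in root.iterfind(".//title")]

print(title_texts("<html><head><title>网易</title></head></html>"))
```

Real pages are rarely well-formed XML, so a tolerant HTML parser is usually the better choice outside of a small demonstration like this one.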
Example 3:
get data with a regular expression from multiple pages
SimpleSpider --mode re --url https://ent.163.com/20/0323/ --re "<title>(.*.?)</title>" --single False --index 08/F8D2BVI700038FO9.html,10/F8D8B35800038FO9.html
output:
'疫情期间还出游?网友在巴厘岛偶遇霍建华林心如_网易娱乐'
'台湾女星刘真去世:上《康熙》走红 当郭台铭红娘_网易娱乐'
Example 4:
get data with a regular expression from multiple pages, reading the links from a file
SimpleSpider --mode re --url https://ent.163.com/20/0323/ --re "<title>(.*.?)</title>" --single False --indexfile data.txt
The index file should contain one index per line, like this:
1.html
2.html
3.html
Each index is appended to the URL, so the requested address is http://example.com/test followed by the index.
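Reading such a file can be sketched as below. The helper name `urls_from_index_file`, the trailing slash on the base URL, and the skipping of blank lines are all assumptions made for illustration, not documented behaviour of the tool.

```python
import tempfile
from pathlib import Path

def urls_from_index_file(base_url, index_path):
    """Read one index per line and append each to base_url (hypothetical helper)."""
    lines = Path(index_path).read_text(encoding="utf-8").splitlines()
    # Trim whitespace and skip blank lines
    return [base_url + line.strip() for line in lines if line.strip()]

# Demo with a throwaway file so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    index_path = Path(tmp) / "data.txt"
    index_path.write_text("1.html\n2.html\n3.html\n", encoding="utf-8")
    for url in urls_from_index_file("http://example.com/test/", index_path):
        print(url)
```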
Example 5:
get the image URLs from a single page
SimpleSpider --mode img --url https://www.baidu.com
output:
//www.baidu.com/img/gs.gif
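The result above is a protocol-relative URL (it starts with `//`). The sketch below shows one way image addresses could be collected and then made absolute, using only the standard library. `ImgSrcCollector`, `image_urls` and `absolute_image_urls` are hypothetical names, and the HTML string is made up; how SimpleSpider itself finds images is not shown here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)

def image_urls(html):
    """Return the raw src values found in html (hypothetical helper)."""
    collector = ImgSrcCollector()
    collector.feed(html)
    return collector.sources

def absolute_image_urls(page_url, html):
    """Resolve each src against the page URL so it can be downloaded directly."""
    return [urljoin(page_url, src) for src in image_urls(html)]

page = '<body><img src="//www.baidu.com/img/gs.gif"><p>text</p></body>'
print(image_urls(page))
print(absolute_image_urls("https://www.baidu.com", page))
```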
If you want to use the functions in this module, you just need to:
from SimpleSpider import SimpleSpider
The module provides several functions that simplify your code.
Example 1:
result = SinglePageGetByRegEx(Url="http://www.163.com", Re="<title>(.*?.)</title>")
The value of result is ['网易']
Example 2:
List = [53,54,55,56]
result = MulityPageGetByRegEx(Url="http://www.oursteps.com.au/bbs/forum.php?mod=forumdisplay&fid=", IndexList=List,RegEx="<title>(.*?.)</title>")
the value of result is [['生活其他 - 新足迹 - 新足迹澳洲华人生活大全'], ['证券外汇 - 新足迹澳洲华人生活大全'], ['个人理财 - 新足迹澳洲华人生活大全'], ['生意种种 - 新足迹澳洲华人生活大全']]
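The nested list above has one inner list per index. The sketch below reproduces that shape with a stand-in `fetch` callable, so it runs without network access. `multi_page_regex`, `fetch`, and the fake pages are assumptions for illustration and do not reflect how `MulityPageGetByRegEx` is implemented.

```python
import re

def multi_page_regex(base_url, index_list, pattern, fetch):
    """Apply pattern to each page and keep one result list per page (hypothetical helper)."""
    results = []
    for index in index_list:
        html = fetch(f"{base_url}{index}")  # fetch is any callable: URL -> HTML text
        results.append(re.findall(pattern, html))
    return results

# Fake pages standing in for real HTTP responses
fake_pages = {
    "http://example.com/forum?fid=53": "<title>Page 53</title>",
    "http://example.com/forum?fid=54": "<title>Page 54</title>",
}

result = multi_page_regex(
    "http://example.com/forum?fid=",
    [53, 54],
    r"<title>(.*?.)</title>",
    fake_pages.__getitem__,
)
print(result)
```

Keeping one inner list per page means the position in the outer list still tells you which index a match came from, even when a page yields no match.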
Both XPath and regular expressions are available.
You can also directly extract the string between two markers in a page.
Example 3: the HTML page is
<html>
<title>网易</title>
</html>
result = SinglePageGetMiddleStr("http://www.163.com", front="<title>", back="</title>")
output:
['网易']
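Extracting the text between a front and a back marker can be sketched with a non-greedy regular expression. This is one possible approach under stated assumptions, not necessarily what `SinglePageGetMiddleStr` does internally; `middle_strings` is a hypothetical name.

```python
import re

def middle_strings(text, front, back):
    """Return every substring found between front and back (hypothetical helper)."""
    # re.escape keeps characters such as "<", "?" or "." in the markers literal
    pattern = re.escape(front) + r"(.*?)" + re.escape(back)
    # re.S lets the middle part span several lines
    return re.findall(pattern, text, flags=re.S)

page = "<html>\n<title>网易</title>\n</html>"
print(middle_strings(page, "<title>", "</title>"))
```

The non-greedy `(.*?)` stops at the first occurrence of the back marker, so several marker pairs on one page give several short results instead of one long one.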
You can also directly get the image URLs in a page.
result = SinglePageGetImgUrl("http://www.baidu.com")
output:
//www.baidu.com/img/gs.gif
If you want to know more, please visit: https://github.com/shanzhengliu/SimpleSpider
Download files
Source Distribution
File details
Details for the file SimpleSpider-0.1.3.tar.gz.
File metadata
- Download URL: SimpleSpider-0.1.3.tar.gz
- Upload date:
- Size: 6.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.2.0.post20200210 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.7.4
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 3c35fe8930e492d2ba822329f600a738532102caa51a719a8f4b2e7f526e9c9a |
MD5 | 8c7b394325408809b2bd562cb79ff85d |
BLAKE2b-256 | 1ec3330546d29857553aa614ce4c2c967c562a4bb0c431c66a1bf2d23c9847a2 |