Makes it easier for you to use a web spider

Project description

SimpleSpider Instruction

How to install

pip install SimpleSpider

This module makes it easier to use a web spider (crawler).

Command-line usage

The command accepts 9 arguments.

| Argument | Type | Default | Description |
|---|---|---|---|
| url | str | None | The URL to fetch. |
| single | bool | True | To fetch content from a series of pages, set this to False and set the index. |
| re | str | None | The regular expression to apply. Don't forget the quotes, e.g. --re "ab*c" |
| xpath | str | None | The XPath expression to apply. Don't forget the quotes, e.g. --xpath "//div[1]/text()" |
| index | str | default | The page indexes, separated by ",", e.g. --index 1,2,3,4,5,6,7 |
| print | bool | True | Set this to False if you don't want the result printed to the console. |
| output | str | None | The path to export the result to, e.g. --output "D:/data.xlsx" |
| mode | str | None | Use "img" to get image URLs, "xp" to use XPath, or "re" to use a regular expression. |
| indexfile | str | None | A file to read the page indexes (links) from directly. |
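To make the argument list concrete, here is a minimal sketch of how a command line with these nine options could be declared with Python's argparse. This is not the package's actual implementation: the `str2bool` helper, the `build_parser` function, and the `None` default for `--index` are assumptions made for illustration.

```python
import argparse


def str2bool(value: str) -> bool:
    """Parse 'True'/'False' text; plain type=bool would treat 'False' as True."""
    lowered = value.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the nine documented arguments."""
    parser = argparse.ArgumentParser(prog="SimpleSpider")
    parser.add_argument("--url", type=str, default=None)
    parser.add_argument("--single", type=str2bool, default=True)
    parser.add_argument("--re", type=str, default=None)
    parser.add_argument("--xpath", type=str, default=None)
    parser.add_argument("--index", type=str, default=None)  # default assumed
    parser.add_argument("--print", type=str2bool, default=True)
    parser.add_argument("--output", type=str, default=None)
    parser.add_argument("--mode", choices=("img", "xp", "re"), default=None)
    parser.add_argument("--indexfile", type=str, default=None)
    return parser


# Parse a sample command line instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--mode", "re", "--url", "https://www.163.com",
     "--single", "False", "--index", "1,2,3"]
)
print(args.single, args.index.split(","))
```

The explicit boolean converter matters because the documented usage passes the literal text `False` on the command line.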

Example 1: get the data with a regular expression from a single page

SimpleSpider --mode re --url https://www.163.com --re "<title>(.*.?)</title>"

output: 网易
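To see what the pattern in Example 1 captures, the same regular expression can be tried with Python's standard `re` module on a small inline page. The HTML string below is a stand-in for the downloaded page, not real output from the site.

```python
import re

# Stand-in for the page content that the spider would download.
html = "<html>\n<title>网易</title>\n</html>"

# Same pattern as in Example 1; the parentheses mark the captured group.
pattern = r"<title>(.*.?)</title>"

titles = re.findall(pattern, html)
print(titles)  # ['网易']
```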

Example 2: get the data with Xpath from a single page
SimpleSpider --mode xp --url https://www.163.com --xpath "//title/text()"

output:
网易

Example 3: get the data with Xpath from multiple pages
SimpleSpider --mode xp --url https://ent.163.com/20/0323/ --re "<title>(.*.?)</title>" --single False --index 08/F8D2BVI700038FO9.html,10/F8D8B35800038FO9.html

output:
'疫情期间还出游?网友在巴厘岛偶遇霍建华林心如_网易娱乐'
'台湾女星刘真去世:上《康熙》走红 当郭台铭红娘_网易娱乐'

Example 4: get the data with Xpath from multiple pages/links
SimpleSpider --mode xp --url https://ent.163.com/20/0323/ --re "<title>(.*.?)</title>" --single False --indexfile data.txt

The index file should be written like this:
1.html
2.html
3.html

Each URL is then http://example.com/test followed by the index.
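Examples 3 and 4 both combine one base URL with several indexes. The sketch below shows one way that combination could work, assuming each index is simply appended to the URL and that entries in the index file are separated by whitespace or newlines. The function names `urls_from_index` and `urls_from_indexfile` are hypothetical and are not part of the package.

```python
from pathlib import Path
from typing import List


def urls_from_index(base_url: str, index: str) -> List[str]:
    """Split a comma-separated --index value and append each part to the URL."""
    return [base_url + part.strip() for part in index.split(",") if part.strip()]


def urls_from_indexfile(base_url: str, path: str) -> List[str]:
    """Read whitespace-separated entries from a file and append each to the URL."""
    entries = Path(path).read_text(encoding="utf-8").split()
    return [base_url + entry for entry in entries]


urls = urls_from_index(
    "https://ent.163.com/20/0323/",
    "08/F8D2BVI700038FO9.html,10/F8D8B35800038FO9.html",
)
for url in urls:
    print(url)
```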

Example 5: get the image URLs from a single page
SimpleSpider --mode img --url https://www.baidu.com

output:
//www.baidu.com/img/gs.gif
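For readers curious how image URLs can be pulled out of a page, here is a minimal sketch using only the standard library's `html.parser`. It is one possible approach and may differ from what the "img" mode actually does; the class `ImgSrcCollector`, the function `image_urls`, and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser
from typing import List


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it encounters."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        for name, value in attrs:
            if name == "src" and value:
                self.sources.append(value)


def image_urls(html: str) -> List[str]:
    """Return the src values of all <img> tags in the given HTML text."""
    collector = ImgSrcCollector()
    collector.feed(html)
    return collector.sources


sample = '<body><img src="//www.baidu.com/img/gs.gif"><p>no image here</p></body>'
print(image_urls(sample))
```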

If you want to use the functions in this module, you just need to:

from SimpleSpider import SimpleSpider

Several functions are provided to simplify your code.
Example 1:

result = SinglePageGetByRegEx(Url="http://www.163.com",Re="<title>(.*?.)</title>")
The value of result is ['网易'].

Example 2:

List = [53,54,55,56]
result = MulityPageGetByRegEx(Url="http://www.oursteps.com.au/bbs/forum.php?mod=forumdisplay&fid=", IndexList=List,RegEx="<title>(.*?.)</title>")
The value of result is [['生活其他 - 新足迹 - 新足迹澳洲华人生活大全'], ['证券外汇 - 新足迹澳洲华人生活大全'], ['个人理财 - 新足迹澳洲华人生活大全'], ['生意种种 - 新足迹澳洲华人生活大全']]
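The nested list in Example 2 holds one inner list of matches per page. The sketch below reproduces that result shape without touching the network: `multi_page_regex` is a hypothetical helper, and the `fetch` argument and `fake_pages` dictionary are stand-ins for real downloads, so none of this should be read as the package's own code.

```python
import re
from typing import Callable, Iterable, List


def multi_page_regex(
    base_url: str,
    index_list: Iterable[object],
    pattern: str,
    fetch: Callable[[str], str],
) -> List[List[str]]:
    """Apply the pattern to each page and return one list of matches per page."""
    compiled = re.compile(pattern)
    return [compiled.findall(fetch(f"{base_url}{index}")) for index in index_list]


# Canned pages keyed by URL, used in place of HTTP requests.
fake_pages = {
    "http://example.com/forum?fid=53": "<title>Page 53</title>",
    "http://example.com/forum?fid=54": "<title>Page 54</title>",
}

results = multi_page_regex(
    "http://example.com/forum?fid=",
    [53, 54],
    r"<title>(.*?)</title>",
    fake_pages.__getitem__,
)
print(results)  # [['Page 53'], ['Page 54']]
```

Passing the fetch function in as a parameter keeps the matching logic easy to try offline.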

Both Xpath and regular expressions are available.

You can also directly get the string between two markers in a page.
Example 3: the html page is

<html>
<title>网易</title>
</html>  

result = SinglePageGetMiddleStr("http://www.163.com",front="<title>",back="</title>")
output:
['网易']
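The "middle string" idea can be sketched with plain string searching: find the front marker, then the next back marker, and keep what lies between them. The helper name `get_middle_strings` is hypothetical and operates on text that has already been downloaded; the package's own function may behave differently in edge cases.

```python
from typing import List


def get_middle_strings(text: str, front: str, back: str) -> List[str]:
    """Return every substring found between a front marker and the next back marker."""
    results: List[str] = []
    start = text.find(front)
    while start != -1:
        content_start = start + len(front)
        end = text.find(back, content_start)
        if end == -1:
            break  # front marker without a closing back marker
        results.append(text[content_start:end])
        start = text.find(front, end + len(back))
    return results


page = "<html>\n<title>网易</title>\n</html>"
print(get_middle_strings(page, front="<title>", back="</title>"))  # ['网易']
```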

You can also directly get the image URLs in a page.
result = SinglePageGetImgUrl("http://www.baidu.com")
output:
//www.baidu.com/img/gs.gif

If you want to know more, please visit: https://github.com/shanzhengliu/SimpleSpider

Source Distribution

SimpleSpider-0.1.2.tar.gz (6.1 kB)
