A Python scraper to extract and analyze data from search engine result pages and URLs. Extract data such as the URL, title, snippet, or rating of each result for given keywords.
Project description
A Python scraper to extract and analyze data from search engine result pages and URLs. It can be useful for SEO and research tasks.
Extract these result types
ads_main - advertisements within regular search results
image - result from image search
news - news teaser within regular search results
results - standard search result
shopping - shopping teaser within regular search results
For each result on a result page, get
domain
rank
rich snippet
site links
snippet
title
type
url
visible url
You can also get a screenshot of each result page and scrape the text content of each result URL. It is also possible to save the results as CSV for later analysis. If required, you can use your own proxy list.
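As a rough illustration of the CSV step, the sketch below writes a list of result dicts with the standard library `csv` module. It does not use SerpScrap's own CSV export; the field names in `FIELDS` and the sample rows are hypothetical and only mirror the list above, so the keys SerpScrap actually returns may differ.

```python
import csv
import io

# Hypothetical column names mirroring the field list above;
# the keys in real SerpScrap results may be named differently.
FIELDS = ["rank", "type", "domain", "url", "title", "snippet"]


def results_to_csv(results, fileobj):
    """Write result dicts to fileobj as CSV, ignoring keys not in FIELDS."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for row in results:
        # Keys missing from a row are written as empty cells.
        writer.writerow(row)


# Made-up sample data, not real scraper output.
sample = [
    {"rank": 1, "type": "results", "domain": "example.com",
     "url": "https://example.com/", "title": "Example", "snippet": "An example page"},
    {"rank": 2, "type": "news", "domain": "example.org",
     "url": "https://example.org/a", "title": "News"},
]

buffer = io.StringIO()
results_to_csv(sample, buffer)
csv_text = buffer.getvalue()
```

To write to disk instead of a buffer, open the target file with `open(path, "w", newline="", encoding="utf-8")` and pass it as `fileobj`.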
Resources
See http://serpscrap.readthedocs.io/en/latest/ for documentation.
Source is available at https://github.com/ecoron/SerpScrap
Install
The easy way to install:
pip uninstall SerpScrap -y
pip install SerpScrap --upgrade
More details in the install [1] section of the documentation.
Usage
SerpScrap in your applications
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import pprint
import serpscrap
keywords = ['example']
config = serpscrap.Config()
config.set('scrape_urls', False)
scrap = serpscrap.SerpScrap()
scrap.init(config=config.get(), keywords=keywords)
results = scrap.run()
for result in results:
    pprint.pprint(result)
More details in the examples [2] section of the documentation.
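Once the results are available as a list of dicts, they can be post-processed with plain Python. The following is a minimal sketch, assuming each result carries a `type` and a `url` key; both key names and the sample data are hypothetical and should be adapted to the keys your SerpScrap version returns.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_type(results, type_key="type"):
    """Group result dicts by their result type (e.g. results, news, shopping)."""
    grouped = defaultdict(list)
    for result in results:
        grouped[result.get(type_key, "unknown")].append(result)
    return dict(grouped)


def domain_of(url):
    """Return the host part of a URL, e.g. 'example.com'."""
    return urlparse(url).netloc


# Made-up sample data standing in for scraper output.
sample = [
    {"type": "results", "url": "https://example.com/page"},
    {"type": "news", "url": "https://news.example.org/story"},
    {"type": "results", "url": "https://example.net/"},
]

grouped = group_by_type(sample)
domains = [domain_of(r["url"]) for r in grouped["results"]]
```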
To avoid encode/decode issues, run these commands in your Windows command line before you start using SerpScrap.
chcp 65001
set PYTHONIOENCODING=utf-8
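To check from inside Python whether the setting took effect, a small helper like the one below can inspect the encoding of `sys.stdout`. This is only a sketch using the standard library; `is_utf8` is a hypothetical helper name and not part of SerpScrap.

```python
import codecs
import sys


def is_utf8(stream):
    """Return True if the text stream's encoding is an alias of UTF-8."""
    encoding = getattr(stream, "encoding", None)
    if not encoding:
        return False
    try:
        # codecs.lookup normalizes aliases such as 'UTF8' to 'utf-8'.
        return codecs.lookup(encoding).name == "utf-8"
    except LookupError:
        return False


if not is_utf8(sys.stdout):
    print("stdout is not UTF-8; set PYTHONIOENCODING=utf-8 before starting Python")
```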
Changes
Notes about major changes between releases
0.9.0
result types added (news, shopping, image)
Image search is supported
0.8.0
text processing tools removed
fewer requirements
References
SerpScrap uses PhantomJS [3], a scriptable headless WebKit, which is installed automatically on the first run (Linux, Windows). The scrapcore is based on GoogleScraper [4] with several improvements.
Download files
Hashes for SerpScrap-0.9.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | b61fae90bf431bd3a921a90b8a4503d39c1b64cacfb175df8a4b5dfb7a07f692
MD5 | 068c227b5b4bfbafc8cf80354227218b
BLAKE2b-256 | b76c906f4a1afc76cf95b33c735cd1e12d4ffa4d7fcd6d6e2c706708493affec
Hashes for SerpScrap-0.9.0-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | b701281a66f3175a4b3c1b330d0e5c01532bceb3a5d921e26353a842f1142539
MD5 | 46be67351e72532b2e82b186dfc8dbc1
BLAKE2b-256 | dba8d78b742e13418e09ac00cd92e2f69a7844be3001cf6658e7559b282d2c61
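A downloaded wheel can be checked against the SHA256 digests listed above with the standard library `hashlib` module. The sketch below is one possible way to do this; `sha256_of` and `matches` are hypothetical helper names, and the file path you pass in depends on where you saved the download.

```python
import hashlib
import hmac


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare the file's SHA256 digest with the expected hex string."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

For example, `matches("SerpScrap-0.9.0-py3-none-any.whl", "<SHA256 from the table>")` should return `True` for an intact download, assuming the file sits in the current directory.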