A Python scraper to extract and analyze data from search engine result pages and URLs. It extracts data such as the URL, title, snippet, or ratings of results for given keywords.

Project description

SerpScrap is an SEO Python scraper to extract and analyze data from major search engine SERPs or the text content of any other URL. It extracts data like the title, URL, type, text and rich snippet of search results for given keywords, detects ads, and takes automated screenshots. It may be useful for SEO and research tasks.

SerpScrap extracts these result types:

  • ads_main - advertisements within regular search results

  • image - result from image search

  • news - news teaser within regular search results

  • results - standard search result

  • shopping - shopping teaser within regular search results

For each result on a results page you get:

  • domain

  • rank

  • rich snippet

  • site links

  • snippet

  • title

  • type

  • url

  • visible url
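As a sketch of how these fields might be consumed, the snippet below groups result dicts by their type. The key names (`serp_type`, `serp_rank`, `serp_url`) and the sample data are illustrative assumptions, not the library's guaranteed schema:

```python
from collections import defaultdict

def group_by_type(results):
    """Group result dicts by their type field (assumed key: 'serp_type')."""
    grouped = defaultdict(list)
    for result in results:
        grouped[result.get('serp_type', 'unknown')].append(result)
    return dict(grouped)

# Hypothetical sample results, shaped like the fields listed above
sample = [
    {'serp_type': 'results', 'serp_rank': 1, 'serp_url': 'https://example.com'},
    {'serp_type': 'ads_main', 'serp_rank': 2, 'serp_url': 'https://ads.example.com'},
    {'serp_type': 'results', 'serp_rank': 3, 'serp_url': 'https://example.org'},
]

grouped = group_by_type(sample)
print(sorted(grouped))           # ['ads_main', 'results']
print(len(grouped['results']))   # 2
```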

You also get a screenshot of each result page, and you can scrape the text content of each result URL. It is also possible to save the results as CSV for later analysis. If required, you can use your own proxy list.
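Since each result appears to be a plain dict, saving results as CSV can be done with the standard library alone. The helper and field names below are an illustrative sketch, not SerpScrap's built-in CSV export:

```python
import csv

def save_results_csv(results, path):
    """Write a list of result dicts to a CSV file, one row per result."""
    if not results:
        return
    # collect all keys across results so rows with missing fields still fit
    fieldnames = sorted({key for result in results for key in result})
    with open(path, 'w', newline='', encoding='utf-8') as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(results)

# Hypothetical result data
save_results_csv(
    [{'title': 'Example', 'url': 'https://example.com', 'rank': 1}],
    'serp_results.csv',
)
```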

Resources

See http://serpscrap.readthedocs.io/en/latest/ for documentation.

Source is available at https://github.com/ecoron/SerpScrap

Install

The easy way to install:

pip uninstall SerpScrap -y
pip install SerpScrap --upgrade

More details in the install [1] section of the documentation.

Usage

SerpScrap in your applications

#!/usr/bin/python3
# -*- coding: utf-8 -*-
import pprint
import serpscrap

keywords = ['example']

config = serpscrap.Config()
config.set('scrape_urls', False)

scrap = serpscrap.SerpScrap()
scrap.init(config=config.get(), keywords=keywords)
results = scrap.run()

for result in results:
    pprint.pprint(result)

More details in the examples [2] section of the documentation.
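To also fetch the text content of each result URL (as mentioned above), the `scrape_urls` option from the example can be flipped to `True`. This is a minimal configuration fragment, assuming the same API as the example:

```python
import serpscrap

config = serpscrap.Config()
# fetch the text content of each result url in addition to the serp data
config.set('scrape_urls', True)
```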

To avoid encoding/decoding issues on Windows, run these commands before using SerpScrap in your CLI:

chcp 65001
set PYTHONIOENCODING=utf-8

Changes

Notes about major changes between releases

0.10.0

  • support for headless chrome, adjusted default time between scrapes

0.9.0

  • result types added (news, shopping, image)

  • Image search is supported

0.8.0

  • text processing tools removed

  • fewer requirements

References

SerpScrap uses PhantomJS [3], a scriptable headless WebKit browser, which is installed automatically on the first run (Linux, Windows). The scrapcore is based on GoogleScraper [4], with several improvements.
