Easy extraction of keywords from search engine results pages (SERPs).
Project description
serpextract provides easy extraction of keywords from search engine results pages (SERPs).
This module is possible thanks in large part to the very hard work of the Piwik team. Specifically, we make extensive use of their list of search engines.
Installation
Latest release on PyPI:
$ pip install serpextract
Or the latest development version:
$ pip install -e git+https://github.com/Parsely/serpextract.git#egg=serpextract
Usage
Command Line
On the command line, serpextract prints the engine name and the keyword, separated by a comma and each enclosed in quotes:
$ serpextract "http://www.google.ca/url?sa=t&rct=j&q=ars%20technica"
"Google","ars technica"
You can also print out a list of all the SearchEngineParsers currently available in your local cache via:
$ serpextract -l
Python
from serpextract import get_parser, extract, is_serp, get_all_query_params
non_serp_url = 'http://arstechnica.com/'
serp_url = ('http://www.google.ca/url?sa=t&rct=j&q=ars%20technica&source=web&cd=1&ved=0CCsQFjAA'
'&url=http%3A%2F%2Farstechnica.com%2F&ei=pf7RUYvhO4LdyAHf9oGAAw&usg=AFQjCNHA7qjcMXh'
'j-UX9EqSy26wZNlL9LQ&bvm=bv.48572450,d.aWc')
get_all_query_params()
# ['key', 'text', 'search_for', 'searchTerm', 'qrs', 'keyword', ...]
is_serp(serp_url)
# True
is_serp(non_serp_url)
# False
get_parser(serp_url)
# SearchEngineParser(engine_name='Google', keyword_extractor=['q'], link_macro='search?q={k}', charsets=['utf-8'])
get_parser(non_serp_url)
# None
extract(serp_url)
# ExtractResult(engine_name='Google', keyword=u'ars technica', parser=SearchEngineParser(...))
extract(non_serp_url)
# None
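To make the idea behind extract() concrete, here is a minimal sketch of keyword extraction using only the Python standard library. It is not serpextract's implementation: the function name extract_keyword_sketch and the short KEYWORD_PARAMS tuple are hypothetical, chosen for illustration, and unlike serpextract the sketch does not check whether the host belongs to a known search engine.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical, hand-picked subset of query parameter names.
# serpextract derives its real list from Piwik's search engine definitions.
KEYWORD_PARAMS = ("q", "query", "text", "keyword")


def extract_keyword_sketch(url):
    """Return the first non-empty keyword parameter in the URL, or None."""
    query = urlsplit(url).query
    params = parse_qs(query)  # percent-decodes values, drops blank ones
    for name in KEYWORD_PARAMS:
        values = params.get(name)
        if values:
            return values[0]
    return None


print(extract_keyword_sketch("http://www.google.ca/url?sa=t&rct=j&q=ars%20technica"))
# ars technica
print(extract_keyword_sketch("http://arstechnica.com/"))
# None
```

Because the sketch ignores the host name, it would treat any URL with a q parameter as a SERP, which is the gap the per-engine parsers returned by get_parser() are there to close.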
Tests
There are some basic tests for popular search engines, but more are required:
$ pip install -r requirements.txt
$ nosetests
Caching
Internally, this module caches an OrderedDict representation of Piwik’s list of search engines, stored in serpextract/search_engines.pickle. The list is not expected to change often, so this module ships with a cached version. You can manually update the local cache via:
$ serpextract -u
This action currently requires PHP (we know, we know). We grab Piwik’s PHP array of all search engines, turn it into JSON and then read it in as an OrderedDict. Eventually we should just parse the PHP script with a regex to eliminate the PHP requirement.
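The JSON-to-OrderedDict-to-pickle step can be sketched as follows. This is only an illustration under stated assumptions: the raw_json string is a made-up stand-in for whatever the PHP side emits, and its structure is hypothetical rather than Piwik's actual format.

```python
import json
import pickle
from collections import OrderedDict

# Hypothetical JSON standing in for the encoded PHP array of search engines.
raw_json = '{"google.com": ["Google", "q"], "bing.com": ["Bing", "q"]}'

# object_pairs_hook keeps the keys in the order they appear in the JSON text.
engines = json.loads(raw_json, object_pairs_hook=OrderedDict)

# Serialize to bytes as one would before writing a .pickle cache file,
# then load it back to confirm the round trip preserves content and order.
blob = pickle.dumps(engines)
restored = pickle.loads(blob)

print(list(restored.keys()))
# ['google.com', 'bing.com']
```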