Skip to main content

A client for communicating with the Google Search Appliance.

Project description

build status

A client library for the Google Search Appliance, to make retrieving search results in Python easier.

Installation

This module is in PyPi as ubuntudesign.gsa. You should be able to install it simply with:

pip install ubuntudesign.gsa

GSAClient

This is a basic client for querying a Google Search Appliance.

Making queries

You can query the GSA using the search method.

search_client = GSAClient(base_url="http://gsa.example.com/search")

first_ten_results = search_client.search("hello world")

first_thirty_results = search_client.search("hello world", num=30)

results_twenty_to_forty = search_client.search(
  "hello world", start=20, num=20
)

This will set the q, start (default: 0) and num (default: 10) and lr (default: ‘’) parameters. No other search parameters, will be provided, so they will all fall back to their defaults.

The returned results object will attempt to map each of the GSA’s standard result XML tags into a more readable format:

{
    'estimated_total_results': int,  # "M": GSA's estimate, see below
    'document_filtering': bool,      # "FI": Is filtering enabled?
    'next_url': str,                 # "NU": GSA URL for querying the next set of results, if available
    'previous_url': str,             # "PU": Ditto for previous set of results
    'items': [
        {
            'index': int,            # "R[N]": The number of this result in the index of all results
            'url': str,              # "U": The URL of the resulting page
            'encoded_url': str,      # "UE": The above URL, encoded
            'title': str,            # "T": The page title
            'relevancy': int,        # "RK": How relevant is this result to the query? From 0 to 10
            'appliance_id': str,     # "ENT_SOURCE": The serial number of the GSA
            'summary': str,          # "S": Summary text for this result
            'language': str,         # "LANG": The language of the page
            'details': {}            # "FS": Name:value pairs of any extra info
            'link_supported': bool,  # "L": “link:” special query term is supported,
            'cache': {               # "C": Dictionary, or "None" if cache is not available
                'size': str,         # "C[SZ]": Human readable size of cached page
                'cache_id': str,     # "C[CID]": ID of document in GSA's cache
                'encoding': str      # "C[ENC]": The text encoding of the cached page
            }
        },
        ...
    ]
}

Filtering by domain or language

You can filter your search results by specifying specific domains or a specific language.

english_results = search_client.search("hello world", language="lang_en")
non_english_results = search_client.search("hello world", language="-lang_en")
domain_specific_results = search_client.search(
    "hello world",
    domains=["site1.example.com", "site2.example.com"]
)

NB: If no search results are found with the specified language, the GSA will fall back to returning any results it finds in all languages.

Getting accurate totals

At the time of writing, the Google Search Appliance will return an “estimate” of the total number of results with each query, but this estimate is usually wildly inaccurate, sometimes out by more than a factor of 10! This is true even with rc enabled.

With the total_results method, the client will attempt to request results 990 - 1000. This will usually result in the GSA returning the last page of results, which allows us to find the actual total number of results.

total = search_client.total_results("hello world", domains=[], language='')

Django view

To simplify usage of the GSA client with Django, a Django view is included with this module.

Usage

At the minimum, need to provide the SEARCH_SERVER_URL setting to tell the view where to find the GSA:

# settings.py
SEARCH_SERVER_URL = 'http://gsa.example.com/search'  # Required: GSA location
SEARCH_DOMAINS = ['site1.example.com']               # Optional: By default, limit results to this set of domains
SEARCH_LANGUAGE = 'lang_zh-CN'                       # Optional: By default, limit results to this language

# urls.py
from ubuntudesign.gsa.views import SearchView
urlpatterns += [url(r'^search/?$', SearchView.as_view(template_name="search.html"))]

This view will then be available to be queried:

  • example.com/search?q=my+search+term

  • example.com/search?q=my+search+term&domain=example.com&domain=something.example.com (overrides SEARCH_DOMAINS)

  • example.com/search?q=my+search+term&language=-lang_zh-CN (exclude results in Chinese, overrides SEARCH_LANGUAGE)

After retrieving search results, the view will pass the context object to the specified template_name (in this case search.html).

The context object will be structured as follows:

{
    'query': str,       # The value of the `q` parameters passed to the view
    'limit': int,       # The value of the `limit` parameter, or the default of 10
    'offset': int,      # The value of the `offset` parameter, or the default of 0
    'error': None|str,  # None, or a description of the error if one occurred
    'results': {
        'items': [],    # The list of items as returned from the GSAClient (see above)
        'total': int,   # The exact total number of results available
        'start': int,   # The index of the first result in the set
        'end': int,     # The index of the last result in the set
        'next_offset': int|None,      # The offset for the next page of results, if available
        'previous_offset': int|None,  # The offset for the previous page of results, if available
        'last_page_offset': int,      # The offset for the last page of results
        'last_page': int,             # The final page number (calculated from "limit" and "total")
        'current_page': int,          # The current page number (calculated from "limit" and "end")
        'penultimate_page': int       # The second-to-last page
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ubuntudesign.gsa-1.0.1.tar.gz (6.5 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page