
Grab

Update (2025)

Since 2018, the year of the most recent Grab release, I have tried to do a large refactoring of the Grab code base. It ended up as a semi-working product that nobody uses, including me. I have decided to reset all project files to the state of the most recent PyPI release, 0.6.41, dated June 2018. At least now the code base corresponds to the live version of the product, which, according to PyPI stats, is still used by some people.

I have updated the Grab code base and the code bases of its dependencies to be compatible with Python 2.7 and Python 3.13 (and, hopefully, all Python versions in between). I have set up a GitHub Action to run all tests on Python 2.7 and Python 3.13.

There are NO new features. It is just an updated code base which is alive now, i.e. it can run on Python 2.7 or on a modern Python, its tests pass, and it has a GitHub CI config to run tests on new commits.

One backward-incompatible change is that I no longer use the weblib.error.DataNotFound and weblib.error.ResponseNotValid exceptions. Grab now uses the DataNotFound and InvalidResponseError exceptions, which are defined in the grab.error module. So, if your code imports DataNotFound or ResponseNotValid from weblib, you should fix those imports. Also, if your code explicitly catches these weblib exceptions, you should convert it to catch the new grab.error exceptions.
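For code that has to work with both the old and the new release, a small compatibility shim can paper over the move. This is only a sketch; the stand-in classes at the end exist solely so the snippet can run even in an environment where neither grab nor weblib is installed:

```python
# Old code (Grab <= 0.6.41) imported exceptions from weblib:
#     from weblib.error import DataNotFound, ResponseNotValid
# New code (Grab >= 1.0) imports them from grab.error:
#     from grab.error import DataNotFound, InvalidResponseError
#
# A shim that works with either release:
try:
    from grab.error import DataNotFound, InvalidResponseError
except ImportError:
    try:
        # Fall back to the pre-1.0 weblib-based exceptions
        from weblib.error import DataNotFound
        from weblib.error import ResponseNotValid as InvalidResponseError
    except ImportError:
        # Neither library is installed (e.g. in this standalone sketch);
        # define stand-ins so the rest of the module can still import.
        class DataNotFound(Exception):
            pass

        class InvalidResponseError(Exception):
            pass
```

After this shim, the rest of your code can catch DataNotFound and InvalidResponseError regardless of which Grab release is installed.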

The major version of the new release is 1. If you use Grab in your project and want to keep the old release, to be sure there are no backward-compatibility bugs, use this specification in your requirements file: grab<1.0

Support

You are welcome to talk about web scraping and data processing in these Telegram chat groups: @grablab (English) and @grablab_ru (Russian).

Documentation: https://grab.readthedocs.io/en/stable/

Installation

Run pip install -U grab

See details about installing Grab on different platforms here: https://grab.readthedocs.io/en/stable/usage/installation.html

What is Grab?

Grab is a Python web scraping framework. It provides a number of helpful methods for performing network requests, scraping web sites, and processing the scraped content:

  • Automatic cookies (session) support
  • HTTP and SOCKS proxy with/without authorization
  • Keep-Alive support
  • IDN support
  • Tools to work with web forms
  • Easy multipart file uploading
  • Flexible customization of HTTP requests
  • Automatic charset detection
  • Powerful API to extract data from the DOM tree of HTML documents with XPath queries
  • Asynchronous API to make thousands of simultaneous queries. This part of the library is called Spider; see the list of Spider features below.
  • Python 3 ready

Spider is a framework for writing web site scrapers. Features:

  • Rules and conventions to organize the request/parse logic in separate blocks of code
  • Multiple parallel network requests
  • Automatic processing of network errors (failed tasks go back to task queue)
  • You can create network requests and parse responses with Grab API (see above)
  • HTTP proxy support
  • Caching network results in permanent storage
  • Different backends for task queue (in-memory, redis, mongodb)
  • Tools to debug and collect statistics

Grab Example

import logging

from grab import Grab

logging.basicConfig(level=logging.DEBUG)

g = Grab()

# Log in to GitHub by filling and submitting the login form
g.go('https://github.com/login')
g.doc.set_input('login', '****')
g.doc.set_input('password', '****')
g.doc.submit()

# Save the response body for debugging
g.doc.save('/tmp/x.html')

# Ensure that login succeeded: the signout button must be present
g.doc('//ul[@id="user-links"]//button[contains(@class, "signout")]').assert_exists()

# Extract the link to the user profile and open its repositories tab
home_url = g.doc('//a[contains(@class, "header-nav-link name")]/@href').text()
repo_url = home_url + '?tab=repositories'

g.go(repo_url)

# Print the name and absolute URL of each repository
for elem in g.doc.select('//h3[@class="repo-list-name"]/a'):
    print('%s: %s' % (elem.text(),
                      g.make_url_absolute(elem.attr('href'))))

Grab::Spider Example

import logging

from grab.spider import Spider, Task

logging.basicConfig(level=logging.DEBUG)


class ExampleSpider(Spider):
    def task_generator(self):
        # Initial tasks: one Google search per language
        for lang in 'python', 'ruby', 'perl':
            url = 'https://www.google.com/search?q=%s' % lang
            yield Task('search', url=url, lang=lang)

    def task_search(self, grab, task):
        # Handler for the "search" task: print the first result's URL
        print('%s: %s' % (task.lang,
                          grab.doc('//div[@class="s"]//cite').text()))


# Run the spider with two parallel network threads
bot = ExampleSpider(thread_number=2)
bot.run()
