Framework/tool for web scraping.

ghettobird

A Python framework/tool designed for web scraping. (This readme and the tool itself are still under construction.)

Installation

Use the package manager pip to install ghettobird.

pip install ghettobird

Goals

The primary goal of this project is to reduce a scraping application to a single dictionary/JSON object (with a few auxiliary functions sprinkled in).

This allows us to:

  • Reduce boilerplate code
  • Increase code readability
  • Group fragile pieces of code for easy maintenance
  • Reduce inconsistencies in error handling

Usage

The usage examples scrape a sample website, whose URL appears in each itinerary below. It is a static HTML page, but a JS-heavy sample will be added soon.

from ghettobird import fly

Example One: Grabbing a single element

If we wanted to grab the page header from our sample page and expected only one element to be returned, we could use the following "flightpath":

itinerary = {
    "url": "http://ghettobird.sample.s3-website.us-east-2.amazonaws.com",
    "flightpath": {
        "header": "//*[@class='page-header']",
    },
}

We would get back a dictionary that follows the blueprint we laid out, populated with the scraped data:

{'header': 'Jobs in St. Louis, Missouri'}

Example Two: Grabbing a list

If we wanted to grab every job title from our sample page, the following flightpath would be appropriate. Notice the brackets that surround our XPath: wrapping it in a list returns the values of all matching elements rather than a single value.

itinerary = {
    "url": "http://ghettobird.sample.s3-website.us-east-2.amazonaws.com",
    "flightpath": {
        "titles": ["//h4[@class='title']"],
    },
}

The result:

{'titles': ['Senior Software Dev',
            'Agile Coach',
            'Software Engineer',
            'Junior Software Dev',
            'Ping Pong Player']}
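
The convention in Examples One and Two (a bare XPath string yields one value, an XPath wrapped in a list yields every match) can be sketched with the standard library alone. This is not ghettobird's implementation: `resolve_flightpath`, `_text`, and the inline `SAMPLE_HTML` are hypothetical names invented for illustration, and `xml.etree.ElementTree` supports only a subset of XPath and requires well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the sample page (well-formed so ElementTree can parse it).
SAMPLE_HTML = """
<html><body>
  <h1 class="page-header">Jobs in St. Louis, Missouri</h1>
  <h4 class="title">Senior Software Dev</h4>
  <h4 class="title">Agile Coach</h4>
</body></html>
""".strip()


def _text(element):
    # Join all text inside the element and trim surrounding whitespace.
    return "".join(element.itertext()).strip()


def resolve_flightpath(root, flightpath):
    """Illustrative only: mimics the string-vs-list convention, not ghettobird itself."""
    result = {}
    for key, spec in flightpath.items():
        wants_list = isinstance(spec, list)
        xpath = spec[0] if wants_list else spec
        # ElementTree needs a relative path, so "//h4" becomes ".//h4".
        if xpath.startswith("/"):
            xpath = "." + xpath
        matches = root.findall(xpath)
        if wants_list:
            result[key] = [_text(m) for m in matches]
        else:
            # Assumption: a missing element maps to None in this sketch.
            result[key] = _text(matches[0]) if matches else None
    return result


root = ET.fromstring(SAMPLE_HTML)
flightpath = {
    "header": "//*[@class='page-header']",
    "titles": ["//h4[@class='title']"],
}
print(resolve_flightpath(root, flightpath))
```

The result keeps the keys of the flightpath, which is what makes the dictionary act as a blueprint for the output.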

Example Three: Transformer functions

By default, elements found with a given XPath have their text values returned. However, if we need to transform the element in some way or get an href rather than its text, "transformer" functions will be necessary.

from ghettobird import fly, transformer
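
Conceptually, a transformer is a function that receives the matched element and returns something other than its raw text, such as an attribute. The sketch below shows that idea with the standard library only. It does not use or reproduce ghettobird's `transformer` API; `get_href`, `upper_text`, and the inline markup are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the href value is made up for illustration.
SAMPLE = '<div><a class="job" href="/jobs/1">Agile Coach</a></div>'


def get_href(element):
    # Return the link target instead of the element's text.
    return element.get("href")


def upper_text(element):
    # Return the element's own text, trimmed and upper-cased.
    return (element.text or "").strip().upper()


link = ET.fromstring(SAMPLE).find(".//a[@class='job']")
print(get_href(link))
print(upper_text(link))
```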
