
Framework/tool for web scraping.

Ghettobird Scraping Framework/Tool

The goal of this project is to:

  • reduce scrapers down into a single JSON object, with a few auxiliary functions

By doing this we can:

  • reduce the amount of boilerplate code in scraping projects.
  • make scrapers readable
  • group fragile code together for easy maintenance
  • reduce inconsistencies in error handling

Bonus:

  • if the structure of scrapers can be simplified, we can automate the generation of scrapers

This tool does not yet include:

  • IP rotation, networking, or security
Why?
    - Unethical, but more importantly, I know nothing about it

The Ghettobird Dictionary/JSON

Reserved keywords: "path", "transformer", "args", "iterate"

This summary will focus on the FLIGHTPATH, which guides the scraper to our desired data and provides instructions for what to do when it encounters that data.

The structure of the FLIGHTPATH is preserved when results are yielded, with the exception of "path" values and "transformer" functions, which will be replaced with our desired values.

Sample transformer function:

def TRANSFORM_get_value(element):
    return element.get("text")

"url": "http://ghettobird.sample.s3-website.us-east-2.amazonaws.com",
"flightpath": {
    "header": {
        "path": "//*[@class='page-header']",
        "transformer": TRANSFORM_get_value
    }
}

The result ------------------------->

"flightpath": { "header": "Jobs in St. Louis, Missouri" }


An element is selected by its xpath, and the TRANSFORM function generally grabs the data and/or modifies it afterwards.

However, in the absence of a transform function, element.get("text") will always be called. This means a transform function is necessary for something like an input field, but is generally not necessary if you are just grabbing plain text from a DIV, SPAN or P tag.
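
The difference between the default and a custom transformer can be sketched as follows. This is only an illustration, not Ghettobird's own code: it uses the standard library's ElementTree as a stand-in parser (which needs a leading period on its xpaths), and the names default_text, TRANSFORM_get_input_value and extract are hypothetical. The element object that Ghettobird actually passes to a transformer may behave differently.

```python
import xml.etree.ElementTree as ET

# Stand-in markup (an assumption); a real page would be fetched and parsed elsewhere.
PAGE = ET.fromstring(
    "<body>"
    "<span id='salary-query'>Salary Range: 10,000 - 100,000</span>"
    "<input id='city' value='St. Louis' />"
    "</body>"
)


def default_text(element):
    """Hypothetical default: return the element's text content."""
    return "".join(element.itertext()).strip()


def TRANSFORM_get_input_value(element):
    """Hypothetical transformer: read the 'value' attribute of an input."""
    return element.get("value")


def extract(root, path, transformer=None):
    """Find the first match for path, then apply the transformer or the default."""
    element = root.find(path)
    if element is None:
        return None
    return (transformer or default_text)(element)


# Plain text in a SPAN: the default is enough.
salary = extract(PAGE, ".//*[@id='salary-query']")

# An input field has no text, so the default returns an empty string...
city_without_transformer = extract(PAGE, ".//*[@id='city']")

# ...and a transformer is needed to read its value.
city = extract(PAGE, ".//*[@id='city']", TRANSFORM_get_input_value)
```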

Similarly, the "path" key in a dictionary is not always needed.

Example:

"flightpath": { "salary_range": { "path": "//*[@id='salary-query']", "transformer": TRANSFORM_get_text }, }

Because the transformer function can be dropped in this situation, "path" would be the only key within our "salary_range" object, so we can remove both keys and use the xpath string directly:

"flightpath": { "salary_range": "//*[@id='salary-query']", }

The result ------------------------->

"flightpath": { "salary_range": "Salary Range: 10,000 - 100,000" }

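One way to think about the shorthand is that a bare string stands for a dictionary containing only a "path" key. The sketch below shows that equivalence with a hypothetical normalize helper; it is an assumption about how the two forms relate, not Ghettobird's internal code.

```python
def normalize(value):
    """Expand the string shorthand into the explicit dictionary form (hypothetical helper)."""
    if isinstance(value, str):
        return {"path": value}
    return value


short = {"salary_range": "//*[@id='salary-query']"}
explicit = {"salary_range": {"path": "//*[@id='salary-query']"}}

# After normalizing, the shorthand and the explicit form describe the same lookup.
normalized = {key: normalize(value) for key, value in short.items()}
```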

Values that are objects {} or strings will find only the first element that matches a given xpath.

Wrapping a dictionary value in an array will find all results matching the xpath.

"flightpath": { "job_titles": ["//*[@class='title']"], }

The result ------------------------->

"flightpath": { "job_titles": ["Senior Software Dev", "Agile Coach", "Software Engineer", "Junior Software Dev", "Ping Pong Player"] }

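The first-match versus all-matches rule can be sketched like this. The resolve function is hypothetical and uses the standard library's ElementTree as a stand-in parser (hence the leading period on the xpath); the markup is made up for the example.

```python
import xml.etree.ElementTree as ET

# Stand-in markup (an assumption) with several elements sharing one class.
PAGE = ET.fromstring(
    "<body>"
    "<p class='title'>Senior Software Dev</p>"
    "<p class='title'>Agile Coach</p>"
    "<p class='title'>Software Engineer</p>"
    "</body>"
)


def resolve(root, value):
    """Hypothetical resolver: a list means 'all matches', a string means 'first match'."""
    if isinstance(value, list):
        path = value[0]
        return [element.text for element in root.findall(path)]
    match = root.find(value)
    return match.text if match is not None else None


first_only = resolve(PAGE, ".//*[@class='title']")
all_titles = resolve(PAGE, [".//*[@class='title']"])
```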

It is often necessary to couple/group fields.

"flightpath": { "job_titles": ["//*[@class='title']"], "job_descriptions": ["//*[@class='description']"], }

This flightpath would return two arrays, but there would be nothing binding a title to its associated description.
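
To see why two separate arrays are a problem, consider a page where one job has no description. The sketch below uses made-up markup and the standard library's ElementTree (not Ghettobird itself) to collect both arrays and pair them by position.

```python
import xml.etree.ElementTree as ET

# Stand-in markup (an assumption): the second job has no description.
PAGE = ET.fromstring(
    "<body>"
    "<div class='job'><p class='title'>Senior Software Dev</p>"
    "<p class='description'>Lead a team.</p></div>"
    "<div class='job'><p class='title'>Agile Coach</p></div>"
    "<div class='job'><p class='title'>Software Engineer</p>"
    "<p class='description'>Write code.</p></div>"
    "</body>"
)

titles = [el.text for el in PAGE.findall(".//*[@class='title']")]
descriptions = [el.text for el in PAGE.findall(".//*[@class='description']")]

# Pairing by position silently attaches "Write code." to the wrong job.
paired = list(zip(titles, descriptions))
```

Here the arrays have different lengths, and the second title ends up next to the third job's description. Keeping each title with its own description means grouping the fields by the element that contains them.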

To do this, we must use the "iterate" keyword.

Coupling fields:

"jobs": [{
    "iterate": "//div[@class='job']",
    "title": ".//*[@class='title']",
    "description": ".//*[@class='description']",
}],

Notice that we have an array of objects. The "iterate" keyword is necessary, as is the period that precedes each nested xpath, which keeps the search relative to the current element.

"iterate" will loop through all divs with the class "job", and then try to find elements matching the title and description xpaths within each one.

The result ------------------------->

"flightpath": {
    "jobs": [
        {
            "title": "Senior Software Dev",
            "description": "Need a master of React Native, a man or woman with gumption, who can lead a team."
        },
        {
            "title": "Agile Coach",
            "description": "Need a trained agility coach who can combat followers of the waterfall method. Zealotry required."
        },
        ...
    ]
}

(3 more entries would follow)
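
The looping behaviour of "iterate" can be sketched as below. This is a minimal illustration under stated assumptions, not Ghettobird's implementation: resolve_iterate is a hypothetical function, the markup is made up, and the standard library's ElementTree stands in for the real parser (so the "iterate" xpath also carries a leading period here).

```python
import xml.etree.ElementTree as ET

# Stand-in markup (an assumption): the second job has no description.
PAGE = ET.fromstring(
    "<body>"
    "<div class='job'><h2 class='title'>Senior Software Dev</h2>"
    "<p class='description'>Lead a team.</p></div>"
    "<div class='job'><h2 class='title'>Agile Coach</h2></div>"
    "</body>"
)


def resolve_iterate(root, spec):
    """Hypothetical resolver: one result dictionary per element matched by 'iterate'."""
    results = []
    for container in root.findall(spec["iterate"]):
        record = {}
        for key, path in spec.items():
            if key == "iterate":
                continue  # the keyword itself is not a field
            # The relative xpath is evaluated inside the current container only.
            match = container.find(path)
            record[key] = match.text if match is not None else None
        results.append(record)
    return results


spec = {
    "iterate": ".//div[@class='job']",
    "title": ".//*[@class='title']",
    "description": ".//*[@class='description']",
}
jobs = resolve_iterate(PAGE, spec)
```

Because every field is looked up inside its own container, a job without a description gets None for that field in this sketch instead of borrowing the next job's description.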

