Framework/tool for web scraping.
Project description
Ghettobird Scraping Framework/Tool
The goal of this project is to:
- reduce scrapers down into a single JSON object, with a few auxiliary functions
By doing this we can:
- reduce the amount of boilerplate code in scraping projects.
- make scrapers readable
- group fragile code together for easy maintenance
- reduce inconsistencies in error handling
Bonus:
- if the structure of scrapers can be simplified, we can automate the generation of scrapers
This tool does not yet include:
- IP rotation, networking, or security
Why?
- Unethical, but more importantly, I know nothing about it
The Ghettobird Dictionary/JSON
Reserved keywords: "path", "transformer", "args", "iterate"
This summary will focus on the FLIGHTPATH, which guides the scraper to our desired data and provides instructions for what to do when it finds that data.
The structure of the FLIGHTPATH is preserved when results are yielded, with the exception of "path" and "transformer" entries, which are replaced with our desired values.
Sample transformer function:
def TRANSFORM_get_value(element): return element.get("text")
"url": "http://ghettobird.sample.s3-website.us-east-2.amazonaws.com", "flightpath": { "header": { "path": "//*[@class='page-header']", "transformer": TRANSFORM_get_value } }
The result ------------------------->
"flightpath": { "header": "Jobs in St. Louis, Missouri" }
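To illustrate what "structure is preserved" could mean in practice, here is a minimal sketch, not the library's actual code. The names `fill`, `lookup` and `FAKE_PAGE` are hypothetical, the nested "details" grouping is an assumption, and the page is faked with a plain dictionary so the example runs without any network access.

```python
def TRANSFORM_get_value(element):
    return element.get("text")

# Hypothetical stand-in for a page: xpath -> text found at that xpath.
FAKE_PAGE = {
    "//*[@class='page-header']": "Jobs in St. Louis, Missouri",
    "//*[@id='salary-query']": "Salary Range: 10,000 - 100,000",
}

def lookup(spec):
    """Resolve one {"path": ..., "transformer": ...} entry against FAKE_PAGE."""
    element = {"text": FAKE_PAGE.get(spec["path"])}
    transformer = spec.get("transformer")
    return transformer(element) if transformer else element.get("text")

def fill(flightpath, lookup):
    """Return a copy of the flightpath with every "path" entry replaced by its value."""
    result = {}
    for key, value in flightpath.items():
        if isinstance(value, dict) and "path" not in value:
            # A plain grouping dictionary (assumed to be allowed): keep the key, recurse.
            result[key] = fill(value, lookup)
        else:
            result[key] = lookup(value)
    return result

flightpath = {
    "header": {"path": "//*[@class='page-header']", "transformer": TRANSFORM_get_value},
    "details": {"salary_range": {"path": "//*[@id='salary-query']"}},
}
result = fill(flightpath, lookup)
print(result)
```

The keys of the output mirror the keys of the input; only the leaf entries change.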
An element is selected by its xpath, and the TRANSFORM function generally grabs the data and/or modifies it afterwards.
However, in the absence of a transform function, element.get("text") will always be called. This means a transform function is necessary for something like an input field, but is generally not needed if you are just grabbing plain text from a DIV, SPAN or P tag.
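The fallback described above can be sketched roughly as follows. This is an illustration under assumptions, not the library's implementation: it uses the standard library's `xml.etree.ElementTree` instead of a real HTML parser, the element is assumed to be a dictionary with "text" and "attrs" keys, and `wrap`, `local`, `resolve` and `TRANSFORM_get_input_value` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a scraped page.
PAGE = """
<html><body>
  <h1 class="page-header">Jobs in St. Louis, Missouri</h1>
  <input id="salary-min" value="10000" />
</body></html>
""".strip()

def wrap(node):
    """Assumed element shape: a dict exposing the node's text and attributes."""
    return {"text": (node.text or "").strip(), "attrs": dict(node.attrib)}

def local(xpath):
    """ElementTree only accepts relative paths, so anchor a leading '//' at the root."""
    return xpath if xpath.startswith(".") else "." + xpath

def resolve(root, spec):
    """Resolve one {"path": ..., "transformer": ...} entry to a value."""
    node = root.find(local(spec["path"]))
    if node is None:
        return None
    element = wrap(node)
    transformer = spec.get("transformer")
    # No transformer supplied: fall back to the element's text.
    return transformer(element) if transformer else element.get("text")

def TRANSFORM_get_input_value(element):
    # An <input> has no text content, so read its value attribute instead.
    return element["attrs"].get("value")

root = ET.fromstring(PAGE)
header = resolve(root, {"path": "//*[@class='page-header']"})
salary_min = resolve(
    root,
    {"path": "//*[@id='salary-min']", "transformer": TRANSFORM_get_input_value},
)
print(header)
print(salary_min)
```

Without the transformer, the input field would resolve to an empty string, since it has no text of its own.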
Similarly, the "path" key in a dictionary is not always needed.
Example:
"flightpath": { "salary_range": { "path": "//*[@id='salary-query']", "transformer": TRANSFORM_get_text }, }
Because the transformer function can be dropped in this situation, "path" would be the only key left within our "salary_range" object, so we can remove both keys and use the xpath string directly:
"flightpath": { "salary_range": "//*[@id='salary-query']", }
The result ------------------------->
"flightpath": { "salary_range": "Salary Range: 10,000 - 100,000" }
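One way to think about the shorthand is that a bare string is expanded back into the explicit dictionary before the scraper runs. The sketch below assumes that behaviour; `normalize` is a hypothetical helper, not a documented function of this package.

```python
def normalize(spec):
    """Expand the string shorthand into the explicit {"path": ...} form."""
    if isinstance(spec, str):
        return {"path": spec}
    return spec

short = {"salary_range": "//*[@id='salary-query']"}
explicit = {key: normalize(value) for key, value in short.items()}
print(explicit)
```

Both forms then describe the same lookup, and the default text extraction applies because no transformer is given.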
Values that are objects {} or strings will find only the first element that matches a given xpath.
Wrapping a dictionary value in an array will find all results matching the xpath.
"flightpath": { "job_titles": ["//*[@class='title']"], }
The result ------------------------->
"flightpath": { "job_titles": ["Senior Software Dev", "Agile Coach", "Software Engineer", "Junior Software Dev", "Ping Pong Player"] }
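The difference between a bare string and an array-wrapped string can be sketched like this. It is an illustration using the standard library's `xml.etree.ElementTree` on a made-up, well-formed page; `local` and `texts` are hypothetical helpers, and the real tool may work differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a scraped page.
PAGE = """<html><body>
  <p class="title">Senior Software Dev</p>
  <p class="title">Agile Coach</p>
  <p class="title">Software Engineer</p>
</body></html>"""

def local(xpath):
    """ElementTree only accepts relative paths, so anchor a leading '//' at the root."""
    return xpath if xpath.startswith(".") else "." + xpath

def texts(root, value):
    """A list value collects every match; a string value takes only the first."""
    if isinstance(value, list):
        return [(node.text or "").strip() for node in root.findall(local(value[0]))]
    node = root.find(local(value))
    return (node.text or "").strip() if node is not None else None

root = ET.fromstring(PAGE)
first_title = texts(root, "//*[@class='title']")
all_titles = texts(root, ["//*[@class='title']"])
print(first_title)
print(all_titles)
```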
It is often necessary to couple/group fields.
"flightpath": { "job_titles": ["//*[@class='title']"], "job_descriptions": ["//*[@class='description']"], }
This flightpath would return two arrays, but there would be nothing binding a title to its associated description.
To do this, we must use the "iterate" keyword.
Coupling fields:
"jobs": [{ "iterate": "//div[@class='job']", "title": ".//*[@class='title']", "description": ".//*[@class='description']", }],
Notice that we have an array of objects. The keyword "iterate" is necessary, as is the period that precedes each xpath.
Iterate will loop through all divs with the class "job", and then try to find elements matching the title and description xpaths within each one.
The result ------------------------->
"flightpath": { "jobs": [{ "title": "Senior Software Dev", "description": "Need a master of React Native, a man or woman with gumption, who can lead a team." }, { "title": "Agile Coach", "description": "Need a trained agility coach who can combat followers of the waterfall method. Zealotry required." }, ...] }
(4 more entries would follow)
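The coupling behaviour can be sketched as below: find each container first, then evaluate the remaining xpaths relative to that container, which is why the leading period matters. This is an assumption-laden illustration built on the standard library's `xml.etree.ElementTree`; the page, the shortened descriptions, and the helper names `local` and `iterate_records` are all made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a scraped page with two job containers.
PAGE = """<html><body>
  <div class="job">
    <h2 class="title">Senior Software Dev</h2>
    <p class="description">Leads a team.</p>
  </div>
  <div class="job">
    <h2 class="title">Agile Coach</h2>
    <p class="description">Coaches agility.</p>
  </div>
</body></html>"""

def local(xpath):
    """ElementTree only accepts relative paths, so anchor a leading '//' at the root."""
    return xpath if xpath.startswith(".") else "." + xpath

def iterate_records(root, template):
    """Build one record per container matched by the "iterate" xpath."""
    fields = {key: value for key, value in template.items() if key != "iterate"}
    records = []
    for container in root.findall(local(template["iterate"])):
        record = {}
        for name, xpath in fields.items():
            # The xpath starts with '.', so it only searches inside this container.
            node = container.find(xpath)
            record[name] = (node.text or "").strip() if node is not None else None
        records.append(record)
    return records

template = {
    "iterate": "//div[@class='job']",
    "title": ".//*[@class='title']",
    "description": ".//*[@class='description']",
}
jobs = iterate_records(ET.fromstring(PAGE), template)
print(jobs)
```

Because each title and description is looked up inside its own container, a job with a missing description yields None for that field instead of shifting the remaining descriptions out of alignment.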
File details
Details for the file ghettobird-0.0.3.tar.gz.
File metadata
- Download URL: ghettobird-0.0.3.tar.gz
- Upload date:
- Size: 2.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.23.0 setuptools/50.3.0 requests-toolbelt/0.9.1 tqdm/4.39.0 CPython/3.6.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | fc0d916d47d0e5def5fa5fbfea9a065a2733bc14fa0615ea07677b4622192a2e
MD5 | 432306ed8099b7f7b357ae4c6ceeeed5
BLAKE2b-256 | 7ffa79326c9090fb6a9bdcf25f9f1f0787741cd87352ddcc7a0453b7b2f2b999
File details
Details for the file ghettobird-0.0.3-py3-none-any.whl.
File metadata
- Download URL: ghettobird-0.0.3-py3-none-any.whl
- Upload date:
- Size: 2.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.23.0 setuptools/50.3.0 requests-toolbelt/0.9.1 tqdm/4.39.0 CPython/3.6.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | 0ffebd52786154f71968750468c307d6d597350b28afa3b6c8459e2a5b4c66e2
MD5 | 26fd5ef6a66a178e74ec24985f092ffd
BLAKE2b-256 | 2b44bc09ba7e41b3f2b7ff088504b3d3a715e5f70fd3784f15d16fc9fceeeb9d