
Scraptor scraping micro framework

Project description

Scraptor is a pretentious scraping framework - pretentious because it cannot yet do even half of the features it aims for - that wants to scale and wants to grow. Scraptor is a child T-Rex scraper and is still learning a lot. Maybe one day Scraptor will live up to its goals.

Scraptor defines data as sets of fields. To specify a field, use the `@field` decorator and provide a callback function that processes the result before it is saved. A field takes several parameters: `css_selector` and `name` are required, `attr` is optional. The syntax for defining a field is:

```python
@field(css_selector, name, attr)
def callback(field_value):
    # Do something with field_value before saving
    return field_value
```
The following field strips the `http://` and `https://` scheme prefixes from links:

```python
@field('a', name="link", attr='href')
def clean(link):
    return link.replace("http://", "").replace("https://", "")
```
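Since a field callback is a plain Python function, its cleaning logic can be sanity-checked on its own. A standalone sketch of the callback above, outside the framework:

```python
# Standalone version of the link-cleaning callback, without the
# @field decorator, so the logic can be tested in isolation.
def clean(link):
    # Remove the scheme prefix, leaving the bare host and path
    return link.replace("http://", "").replace("https://", "")

print(clean("https://example.com/page"))  # example.com/page
```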
If `attr` is omitted, the field returns the text value of the element:

```python
@field('p', name='paragraph')
def censor(text):
    replacement_dictionary = [("fuck", "great"), ("shit", "nice")]
    for bad, good in replacement_dictionary:
        # str.replace returns a new string, so reassign the result
        text = text.replace(bad, good)
    return text
```
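Note that `str.replace` returns a new string rather than modifying the string in place, so the result must be assigned back on each pass. A standalone sketch of the working replacement logic:

```python
# Standalone version of the censoring callback: str.replace returns a
# new string, so the result is assigned back to text on each iteration.
def censor(text):
    replacement_dictionary = [("fuck", "great"), ("shit", "nice")]
    for bad, good in replacement_dictionary:
        text = text.replace(bad, good)
    return text

print(censor("what a shit day"))  # what a nice day
```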
After defining all the fields, call `run` with the URL to scrape and the CSS selector (`nodeOfType`) that identifies a container node. If `nodeOfType` is omitted, the container node is the whole document.

```python
run(url="", nodeOfType=".MomentCapsuleSummary")
```

The following example extracts the image URLs and titles of Twitter's Moments:
```python
from scraptor import *

@field(".MomentCapsuleDetails-title", name="title")
def y(x): return x

@field(".MomentMediaItem-entity--image", name="imagesURL", attr="src")
def y(x): return x

run(url="", nodeOfType=".MomentCapsuleSummary")

# RESULT EXAMPLE - run on Monday, November 23rd, 2015
# {'imagesURL': u'', 'title': u'"Anti-Muslim is Anti-American" column sparks controversy'}
# {'imagesURL': u'', 'title': u'LeBron & Steph continue NBA domination'}
# {'imagesURL': u'', 'title': u'When Slack goes down'}
# {'imagesURL': u'', 'title': u'Celebrities only black people know'}
# {'imagesURL': u'', 'title': u"New Game of Thrones poster teases Jon Snow's fate"}
# {'imagesURL': u'', 'title': u'Mouth-watering Thanksgiving spreads'}
# {'imagesURL': u'', 'title': u'Show us your fat pets'}
# {'imagesURL': u'', 'title': u'Happy Doctor Who Day, Whovians'}
```

Planned implementation of the following classes:

Class | Description
------------------------ | ------------------------
class Storage | Backend for saving. Currently aiming at Firebase and at files of type CSV, XML, HTML, and JSON.
class Formats | Used by Storage.
class Paginations | Decision tree for finding pagination DOM elements, or actions to continue scraping.
class Instructions | Maybe a CLI?
class ImageStorages | Only aiming at Imgurl.
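As a rough illustration of what the planned CSV side of the Storage backend might look like, here is a minimal sketch. Since Storage is not implemented yet, the class name, constructor, and `save`/`dump` methods are all assumptions, not the framework's actual API:

```python
import csv
import io

# Hypothetical sketch of a CSV storage backend; names and
# signatures are assumptions, as the real class is only planned.
class CsvStorage:
    def __init__(self, fieldnames):
        # One column per field name defined with @field
        self.buffer = io.StringIO()
        self.writer = csv.DictWriter(self.buffer, fieldnames=fieldnames)
        self.writer.writeheader()

    def save(self, record):
        # Write one scraped record (a dict of field name -> value)
        self.writer.writerow(record)

    def dump(self):
        # Return everything written so far as CSV text
        return self.buffer.getvalue()

storage = CsvStorage(["title", "imagesURL"])
storage.save({"title": "When Slack goes down", "imagesURL": ""})
print(storage.dump())
```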


Built Distribution

scraptor-0.2.0-py2-none-any.whl (8.4 kB, py2)
