scraptor

Scraptor scraping micro framework

These details have not been verified by PyPI

Development Status
- 5 - Production/Stable
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Programming Language
Topic
- Software Development :: Build Tools

Project description

Scraptor
=======
Scraptor is a pretentious scraping framework (pretentious because it cannot yet do even half of the features it aims for) that wants to scale and wants to grow. Scraptor is a child T-Rex scraper and is still learning a lot. Maybe one day Scraptor will live up to its goals.

Syntax
=======
Scraptor defines data as sets of fields. To specify a field, apply the @field decorator to a callback function that processes the extracted value before it is saved. A field can take several parameters. The syntax for defining a field is:
```python
@field(css_selector, name, attr)
def callback(field_value):
    # Do something with field_value before saving
    return field_value
# 'css_selector' and 'name' are required, 'attr' is optional
```
The following field strips the 'http://' and 'https://' prefixes from links:
```python
@field('a', name="link", attr='href')
def clean(link):
    return link.replace("http://", "").replace("https://", "")
```
If attr is omitted, the field returns the text value of the element:
```python
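@field('p', name='paragraph')
def censor(text):
    # (word, replacement) pairs, applied in order
    replacements = [("fuck", "great"), ("shit", "nice")]
    for word, replacement in replacements:
        # str.replace returns a new string, so reassign the result
        text = text.replace(word, replacement)
    return text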
```
After defining all the fields, call run with the URL to scrape and the CSS selector (nodeOfType) that defines a container node. If nodeOfType is omitted, the container node is the whole document.
```python
run(url = "https://twitter.com/i/moments", nodeOfType = ".MomentCapsuleSummary")
```
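The container-node idea can be illustrated with a small, self-contained sketch: one record is produced per container, and each field is looked up inside that container. This is an assumption about the general technique, not Scraptor's code. It uses the standard library's `xml.etree.ElementTree` with exact `class` attribute matching as a stand-in for real CSS selectors, and `DOC`, `FIELDS`, and `extract` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup
DOC = """
<html><body>
  <div class="card"><h2 class="title">First</h2><img class="pic" src="a.jpg"/></div>
  <div class="card"><h2 class="title">Second</h2><img class="pic" src="b.jpg"/></div>
</body></html>
""".strip()

# (class to match, field name, attribute to read or None for text)
FIELDS = [("title", "title", None), ("pic", "imagesURL", "src")]


def extract(doc, container_class=None):
    """Build one dict per container; the whole document if no container is given."""
    root = ET.fromstring(doc)
    if container_class is None:
        containers = [root]
    else:
        containers = root.findall(".//*[@class='%s']" % container_class)
    records = []
    for node in containers:
        record = {}
        for cls, name, attr in FIELDS:
            element = node.find(".//*[@class='%s']" % cls)  # first match only
            if element is None:
                continue
            record[name] = element.get(attr) if attr else (element.text or "")
        records.append(record)
    return records


records = extract(DOC, "card")
```

In this sketch, omitting the container yields a single record built from the first match of each field in the whole document; how Scraptor itself handles that case is not stated here.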

Example
=======
The following example extracts the image URL and the title of each Twitter Moment. It is saved as example_links.py.
```python
from scraptor import *

@field(".MomentCapsuleDetails-title", name="title")
def y(x): return x

@field(".MomentMediaItem-entity--image", name="imagesURL", attr = "src")
def y(x): return x

run(url = "https://twitter.com/i/moments", nodeOfType = ".MomentCapsuleSummary")

# RESULT EXAMPLE - run on Monday, November 23rd, 2015
# {'imagesURL': u'https://pbs.twimg.com/media/CUhQSWoWEAA1tis.jpg:large', 'title': u'"Anti-Muslim is Anti-American" column sparks controversy'}
# {'imagesURL': u'https://pbs.twimg.com/media/CUDBMH2WwAEF75C.jpg:large', 'title': u'LeBron & Steph continue NBA domination'}
# {'imagesURL': u'https://pbs.twimg.com/media/CUhYzbRU8AAlaoT.png:large', 'title': u'When Slack goes down'}
# {'imagesURL': u'https://pbs.twimg.com/media/CUdO5giUcAE8oMT.jpg:large', 'title': u'Celebrities only black people know'}
# {'imagesURL': u'https://pbs.twimg.com/media/CUghf-tWsAQ5ftS.jpg:large', 'title': u"New Game of Thrones poster teases Jon Snow's fate"}
# {'imagesURL': u'https://o.twimg.com/2/proxy.jpg?t=HBiTAWh0dHBzOi8vdi5jZG4udmluZS5jby9yL3ZpZGVvcy9FQkM1Q0FERUFGMTE0OTkwNDIzMjA3MDE4MDg2NF8zOGI3OGNhZWZhMC4xLjEuOTU5NzYzNDQ2MjUwNTExMzc0Ny5tcDQuanBnP3ZlcnNpb25JZD01eU54dXFnX2NrbHhoWW8zamlGRzd5UHEuWHhCVXYyMBTABxTABwAWABIA&s=xlxoIi9Ri3VEJqq8cHVbcS04UE2-2lu32hf-r4rilsU', 'title': u'Mouth-watering Thanksgiving spreads'}
# {'imagesURL': u'https://o.twimg.com/2/proxy.jpg?t=HBiUAWh0dHBzOi8vdi5jZG4udmluZS5jby9yL3ZpZGVvcy8zQTVBMEVDMjlFMTI3NjA1NDA3MTQ0MjM5NTEzNl80N2MzMjAzMjVhNi4zLjAuMTgwNjI0NjIyNDA1Njc2NDMxMjMubXA0LmpwZz92ZXJzaW9uSWQ9UUsycUZsbUM4NkFZVGdidHd0OE9KYUoya2R1ODBkQnkUwAcUwAcAFgASAA&s=PS2LPX-HQMWYau5Rvj5SXvdMuGVFp0Q1ILd8Ead3QZo', 'title': u'Show us your fat pets'}
# {'imagesURL': u'https://pbs.twimg.com/tweet_video_thumb/CUf9-rSW4AA3DWC.png', 'title': u'Happy Doctor Who Day, Whovians'}
```

TODO
=======
Implementation of the following:

Class | Description
------------------------ | ------------------------
class Storage | Backend for saving. Currently aiming at Firebase and at CSV, XML, HTML, and JSON files.
class Formats | Used by Storage.
class Paginations | Decision tree for finding pagination DOM elements, or for using actions to continue scraping.
class Instructions | Maybe a CLI?
class ImageStorages | Only aiming at Imgur.
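
Since none of these classes exist yet, the following is only a sketch of what file-based output for scraped records could look like, using the standard library. The function names `to_json_lines` and `to_csv` are hypothetical and are not part of Scraptor.

```python
import csv
import io
import json


def to_json_lines(records):
    """Serialize each record as one JSON object per line."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)


def to_csv(records, fieldnames):
    """Serialize records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buffer.getvalue()


# Hypothetical record in the same shape as the example results
sample = [{"title": "a", "imagesURL": "b"}]
csv_text = to_csv(sample, ["title", "imagesURL"])
json_text = to_json_lines(sample)
```

Returning text rather than writing files directly keeps the format logic separate from wherever the data ends up, which would fit the split between Storage and Formats in the table.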

Release history

Version | Release date
------------------------ | ------------------------
0.5.0 (this version) | Feb 18, 2016
0.2.2 | Nov 25, 2015
0.2.1 | Nov 25, 2015
0.2.0 | Nov 25, 2015

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

scraptor-0.5.0-py2-none-any.whl (11.7 kB)

Uploaded Feb 18, 2016 Python 2

File details

Details for the file scraptor-0.5.0-py2-none-any.whl.

File metadata

Download URL: scraptor-0.5.0-py2-none-any.whl
Upload date: Feb 18, 2016
Size: 11.7 kB
Tags: Python 2
Uploaded using Trusted Publishing? No

File hashes

Hashes for scraptor-0.5.0-py2-none-any.whl

Algorithm | Hash digest
------------------------ | ------------------------
SHA256 | `5d6e77c14219d4e5f8605e1fb7756970c8cc8e6200d8d5ce984a1dbf5a83a321`
MD5 | `179085b8a34d2705031505e1c4519b7e`
BLAKE2b-256 | `0162724795ac34598463ae41dfe0cafead1759ce70cc27b4f8f07dd6f2d63ddf`
