Ramby

Ramby is a simple way to set up a web scraper.

Installation

pip install ramby

Examples

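from ramby import Ramby

# Load the scraping rules from a YAML configuration file
scraper = Ramby('./examples/hackernews.yaml')
# Scrape a page whose host matches the one in the configuration
data = scraper.scrape("https://news.ycombinator.com/item?id=32237445")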

Configuration

A configuration file needs two fields, HOST and RULES.
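
As a rough sanity check before handing a file to Ramby, the two required keys can be verified on the parsed configuration. This is only a sketch and not part of Ramby: `missing_fields` is a hypothetical helper, and it assumes the YAML has already been parsed into a plain dict by a YAML parser of your choice.

```python
# Hypothetical helper, not part of Ramby's API.
REQUIRED_FIELDS = ("host", "rules")


def missing_fields(config):
    """Return the required top-level keys that are absent from a parsed config."""
    return [field for field in REQUIRED_FIELDS if field not in config]


# A config with `rules` left out on purpose.
config = {"host": "news.ycombinator.com"}
print(missing_fields(config))
```

An empty list means both fields are present.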

HOST

HOST holds the base domain of the site you wish to scrape. Keep in mind that an error is thrown if you try to scrape a URL with a different host.

In practice, HOST is added to the configuration like so:

host: example.com
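
To illustrate the kind of host check described above, here is a minimal sketch using only the standard library. It is an assumption about the behaviour, not Ramby's actual implementation: `check_host` is a hypothetical helper, and the exception type Ramby raises may differ.

```python
from urllib.parse import urlparse


def check_host(url, host):
    """Raise ValueError if `url` does not point at `host` (hypothetical helper)."""
    actual = urlparse(url).hostname
    if actual != host:
        raise ValueError(f"expected host {host!r}, got {actual!r}")


# Matching host: returns without raising.
check_host("https://news.ycombinator.com/item?id=32237445", "news.ycombinator.com")

# Different host: the mismatch is reported as an error.
try:
    check_host("https://example.com/", "news.ycombinator.com")
except ValueError as err:
    print(err)
```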

RULES

A RULE is a way to target certain elements in a web page. For example, to select the titles of the top posts on Hacker News, you'd write:

host: news.ycombinator.com

rules:
    homepage:
        pattern: '/' # The `/` path signifies we use the `homepage` rule
        topics:    # This denotes a section of the homepage, making it easy to add other objects if needed, e.g. all_authors
            title: # An object property
                selector: '.athing .title > a' # The title target
                text: true                     # We want the text inside the target element
                # html: true is optional
                count: 2                       # The number of elements to return
                attrs:                         # Specify the HTML attributes you want
                    - href                     # Also take the link to the post

Sample returned object based on the rules above:

{'topics': {'title': {0: {'attrs': {'href': 'https://paulbutler.org/2022/why-is-it-so-hard-to-give-google-money/'},
                          'text': 'Why is it so hard to give Google money?'},
                      1: {'attrs': {'href': 'https://mullvad.net/en/blog/2022/7/26/mullvad-is-now-available-on-amazon-us-se/'},
                          'text': 'Mullvad is now available on Amazon'}}}}
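
The returned object is a nested dict keyed by section, property, and match index, so it can be walked with ordinary dict access. The sketch below copies the sample result above into a literal instead of calling Ramby; `titles_with_links` is a hypothetical helper name.

```python
# Result shape copied from the sample above; in practice this would be
# the value returned by scraper.scrape(...).
data = {
    "topics": {
        "title": {
            0: {"attrs": {"href": "https://paulbutler.org/2022/why-is-it-so-hard-to-give-google-money/"},
                "text": "Why is it so hard to give Google money?"},
            1: {"attrs": {"href": "https://mullvad.net/en/blog/2022/7/26/mullvad-is-now-available-on-amazon-us-se/"},
                "text": "Mullvad is now available on Amazon"},
        }
    }
}


def titles_with_links(result):
    """Return (text, href) pairs for every matched title, in index order."""
    matches = result["topics"]["title"]
    return [(matches[i]["text"], matches[i]["attrs"]["href"]) for i in sorted(matches)]


for text, href in titles_with_links(data):
    print(f"{text} -> {href}")
```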

And if you choose to scrape a post and its comments:

host: news.ycombinator.com

rules:
    homepage:
        pattern: '/' # The `/` path signifies we use the `homepage` rule
        topics:    # This denotes a section of the homepage, making it easy to add other objects if needed, e.g. all_authors
            title: # An object property
                selector: '.athing .title > a' # The title target
                text: true                     # We want the text inside the target element
                # html: true is optional
                count: 2                       # The number of elements to return
                attrs:                         # Specify the HTML attributes you want
                    - href                     # Also take the link to the post
                  
    posts:
        pattern: /item/
        post:
            title: 
                selector: '.fatitem:first-child .title > a'
                count: 1
                text: true
                attrs: 
                    - href 

        comments:
            texts:
                selector: '.comment .commtext'
                count: 2
                text: true

Sample returned object based on the rules above:

{'comments': {'texts': {0: {'text': 'Wonder how much money & resources Shopify '
                                    'spent on all of their NFT features & '
                                    'integrations over the last months, how '
                                    'many people worked on it and how many of '
                                    "those are part of the lay-off now. I'd "
                                    "guess the support you'd need to provide "
                                    'for it and their tokengated commerce '
                                    "isn't little either.Tobi removed all the "
                                    'NFT stuff from his Twitter profile and '
                                    "didn't tweet much about it for months "
                                    'now, after being pretty vocal about it '
                                    'until earlier this year.Would love to '
                                    'hear his real thoughts on it and why '
                                    'he/they even (seemingly) invested so much '
                                    'into it. One of the few things I never '
                                    'got about Tobi / Shopify. Just seemed so '
                                    'late and weird to be so bullish there. '
                                    "Don't think he's the kind of person to "
                                    'push it just for personal gain, nor that '
                                    "he'd have to, but ..."},
                        1: {'text': 'I’m honestly still in disbelief at how '
                                    'many very smart people fell for the NFT '
                                    'trap. If you’ve spent even a single bull '
                                    'cycle in the crypto community you could '
                                    'tell right away NFTs we’re ICO level '
                                    'scams. The mental gymnastics very smart '
                                    'and technical people performed to '
                                    'rationalize paying for a jpeg still makes '
                                    'me question reality. I participate in '
                                    'crypto because I take a calculated risk, '
                                    'and I’m comfortable gambling. People who '
                                    'actually think something like an NFT has '
                                    'any real value still messes with my head. '
                                    'I really can’t grasp how they actually '
                                    'believe this. And yes, I understand '
                                    'technically how NFTs work.'}}},
 'post': {'title': {0: {'attrs': {'href': 'https://www.wsj.com/articles/shopify-to-lay-off-10-of-workers-in-broad-shake-up-11658839047'},
                        'text': 'Shopify to lay off 10% of workers in broad '
                                'shake-up'}}}}
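
Because matches are keyed by integer index, serializing the result with `json.dumps` turns those keys into strings. One possible way to get a more conventional structure is to convert the index-keyed dicts into lists first. The sketch below is not part of Ramby: `to_lists` is a hypothetical helper, and the comment texts in `data` are shortened stand-ins for the real output above.

```python
import json


def to_lists(node):
    """Recursively turn {0: a, 1: b, ...} dicts into [a, b, ...] lists."""
    if isinstance(node, dict):
        converted = {key: to_lists(value) for key, value in node.items()}
        # Only non-empty dicts whose keys are all integers are treated as match lists.
        if converted and all(isinstance(key, int) for key in converted):
            return [converted[key] for key in sorted(converted)]
        return converted
    return node


# Abbreviated stand-in for the result shown above (comment text shortened).
data = {
    "comments": {"texts": {0: {"text": "first comment"}, 1: {"text": "second comment"}}},
    "post": {"title": {0: {"attrs": {"href": "https://www.wsj.com/articles/shopify-to-lay-off-10-of-workers-in-broad-shake-up-11658839047"},
                           "text": "Shopify to lay off 10% of workers in broad shake-up"}}},
}

print(json.dumps(to_lists(data), indent=2))
```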

See more examples here
