Ramby is a simple way to set up a web scraper
Ramby
Ramby is a simple way to set up a web scraper.
Installation
pip install ramby
Examples
from ramby import Ramby
scraper = Ramby('./examples/hackernews.yaml')
data = scraper.scrape("https://news.ycombinator.com/item?id=32237445")
Configuration
A configuration file needs two fields: HOST and RULES.
HOST
The HOST holds the base domain of the site you wish to scrape. Keep in mind that an error is thrown if you try to scrape a URL with a different HOST.
In practice, the HOST is added to the configuration like so:
host: example.com
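The host check described above can be sketched in a few lines. This is only an illustration under stated assumptions, not Ramby's actual implementation: the helper name `check_host` and the use of `ValueError` are hypothetical, and Ramby may raise a different error type.

```python
from urllib.parse import urlparse


def check_host(url, host):
    """Raise ValueError if the URL's hostname differs from the configured host.

    Hypothetical helper for illustration; not part of Ramby's API.
    """
    actual = urlparse(url).hostname
    if actual != host:
        raise ValueError(
            f"URL host {actual!r} does not match configured host {host!r}"
        )


# A URL on the configured host passes silently.
check_host("https://example.com/some/page", "example.com")
```

Calling `check_host` with a URL on another domain would raise the error, which mirrors the behaviour the README describes for a mismatched HOST.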
RULES
A RULE is a way to target certain elements in a web page. For example, to select the titles of the top posts on Hacker News, you would write:
host: news.ycombinator.com
rules:
  homepage:
    pattern: '/' # The `/` path signifies we use the `homepage` rule
    topics: # A section of the homepage; makes it easy to add other objects if needed, e.g. all_authors
      title: # An object property
        selector: '.athing .title > a' # The title target
        text: true # We want the text inside the target element
        # html: true is optional
        count: 2 # The number of elements to return
        attrs: # The HTML attributes you want
          - href # Also take the link to the post
Sample returned object based on the rules above:
{'topics': {'title': {0: {'attrs': {'href': 'https://paulbutler.org/2022/why-is-it-so-hard-to-give-google-money/'},
'text': 'Why is it so hard to give Google money?'},
1: {'attrs': {'href': 'https://mullvad.net/en/blog/2022/7/26/mullvad-is-now-available-on-amazon-us-se/'},
'text': 'Mullvad is now available on Amazon'}}}}
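To make the shape of the result concrete, here is a small sketch that walks the sample output above and pulls out each title with its link. The helper name `title_links` is hypothetical and the code assumes the nested `section -> property -> index` layout shown in the sample; it does not call Ramby itself.

```python
# Sample result copied from the output above.
data = {
    "topics": {
        "title": {
            0: {
                "attrs": {"href": "https://paulbutler.org/2022/why-is-it-so-hard-to-give-google-money/"},
                "text": "Why is it so hard to give Google money?",
            },
            1: {
                "attrs": {"href": "https://mullvad.net/en/blog/2022/7/26/mullvad-is-now-available-on-amazon-us-se/"},
                "text": "Mullvad is now available on Amazon",
            },
        }
    }
}


def title_links(result):
    """Return (text, href) pairs in index order. Hypothetical helper."""
    matches = result["topics"]["title"]
    return [
        (matches[i]["text"], matches[i]["attrs"]["href"])
        for i in sorted(matches)
    ]


for text, href in title_links(data):
    print(f"{text} -> {href}")
```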
To scrape a post and its comments, add a second rule:
host: news.ycombinator.com
rules:
  homepage:
    pattern: '/' # The `/` path signifies we use the `homepage` rule
    topics: # A section of the homepage; makes it easy to add other objects if needed, e.g. all_authors
      title: # An object property
        selector: '.athing .title > a' # The title target
        text: true # We want the text inside the target element
        # html: true is optional
        count: 2 # The number of elements to return
        attrs: # The HTML attributes you want
          - href # Also take the link to the post
  posts:
    pattern: /item/
    post:
      title:
        selector: '.fatitem:first-child .title > a'
        count: 1
        text: true
        attrs:
          - href
    comments:
      texts:
        selector: '.comment .commtext'
        count: 2
        text: true
Sample returned object based on the rules above:
{'comments': {'texts': {0: {'text': 'Wonder how much money & resources Shopify '
'spent on all of their NFT features & '
'integrations over the last months, how '
'many people worked on it and how many of '
"those are part of the lay-off now. I'd "
"guess the support you'd need to provide "
'for it and their tokengated commerce '
"isn't little either.Tobi removed all the "
'NFT stuff from his Twitter profile and '
"didn't tweet much about it for months "
'now, after being pretty vocal about it '
'until earlier this year.Would love to '
'hear his real thoughts on it and why '
'he/they even (seemingly) invested so much '
'into it. One of the few things I never '
'got about Tobi / Shopify. Just seemed so '
'late and weird to be so bullish there. '
"Don't think he's the kind of person to "
'push it just for personal gain, nor that '
"he'd have to, but ..."},
1: {'text': 'I’m honestly still in disbelief at how '
'many very smart people fell for the NFT '
'trap. If you’ve spent even a single bull '
'cycle in the crypto community you could '
'tell right away NFTs we’re ICO level '
'scams. The mental gymnastics very smart '
'and technical people performed to '
'rationalize paying for a jpeg still makes '
'me question reality. I participate in '
'crypto because I take a calculated risk, '
'and I’m comfortable gambling. People who '
'actually think something like an NFT has '
'any real value still messes with my head. '
'I really can’t grasp how they actually '
'believe this. And yes, I understand '
'technically how NFTs work.'}}},
'post': {'title': {0: {'attrs': {'href': 'https://www.wsj.com/articles/shopify-to-lay-off-10-of-workers-in-broad-shake-up-11658839047'},
'text': 'Shopify to lay off 10% of workers in broad '
'shake-up'}}}}
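A result with several sections, like the one above, can be flattened into a simpler record for downstream use. The sketch below assumes the same nested layout as the sample; `summarize_post` is a hypothetical helper, and the comment texts in the stand-in data are shortened placeholders rather than real output.

```python
def summarize_post(result):
    """Flatten a result shaped like the sample above into one record.

    Hypothetical helper for illustration; not part of Ramby's API.
    """
    title = result["post"]["title"][0]
    comments = result["comments"]["texts"]
    return {
        "title": title["text"],
        "link": title["attrs"]["href"],
        "comments": [comments[i]["text"] for i in sorted(comments)],
    }


# Stand-in data with the same shape as the sample (comment texts are placeholders).
example = {
    "comments": {
        "texts": {
            0: {"text": "first comment (placeholder)"},
            1: {"text": "second comment (placeholder)"},
        }
    },
    "post": {
        "title": {
            0: {
                "attrs": {"href": "https://www.wsj.com/articles/shopify-to-lay-off-10-of-workers-in-broad-shake-up-11658839047"},
                "text": "Shopify to lay off 10% of workers in broad shake-up",
            }
        }
    },
}

record = summarize_post(example)
print(record["title"], len(record["comments"]))
```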
See more examples here