Hodor
A simple XPath/CSS-based HTML scraper with pagination support.
Install
pip install hodorlive
Usage
As python package
WARNING: By default this package does not verify SSL connections. Check the arguments (`ssl_verify`) to enable verification.
Sample code
```python
from hodor import Hodor
from dateutil.parser import parse


def date_convert(data):
    return parse(data)


url = 'http://www.nasdaq.com/markets/stocks/symbol-change-history.aspx'

CONFIG = {
    'old_symbol': {
        'css': '#SymbolChangeList_table tr td:nth-child(1)',
        'many': True
    },
    'new_symbol': {
        'css': '#SymbolChangeList_table tr td:nth-child(2)',
        'many': True
    },
    'effective_date': {
        'css': '#SymbolChangeList_table tr td:nth-child(3)',
        'many': True,
        'transform': date_convert
    },
    '_groups': {
        'data': '__all__',
        'ticker_changes': ['old_symbol', 'new_symbol']
    },
    '_paginate_by': {
        'xpath': '//*[@id="two_column_main_content_lb_NextPage"]/@href',
        'many': False
    }
}

h = Hodor(url=url, config=CONFIG, pagination_max_limit=5)
h.data
```
Sample output
```python
{'data': [{'effective_date': datetime.datetime(2016, 11, 1, 0, 0),
           'new_symbol': 'ARNC',
           'old_symbol': 'AA'},
          {'effective_date': datetime.datetime(2016, 11, 1, 0, 0),
           'new_symbol': 'ARNC$',
           'old_symbol': 'AA$'},
          {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
           'new_symbol': 'MALN8',
           'old_symbol': 'AHUSDN2018'},
          {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
           'new_symbol': 'MALN9',
           'old_symbol': 'AHUSDN2019'},
          {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
           'new_symbol': 'MALQ6',
           'old_symbol': 'AHUSDQ2016'},
          {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
           'new_symbol': 'MALQ7',
           'old_symbol': 'AHUSDQ2017'},
          {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
           'new_symbol': 'MALQ8',
           'old_symbol': 'AHUSDQ2018'}]}
```
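The `_groups` entry above combines the parallel value lists produced by each rule into row dictionaries (one dict per table row). The following is a standalone sketch of that grouping idea, not Hodor's actual implementation; it assumes each grouped rule yields an equal-length list:

```python
def group_rows(extracted, fields):
    """Zip parallel per-rule value lists into a list of row dicts."""
    return [dict(zip(fields, row))
            for row in zip(*(extracted[f] for f in fields))]

# Parallel lists as a rule with 'many': True would produce them
extracted = {
    'old_symbol': ['AA', 'AA$'],
    'new_symbol': ['ARNC', 'ARNC$'],
}
rows = group_rows(extracted, ['old_symbol', 'new_symbol'])
# rows == [{'old_symbol': 'AA', 'new_symbol': 'ARNC'},
#          {'old_symbol': 'AA$', 'new_symbol': 'ARNC$'}]
```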
Arguments
- `ua` (User-Agent)
- `proxies` (check requesocks)
- `auth`
- `crawl_delay` (crawl delay in seconds across pagination; default: 3 seconds)
- `pagination_max_limit` (max number of pages to crawl; default: 100)
- `ssl_verify` (default: False)
- `robots` (if set, respects robots.txt; default: True)
- `reppy_capacity` (robots cache LRU capacity; default: 100)
- `trim_values` (if set, trims leading and trailing whitespace from output; default: True)
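The `crawl_delay` and `pagination_max_limit` arguments bound how the scraper follows `_paginate_by` links. A minimal sketch of that crawl loop, with hypothetical `fetch` and `next_url_of` callables standing in for Hodor's internal fetching and link extraction (this is not the library's actual code):

```python
import time


def crawl(fetch, start_url, next_url_of, max_pages=100, crawl_delay=0.0):
    """Follow 'next page' links until exhausted or max_pages is reached."""
    pages, url = [], start_url
    while url and len(pages) < max_pages:
        page = fetch(url)
        pages.append(page)
        url = next_url_of(page)  # None stops the crawl
        if url and crawl_delay:
            time.sleep(crawl_delay)  # be polite between page fetches
    return pages


# Toy site: each URL maps to (content, next_url)
site = {'/p1': ('data1', '/p2'), '/p2': ('data2', '/p3'), '/p3': ('data3', None)}
pages = crawl(lambda u: site[u], '/p1', lambda p: p[1], max_pages=2)
# Only 2 of the 3 pages are visited because max_pages=2
```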
Config parameters:
- By default, any key in the config is a rule to parse.
- Each rule can be either an `xpath` or a `css` selector.
- Each rule extracts `many` values by default, unless explicitly set to `False`.
- Each rule can `transform` the result with a function, if provided.
- Extra parameters include grouping (`_groups`) and pagination (`_paginate_by`), which also follows the rule format.
File details
Details for the file hodorlive-1.2.15.tar.gz.
File metadata
- Download URL: hodorlive-1.2.15.tar.gz
- Upload date:
- Size: 23.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: python-httpx/0.28.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `1a23db8a6910f03dcffaae02c20fa79f1f098f6fae46873a354e110fb48822eb` |
| MD5 | `7432ecf19234ed4f5668be5cf104b378` |
| BLAKE2b-256 | `9aa0ee2461d369d117f805def607697bafffafdd8c74b4817f53066fb69e9b2a` |
File details
Details for the file hodorlive-1.2.15-py3-none-any.whl.
File metadata
- Download URL: hodorlive-1.2.15-py3-none-any.whl
- Upload date:
- Size: 5.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: python-httpx/0.28.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `43bc29e940fb6174050d27d498253a05cba2cf5a11b87ad5d32f6edc8b64c715` |
| MD5 | `37ff9b8c379bf936d6177aa977e5311f` |
| BLAKE2b-256 | `a2f87b8cb7221f4c7b8bf6deaad6fe38adc4daacbdc1ba78f8e29077dba91e00` |