Hodor
A simple HTML scraper using XPath or CSS selectors, with pagination support.
Install

```shell
pip install hodorlive
```
Usage
As a Python package
WARNING: By default this package does not verify SSL connections. Check the arguments below to enable verification.
Sample code

```python
from hodor import Hodor
from dateutil.parser import parse


def date_convert(data):
    return parse(data)


url = 'http://www.nasdaq.com/markets/stocks/symbol-change-history.aspx'

CONFIG = {
    'old_symbol': {
        'css': '#SymbolChangeList_table tr td:nth-child(1)',
        'many': True
    },
    'new_symbol': {
        'css': '#SymbolChangeList_table tr td:nth-child(2)',
        'many': True
    },
    'effective_date': {
        'css': '#SymbolChangeList_table tr td:nth-child(3)',
        'many': True,
        'transform': date_convert
    },
    '_groups': {
        'data': '__all__',
        'ticker_changes': ['old_symbol', 'new_symbol']
    },
    '_paginate_by': {
        'xpath': '//*[@id="two_column_main_content_lb_NextPage"]/@href',
        'many': False
    }
}

h = Hodor(url=url, config=CONFIG, pagination_max_limit=5)
h.data
```

Sample output

```python
{'data': [{'effective_date': datetime.datetime(2016, 11, 1, 0, 0),
           'new_symbol': 'ARNC',
           'old_symbol': 'AA'},
          {'effective_date': datetime.datetime(2016, 11, 1, 0, 0),
           'new_symbol': 'ARNC$',
           'old_symbol': 'AA$'},
          {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
           'new_symbol': 'MALN8',
           'old_symbol': 'AHUSDN2018'},
          {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
           'new_symbol': 'MALN9',
           'old_symbol': 'AHUSDN2019'},
          {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
           'new_symbol': 'MALQ6',
           'old_symbol': 'AHUSDQ2016'},
          {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
           'new_symbol': 'MALQ7',
           'old_symbol': 'AHUSDQ2017'},
          {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
           'new_symbol': 'MALQ8',
           'old_symbol': 'AHUSDQ2018'}]}
```
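The `ticker_changes` group in the config pairs the `old_symbol` and `new_symbol` rules. Conceptually, grouping zips the per-rule value lists into row dicts; the snippet below is an illustrative sketch of that output shape, not Hodor's actual implementation:

```python
# Sketch of how a '_groups' entry pairs per-rule lists into row dicts.
# Sample values taken from the output above; this mimics the result
# shape only, not Hodor's internals.
old_symbol = ['AA', 'AA$']
new_symbol = ['ARNC', 'ARNC$']

ticker_changes = [
    {'old_symbol': o, 'new_symbol': n}
    for o, n in zip(old_symbol, new_symbol)
]
print(ticker_changes)
# → [{'old_symbol': 'AA', 'new_symbol': 'ARNC'},
#    {'old_symbol': 'AA$', 'new_symbol': 'ARNC$'}]
```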
Arguments
- ua (User-Agent)
- proxies (check requesocks)
- auth
- crawl_delay (crawl delay in seconds across pagination - default: 3 seconds)
- pagination_max_limit (max number of pages to crawl - default: 100)
- ssl_verify (default: False)
- robots (if set, respects robots.txt - default: True)
- reppy_capacity (robots cache LRU capacity - default: 100)
- trim_values (if set, trims leading and trailing whitespace from output - default: True)
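As a sketch, the documented arguments could be passed to the constructor like this. The values shown are illustrative assumptions, not required settings, and the `Hodor` call itself is commented out because it performs a network request:

```python
# The documented keyword arguments with their stated defaults; the
# values here are illustrative, not prescriptive.
kwargs = {
    'ua': 'my-scraper/1.0',        # custom User-Agent (illustrative value)
    'crawl_delay': 3,              # seconds between paginated requests (default: 3)
    'pagination_max_limit': 100,   # max pages to crawl (default: 100)
    'ssl_verify': True,            # enable SSL verification (default: False)
    'robots': True,                # respect robots.txt (default: True)
    'reppy_capacity': 100,         # robots cache LRU capacity (default: 100)
    'trim_values': True,           # trim whitespace from output (default: True)
}
# h = Hodor(url=url, config=CONFIG, **kwargs)  # performs network I/O
```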
Config parameters:
- By default, any key in the config is a rule to parse.
- Each rule can be either an xpath or a css selector.
- Each rule extracts many values by default, unless explicitly set to False.
- Each rule can transform the result with a function, if provided.
- Extra parameters include grouping (_groups) and pagination (_paginate_by), which also follow the rule format.