xpath/css based scraper with pagination
Hodor 
A simple html scraper with xpath or css.
Install
pip install hodorlive
Usage
As python package
WARNING: By default this package does not verify SSL connections. Set the ssl_verify argument (see Arguments) to enable verification.
Sample code
from hodor import Hodor
from dateutil.parser import parse


def date_convert(data):
    return parse(data)


url = 'http://www.nasdaq.com/markets/stocks/symbol-change-history.aspx'

CONFIG = {
    'old_symbol': {
        'css': '#SymbolChangeList_table tr td:nth-child(1)',
        'many': True
    },
    'new_symbol': {
        'css': '#SymbolChangeList_table tr td:nth-child(2)',
        'many': True
    },
    'effective_date': {
        'css': '#SymbolChangeList_table tr td:nth-child(3)',
        'many': True,
        'transform': date_convert
    },
    '_groups': {
        'data': '__all__',
        'ticker_changes': ['old_symbol', 'new_symbol']
    },
    '_paginate_by': {
        'xpath': '//*[@id="two_column_main_content_lb_NextPage"]/@href',
        'many': False
    }
}

h = Hodor(url=url, config=CONFIG, pagination_max_limit=5)
h.data
Sample output
{'data': [{'effective_date': datetime.datetime(2016, 11, 1, 0, 0),
           'new_symbol': 'ARNC',
           'old_symbol': 'AA'},
          {'effective_date': datetime.datetime(2016, 11, 1, 0, 0),
           'new_symbol': 'ARNC$',
           'old_symbol': 'AA$'},
          {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
           'new_symbol': 'MALN8',
           'old_symbol': 'AHUSDN2018'},
          {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
           'new_symbol': 'MALN9',
           'old_symbol': 'AHUSDN2019'},
          {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
           'new_symbol': 'MALQ6',
           'old_symbol': 'AHUSDQ2016'},
          {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
           'new_symbol': 'MALQ7',
           'old_symbol': 'AHUSDQ2017'},
          {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
           'new_symbol': 'MALQ8',
           'old_symbol': 'AHUSDQ2018'}]}
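One way to read the sample output: each rule yields a list of values, and a group combines the lists of its member fields into one record per row, with '__all__' standing for every rule. The sketch below mimics that shape in plain Python. The helper group_fields is hypothetical and written only for illustration; it is not Hodor's implementation, and the real grouping logic may differ.

```python
def group_fields(extracted, groups):
    """Zip per-field value lists into row dictionaries, one list per group.

    Hypothetical helper for illustration only; not part of Hodor.
    """
    grouped = {}
    for name, fields in groups.items():
        if fields == '__all__':
            fields = list(extracted)  # use every extracted field
        columns = [extracted[field] for field in fields]
        # zip(*columns) walks the columns row by row
        grouped[name] = [dict(zip(fields, row)) for row in zip(*columns)]
    return grouped


# Values copied from the first two rows of the sample output.
extracted = {
    'old_symbol': ['AA', 'AA$'],
    'new_symbol': ['ARNC', 'ARNC$'],
}
groups = {'data': '__all__', 'ticker_changes': ['old_symbol', 'new_symbol']}
result = group_fields(extracted, groups)
```

Here result['data'] and result['ticker_changes'] are both lists of two dictionaries, one per table row.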
Arguments
- ua (User-Agent)
- proxies (check requesocks)
- auth
- crawl_delay (crawl delay in seconds across pagination - default: 3 seconds)
- pagination_max_limit (max number of pages to crawl - default: 100)
- ssl_verify (default: False)
- robots (if set respects robots.txt - default: True)
- reppy_capacity (robots cache LRU capacity - default: 100)
- trim_values (if set trims output for leading and trailing whitespace - default: True)
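As a rough sketch of how these arguments might be combined, the snippet below collects a few of them in a dictionary and passes them to the constructor. It assumes every listed argument is accepted as a keyword argument, the way pagination_max_limit is in the sample code; the values and the build_scraper helper are illustrative choices, not recommendations from the package.

```python
# Argument names come from the list above; the values are illustrative.
HODOR_OPTIONS = {
    'ssl_verify': True,          # default is False, so enable it explicitly
    'crawl_delay': 5,            # seconds to wait between paginated requests
    'pagination_max_limit': 10,  # crawl at most 10 pages
    'robots': True,              # keep respecting robots.txt (the default)
    'trim_values': True,         # trim surrounding whitespace (the default)
}


def build_scraper(url, config):
    """Create a Hodor instance with the options above (hypothetical helper).

    Assumes the options are valid constructor keyword arguments.
    """
    # Imported lazily so this module loads even where hodor is not installed.
    from hodor import Hodor
    return Hodor(url=url, config=config, **HODOR_OPTIONS)
```

Keeping the options in one dictionary makes it easy to see at a glance that SSL verification has been switched on.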
Config parameters:
- By default any key in the config is a rule to parse.
  - Each rule is either an xpath or a css selector.
  - Each rule extracts many values by default unless many is explicitly set to False.
  - Each rule can transform its result with a function, if one is provided.
- Extra parameters include grouping (_groups) and pagination (_paginate_by); the pagination entry also follows the rule format.
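A small config that exercises each of these options might look like the sketch below. The selectors, the key names, and the normalize_symbol function are placeholders invented for illustration. It also assumes the transform function is called once per extracted value, which the sample output suggests but which is not stated explicitly.

```python
def normalize_symbol(value):
    """Hypothetical transform: strip whitespace and upper-case a ticker."""
    return value.strip().upper()


# Placeholder selectors; adjust them to the page being scraped.
CONFIG = {
    'title': {
        'xpath': '//h1/text()',
        'many': False,  # expect a single value instead of a list
    },
    'symbols': {
        'css': 'table tr td:nth-child(1)',  # many is left at its default
        'transform': normalize_symbol,
    },
    '_paginate_by': {
        'xpath': '//a[@rel="next"]/@href',  # same rule format as above
        'many': False,
    },
}
```

The transform can be tried on its own before it is wired into a rule, for example normalize_symbol(' arnc ').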
Building & Publishing
Prerequisites
- Install uv.
- Review the uvx execution model for running tools without global installs.
- Hatch documentation: https://hatch.pypa.io/latest/.
Build workflow
Run the release helper to build and publish wheels and source archives via Hatch:
./upload.sh
The script shells out to uvx hatch build followed by uvx hatch publish so that Hatch is executed in an ephemeral environment.
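The contents of upload.sh are not reproduced here. As an approximation, the same two steps can be sketched in Python with subprocess; the release function and RELEASE_COMMANDS name are hypothetical and only mirror the build-then-publish order described above.

```python
import subprocess

# The two Hatch invocations described above, each run through uvx.
RELEASE_COMMANDS = [
    ['uvx', 'hatch', 'build'],
    ['uvx', 'hatch', 'publish'],
]


def release(run=subprocess.run):
    """Run each command in order; check=True stops at the first failure."""
    for command in RELEASE_COMMANDS:
        run(command, check=True)
```

Calling release() runs the real commands, so publishing only happens if the build step succeeds. The run parameter exists so the sequence can be exercised with a stand-in during a dry run.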
Publishing requirements
Configure credentials in ~/.pypirc as described in the PyPI configuration specification.
Example configuration:
[distutils]
index-servers =
    pypi
    testpypi

[pypi]
repository = https://upload.pypi.org/legacy/
username = __token__
password = <pypi-token>

[testpypi]
repository = https://test.pypi.org/legacy/
username = __token__
password = <testpypi-token>
Replace token placeholders with secrets from the team password manager and avoid committing the file to version control.
File details
Details for the file hodorlive-1.2.17.tar.gz.
File metadata
- Download URL: hodorlive-1.2.17.tar.gz
- Upload date:
- Size: 23.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: python-httpx/0.28.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 54a26e7322b1b64b117038c58625dc34f2810929b11d955b32aaaab1a3651248 |
| MD5 | 7c8f346ed5e579c328f70b61410b1d06 |
| BLAKE2b-256 | 55e4f21907dc770c3784218b7fdf1e33575c50a68f7f0b379159cf2e65666cba |
File details
Details for the file hodorlive-1.2.17-py3-none-any.whl.
File metadata
- Download URL: hodorlive-1.2.17-py3-none-any.whl
- Upload date:
- Size: 5.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: python-httpx/0.28.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | da021b8d5f39401df9bc0f5a9d09458ffc7d6ca8ceb30639e62ccb18d7867059 |
| MD5 | 7ee85475c61e27cb49cb4b9aea9e5295 |
| BLAKE2b-256 | 988489926f95ceebbcfecb0da3834260b1124e82975ddb7dea7ca146652aa812 |