HTTP GET with JS rendering support, to get the rendered HTML from a page easily
Project description
get-html: get raw or rendered HTML (for humans)
Read all the details on how I started implementing this at ⇝ Rendering-HTML_a-journey ⇜
This module is made for anyone needing to scrape HTML (i.e. scrape the web).
It knows how to do only one thing, but does it well: getting HTML from a web page. The kind of HTML is up to you. Either:
- the raw HTML, directly from the server or,
- the rendered HTML, after JS/AJAX calls.
Moreover, it is made such that you can use a unique method throughout your project, and switch between the two behaviors at launch by setting an environment variable.
HtmlRenderer: a class to seamlessly render a page
HtmlRenderer
handles all the specific pyppeteer
stuff for you. It is also thread-safe.
"sync" usage
Here is a typical usage:
from get_html import HtmlRenderer
renderer = HtmlRenderer()
try:
# use the renderer. The underlying browser will be instantiated on first call to render.
response = renderer.render(url='https://xkcd.com')
html = response.text # or resposne.content to get the raw bytes
# ... etc.
finally:
# close the underlying browser
renderer.close()
Or simply use a context manager:
from get_html import create_renderer
with create_renderer() as renderer:
# use the renderer. The underlying browser will be instanciated on first call to render
response = renderer.render(url='https://xkcd.com')
html = response.text # or resposne.content to get the raw bytes
# here, the underlying browser will be closed
If you need to manipulate the page before getting the content, pass an async function to render
.
It will be called after the page loaded, but before the HTML content is fetched.
For example:
from get_html import create_renderer
async def scroll_to_end(page):
# https://github.com/miyakogi/pyppeteer/issues/205#issuecomment-470886682
await page.evaluate('{window.scrollBy(0, document.body.scrollHeight);}')
with create_renderer() as renderer:
response = renderer.render('https://9gag.com', manipulate_page_func=scroll_to_end)
"async" usage
All public methods have an async counterpart. When using async, however, you need to ensure that
- the browser is created once,
- the browser is closed once.
By default, the browser is launched upon first use, usually on the first call of render
.
When using async, this may be a problem, as multiple coroutine will try to create the browser multiple times.
To avoid this, ensure you trigger the browser creation before launching the other tasks (same for closing, wait for all tasks to complete).
A concrete example is available in examples/async_example.py. Here is the gist:
import asyncio
from get_html import HtmlRenderer
loop = asyncio.get_event_loop()
renderer = HtmlRenderer()
# trigger the browser creation only once, before the tasks
loop.run_until_complete(renderer.async_browser)
# .. TASKS WITH RENDERING CALLS ...
# e.g. loop.run_until_complete(someRenderingTask())
# finally close the browser once the tasks completed
loop.run_until_complete(renderer.async_close())
do_get: seemlessly switch between behaviors
from get_html.env_defined_get import do_get
response = do_get('https://xkcd.com')
assert response.status_code == 200
html_bytes = response.content
html_string = response.text
The actual behavior of do_get
will depend on the environment variable RENDER_HTML
:
RENDER_HTML=[1|y|true|on]
:do_get
will launch a chromium instance under the hood and render the page (rendered HTML)RENDER_HTML=<anything BUT 2>
(default):do_get
will forward the call torequests.get
(raw HTML). Do NOT use2
before reading through the multi-threading section.
If rendering support is on, a browser instance will be launched on module load, and will be kept alive throughout the life of the application. Keep that in mind if you have low-memory (chromium !!).
Multi-threading
HtmlRenderer
is thread-safe.
For do_get
with rendering support, there are two possibilities.
- Create only one browser, shared by all threads. In this case, only one thread can execute
render
at a time (locking mechanism); - Create one browser per thread. In this case, threads can render in parallel. But be careful, each time a new thread calls
do_get
, a new browser is launched, that will keep running until the end of the program (or until you callget_html.env_defined_get.close()
).
Enable mode (2) by setting RENDER_HTML=2
. But again, ensure you don't have too many threads, since chromium needs a lot of memory.
Running tests
On Windows/Linux:
pip install tox
tox
On Mac (see https://github.com/tox-dev/tox/issues/1485):
pip install "tox<3.7"
tox
Common errors
libXX not found
(Linux)
sudo apt-get install gconf-service libasound2 libatk1.0-0 libatk-bridge2.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 ca-certificates fonts-liberation libappindicator1 libnss3 lsb-release xdg-utils wget
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file get-html-0.0.3.tar.gz
.
File metadata
- Download URL: get-html-0.0.3.tar.gz
- Upload date:
- Size: 10.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.4
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | c4903dc04972ef7530ee1b91def546b05288f8767143a02231fe055f893642f5 |
|
MD5 | 16f20475d00b36c733aeeeeeb0e0361e |
|
BLAKE2b-256 | 1f70b79b01ea50a326889a95620c3a32e679970946a7131ba59295cc9b745852 |