ichrome
Chrome controller for Humans, based on Chrome DevTools Protocol (CDP) and Python 3.7+. Read more: https://github.com/ClericPy/ichrome.
Why?
- Pyppeteer is awesome, but I don't need that much.
- The spelling of pyppeteer is confusing.
- Event-driven programming is not always advisable.
- Selenium is slow.
- Webdrivers often come with memory leaks.
- In desperate need of a stable toolkit to communicate with the Chrome browser (or other Blink-based browsers like Chromium).
- Fast HTTP & websocket connections (based on aiohttp) for asyncio environments.
- ichrome.debugger is a sync tool that depends on ichrome.async_utils, a handy choice for interactive debugging.
Features
- Chrome process daemon
  - auto-restart
  - command-line usage support
  - async environment compatible
- Connect to an existing Chrome
- Operations on tabs over a stable websocket
- Packages very commonly used functions
- ChromeEngine process pool utils
  - supports an HTTP API router with FastAPI
Install
Install from PyPI
pip install ichrome -U
Uninstall & Clear the user data dir
$ python3 -m ichrome --clean
$ pip uninstall ichrome
Download & unzip the latest version of Chromium browser
python3 -m ichrome --install="/home/root/chrome_bin"
WARNING:
- Install the missing system packages yourself. On Linux, use ldd chrome | grep not to check for missing libraries and install them, or see the link: Checking out and building Chromium on Linux.
- Add --no-sandbox to extra_config to avoid "No usable sandbox" errors, unless you really need the sandbox: Chromium: "Running without the SUID sandbox!" error - Ask Ubuntu. (See the sketch below.)
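A minimal sketch of passing that flag when launching the daemon in code. It assumes AsyncChromeDaemon accepts the same extra_config keyword that appears in the command-line and FastAPI examples later on this page; the other arguments are illustrative:

import asyncio
from ichrome import AsyncChromeDaemon

async def main():
    # raw chrome args such as --no-sandbox go into extra_config
    async with AsyncChromeDaemon(headless=True,
                                 extra_config=['--no-sandbox']) as cd:
        async with cd.connect_tab() as tab:
            print(await tab.goto('about:blank'))

asyncio.run(main())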
Have a Try?
Interactive Debugging (REPL Mode)
There are two ways to enter the REPL mode:
- run python3 -m ichrome -t
- or run await tab.repl() in your code (see the sketch after the session below)
λ python3 -m ichrome -t
>>> tab.goto('https://github.com/ClericPy')
True
>>> title = tab.title
>>> title
'ClericPy (ClericPy) · GitHub'
>>> tab.click('.pinned-item-list-item-content [href="/ClericPy/ichrome"]')
Tag(a)
>>> tab.wait_loading(2)
True
>>> tab.wait_loading(2)
False
>>> tab.js('document.body.innerHTML="Updated"')
{'type': 'string', 'value': 'Updated'}
>>> tab.history_back()
True
>>> tab.set_html('hello world')
{'id': 21, 'result': {}}
>>> tab.set_ua('no UA')
{'id': 22, 'result': {}}
>>> tab.goto('http://httpbin.org/user-agent')
True
>>> tab.html
'<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{\n "user-agent": "no UA"\n}\n</pre></body></html>'
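For the second option, a minimal sketch of dropping into the REPL from a script; the daemon and tab setup mirrors the Quick Start below:

import asyncio
from ichrome import AsyncChromeDaemon

async def main():
    async with AsyncChromeDaemon(headless=False) as cd:
        async with cd.connect_tab() as tab:
            await tab.goto('https://github.com/ClericPy')
            # interactive debugging against the connected tab
            await tab.repl()

asyncio.run(main())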
Quick Start
- Start a new chrome daemon process with headless=False:
  python -m ichrome
- Then connect to the existing chrome instance (a short sketch of this follows the full example below):
  async with AsyncChrome() as cd:
  or launching the chrome daemon in code may be a better choice:
  async with AsyncChromeDaemon() as cd:
      async with cd.connect_tab() as tab:
- Operate on the tabs: new tab, wait loading, run javascript, get html, close tab.
- Close the browser GRACEFULLY instead of killing the process.
from ichrome import AsyncChromeDaemon
import asyncio

async def main():
    async with AsyncChromeDaemon(headless=0, disable_image=False) as cd:
        # index: 0=current activate tab, 1=tab 1, None=new tab, $URL=new tab for url
        async with cd.connect_tab(index=0, auto_close=True) as tab:
            # tab: AsyncTab
            await tab.alert(
                'Now using the default tab and goto the url, click to continue.')
            print(await tab.goto('https://github.com/ClericPy/ichrome',
                                 timeout=5))
            # wait tag appeared
            await tab.wait_tag('[data-content="Issues"]', max_wait_time=5)
            await tab.alert(
                'Here the Issues tag appeared, I will click that button.')
            # click the issues tag
            await tab.click('[data-content="Issues"]')
            await tab.wait_tag('#js-issues-search')
            await tab.alert('Now will input some text and search the issues.')
            await tab.mouse_click_element_rect('#js-issues-search')
            await tab.keyboard_send(string='chromium')
            await tab.js(
                r'''document.querySelector('[role="search"]').submit()''')
            await tab.wait_loading(5)
            await asyncio.sleep(2)
            await tab.alert('demo finished.')
            # start repl mode?
            # await tab.repl()
            # no need to close tab for auto_close=True
            # await tab.close()
            # close browser gracefully
            # await cd.close_browser()
            print('clearing the user data cache.')
            await cd.clear_user_data_dir()

if __name__ == "__main__":
    asyncio.run(main())
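If a chrome daemon is already running (for example launched with python -m ichrome), you can connect to it instead of starting one in code; a minimal sketch with AsyncChrome, assuming the default host and port:

import asyncio
from ichrome import AsyncChrome

async def main():
    # connect to a chrome already listening on 127.0.0.1:9222
    async with AsyncChrome() as chrome:
        async with chrome.connect_tab(0) as tab:
            print(await tab.current_title)

asyncio.run(main())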
Listen to the network
import asyncio
from ichrome import AsyncChromeDaemon

async def main():
    async with AsyncChromeDaemon() as cd:
        async with cd.connect_tab(index=0) as tab:

            def filter_function(r):
                try:
                    url = r['params']['response']['url']
                    return url == 'http://httpbin.org/get'
                except KeyError:
                    pass

            async with tab.wait_response_context(
                    filter_function=filter_function,
                    timeout=5,
            ) as r:
                await tab.goto('http://httpbin.org/get')
                result = await r
                if result:
                    print(result['data'])
                    assert 'User-Agent' in result['data']

if __name__ == "__main__":
    asyncio.run(main())
Example Code: examples_async.py & Classic Use Cases
AsyncChrome feature list
- server: returns f"http://{self.host}:{self.port}", such as http://127.0.0.1:9222.
- version: version info from /json/version, formatted like: {'Browser': 'Chrome/77.0.3865.90', 'Protocol-Version': '1.3', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36', 'V8-Version': '7.7.299.11', 'WebKit-Version': '537.36 (@58c425ba843df2918d9d4b409331972646c393dd)', 'webSocketDebuggerUrl': 'ws://127.0.0.1:9222/devtools/browser/b5fbd149-959b-4603-b209-cfd26d66bdc1'}
- connect / check / ok: check whether the browser is alive.
- get_tabs / tabs / get_tab / get_tabs: get the AsyncTab instances from /json.
- new_tab / activate_tab / close_tab / close_tabs: operate on tabs.
- close_browser: find the activated tab and send the Browser.close message, closing the connected chrome browser gracefully.
  await chrome.close_browser()
- kill: force kill the chrome process listening on self.port.
  await chrome.kill()
- connect_tabs: connect websockets for multiple tabs in one with context, and disconnect before exiting.
  tab0: AsyncTab = (await chrome.tabs)[0]
  tab1: AsyncTab = await chrome.new_tab()
  async with chrome.connect_tabs([tab0, tab1]):
      assert (await tab0.current_url) == 'about:blank'
      assert (await tab1.current_url) == 'about:blank'
- connect_tab: the easiest way to get a connected tab.
  Get an existing tab:
  async with chrome.connect_tab(0) as tab:
      print(await tab.current_title)
  Get a new tab and auto close it:
  async with chrome.connect_tab(None, True) as tab:
      print(await tab.current_title)
  Get a new tab with a given url and auto close it:
  async with chrome.connect_tab('http://python.org', True) as tab:
      print(await tab.current_title)
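A minimal sketch combining several of the methods above; it assumes a chrome daemon is already running on the default port 9222:

import asyncio
from ichrome import AsyncChrome, AsyncTab

async def main():
    async with AsyncChrome() as chrome:
        print(chrome.server)  # e.g. http://127.0.0.1:9222
        tab0: AsyncTab = (await chrome.tabs)[0]
        tab1: AsyncTab = await chrome.new_tab()
        async with chrome.connect_tabs([tab0, tab1]):
            await tab1.set_url('https://pypi.org', timeout=5)
            print(await tab1.current_title)
            # close the new tab over its websocket connection
            await tab1.close()

asyncio.run(main())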
AsyncTab feature list
- set_url / reload: navigate to a new url (returns bool for whether the load finished), or send the Page.reload message.
- wait_event: listen for events with a given name, separate them from other same-name events with filter_function, and finally run callback_function with the result.
- wait_page_loading / wait_loading: wait for the Page.loadEventFired event, or stop loading on timeout. Different from wait_loading_finished.
- wait_response / wait_request: filter the Network.responseReceived / Network.requestWillBeSent events with filter_function, and return the request_dict which can be used by get_response / get_response_body / get_request_post_data. WARNING: a fired requestWillBeSent event does not mean the response is ready; you should await tab.wait_request_loading(request_dict) or await tab.get_response(request_dict, wait_loading=True) (see the sketch after this list).
- wait_request_loading / wait_loading_finished: sometimes wait_response returns the request_dict while the ajax request is still fetching, which needs to wait for the Network.loadingFinished event.
- activate / activate_tab: activate the tab with a websocket / http message.
- close / close_tab: close the tab with a websocket / http message.
- add_js_onload: Page.addScriptToEvaluateOnNewDocument, which means this javascript code will run before the page is loaded.
- clear_browser_cache / clear_browser_cookies: Network.clearBrowserCache and Network.clearBrowserCookies.
- querySelectorAll: get the Tag instances, which contain the tagName, innerHTML, outerHTML, textContent, attributes attrs.
- click: click the element queried by the given css selector.
- refresh_tab_info: refresh the init attrs: url, title.
- current_html / current_title / current_url: get the current html / title / url with tab.js, or use the refresh_tab_info method and the init attrs.
- crash: Page.crash.
- get_cookies / get_all_cookies / delete_cookies / set_cookie: page cookie operations.
- set_headers / set_ua: Network.setExtraHTTPHeaders and Network.setUserAgentOverride, used to update headers dynamically.
- close_browser: send the Browser.close message to close the chrome browser gracefully.
- get_bounding_client_rect / get_element_clip: get_element_clip is an alias for the other; both get the rect of the element queried by the given css selector.
- screenshot / screenshot_element: get the screenshot as base64-encoded image data. screenshot_element takes a css selector to locate the element.
- get_page_size / get_screen_size: size of the current window or the whole screen.
- get_response: get the response body with the given request dict.
- js: run the given js code, and return the raw response from sending the Runtime.evaluate message.
- inject_js_url: inject a js url, like <script src="xxx/static/js/jquery.min.js"></script> does.
- get_value & get_variable: run the given js variable or expression, and return the result.
  await tab.get_value('document.title')
  await tab.get_value("document.querySelector('title').innerText")
- keyboard_send: dispatch key events with Input.dispatchKeyEvent.
- mouse_click: dispatch a click event at the given position.
- mouse_drag: dispatch a drag event at the given position, and return the target x, y. The duration arg slows down the move speed.
- mouse_drag_rel: dispatch a drag event with the given offset, and return the target x, y.
- mouse_drag_rel_chain: drag with offsets continuously.
  await tab.set_url('https://draw.yunser.com/')
  walker = await tab.mouse_drag_rel_chain(320, 145).move(50, 0, 0.2).move(
      0, 50, 0.2).move(-50, 0, 0.2).move(0, -50, 0.2)
  await walker.move(50 * 1.414, 50 * 1.414, 0.2)
- mouse_press / mouse_release / mouse_move / mouse_move_rel / mouse_move_rel_chain: similar to the drag features. These mouse features only dispatch events; they are not real mouse actions.
- history_back / history_forward / goto_history_relative / reset_history: back / forward through history.
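A minimal sketch of the wait_response pattern warned about above, assuming wait_response accepts the same filter_function and timeout keywords as the wait_response_context example earlier; the URL and filter are illustrative:

import asyncio
from ichrome import AsyncChromeDaemon

async def main():
    async with AsyncChromeDaemon(headless=True) as cd:
        async with cd.connect_tab() as tab:

            def filter_function(event):
                # keep only the responseReceived event for the target url
                try:
                    return event['params']['response']['url'] == 'http://httpbin.org/get'
                except KeyError:
                    return False

            # start waiting before triggering the request
            waiter = asyncio.ensure_future(
                tab.wait_response(filter_function=filter_function, timeout=10))
            await tab.goto('http://httpbin.org/get')
            request_dict = await waiter
            if request_dict:
                # the event alone does not mean the body is ready,
                # so ask get_response to wait for loading to finish
                result = await tab.get_response(request_dict, wait_loading=True)
                print(result)

asyncio.run(main())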
Command Line Usage (Daemon Mode)
Used for launching a chrome daemon process. Any unhandled args are treated as raw chrome args and appended to the extra_config list.
Shut down the Chrome process on the given port
λ python3 -m ichrome -s 9222
2018-11-27 23:01:59 DEBUG [ichrome] base.py(329): kill chrome.exe --remote-debugging-port=9222
2018-11-27 23:02:00 DEBUG [ichrome] base.py(329): kill chrome.exe --remote-debugging-port=9222
Launch a Chrome daemon process
λ python3 -m ichrome -p 9222 --start_url "http://bing.com" --disable_image
2018-11-27 23:03:57 INFO [ichrome] __main__.py(69): ChromeDaemon cmd args: {'daemon': True, 'block': True, 'chrome_path': '', 'host': 'localhost', 'port': 9222, 'headless': False, 'user_agent': '', 'proxy': '', 'user_data_dir': None, 'disable_image': True, 'start_url': 'http://bing.com', 'extra_config': '', 'max_deaths': 1, 'timeout': 2}
Crawl the given URL, output the HTML DOM
λ python3 -m ichrome --crawl --headless --timeout=2 http://api.ipify.org/
<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">38.143.68.66</pre></body></html>
To use the default user dir (ignoring the ichrome user-dir settings), make sure the existing Chrome processes are closed first
λ python -m ichrome -U null
Details:
$ python3 -m ichrome --help
usage:
All the unknown args will be appended to extra_config as chrome original args.
Maybe you can have a try by typing: `python3 -m ichrome --try`
Demo:
> python -m ichrome -H 127.0.0.1 -p 9222 --window-size=1212,1212 --incognito
> ChromeDaemon cmd args: port=9222, {'chrome_path': '', 'host': '127.0.0.1', 'headless': False, 'user_agent': '', 'proxy': '', 'user_data_dir': WindowsPath('C:/Users/root/ichrome_user_data'), 'disable_image': False, 'start_url': 'about:blank', 'extra_config': ['--window-size=1212,1212', '--incognito'], 'max_deaths': 1, 'timeout':1, 'proc_check_interval': 5, 'debug': False}
> python -m ichrome
> ChromeDaemon cmd args: port=9222, {'chrome_path': '', 'host': '127.0.0.1', 'headless': False, 'user_agent': '', 'proxy': '', 'user_data_dir': WindowsPath('C:/Users/root/ichrome_user_data'), 'disable_image': False, 'start_url': 'about:blank', 'extra_config': [], 'max_deaths': 1, 'timeout': 1, 'proc_check_interval': 5, 'debug': False}
Other operations:
1. kill local chrome process with given port:
python -m ichrome -s 9222
python -m ichrome -k 9222
2. clear user_data_dir path (remove the folder and files):
python -m ichrome --clear
python -m ichrome --clean
python -m ichrome -C -p 9222
3. show ChromeDaemon.__doc__:
python -m ichrome --doc
4. crawl the URL, output the HTML DOM:
python -m ichrome --crawl --headless --timeout=2 http://myip.ipip.net/
optional arguments:
-h, --help show this help message and exit
-v, -V, --version ichrome version info
-c CONFIG, --config CONFIG
load config dict from JSON file of given path
-cp CHROME_PATH, --chrome-path CHROME_PATH, --chrome_path CHROME_PATH
chrome executable file path, default to null for
automatic searching
-H HOST, --host HOST --remote-debugging-address, default to 127.0.0.1
-p PORT, --port PORT --remote-debugging-port, default to 9222
--headless --headless and --hide-scrollbars, default to False
-s SHUTDOWN, -k SHUTDOWN, --shutdown SHUTDOWN
shutdown the given port, only for local running chrome
-A USER_AGENT, --user-agent USER_AGENT, --user_agent USER_AGENT
--user-agent, default to Chrome PC: Mozilla/5.0
(Linux; Android 6.0; Nexus 5 Build/MRA58N)
AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/83.0.4103.106 Mobile Safari/537.36
-x PROXY, --proxy PROXY
--proxy-server, default to None
-U USER_DATA_DIR, --user-data-dir USER_DATA_DIR, --user_data_dir USER_DATA_DIR
user_data_dir to save user data, default to
~/ichrome_user_data
--disable-image, --disable_image
disable image for loading performance, default to
False
-url START_URL, --start-url START_URL, --start_url START_URL
start url while launching chrome, default to
about:blank
--max-deaths MAX_DEATHS, --max_deaths MAX_DEATHS
restart times. default to 1 for without auto-restart
--timeout TIMEOUT timeout to connect the remote server, default to 1 for
localhost
-w WORKERS, --workers WORKERS
the number of worker processes, default to 1
--proc-check-interval PROC_CHECK_INTERVAL, --proc_check_interval PROC_CHECK_INTERVAL
check chrome process alive every interval seconds
--crawl crawl the given URL, output the HTML DOM
-C, --clear, --clear clean user_data_dir
--doc show ChromeDaemon.__doc__
--debug set logger level to DEBUG
-K, --killall killall chrome launched local with --remote-debugging-
port
-t, --try, --demo, --repl
Have a try for ichrome with repl mode.
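The -c / --config option loads the daemon keyword arguments from a JSON file. A small sketch of writing and using such a file; the key names mirror the ChromeDaemon cmd args printed in the demo output above, but the exact set of accepted keys is an assumption here:

import json
import subprocess

# illustrative config; keys follow the ChromeDaemon cmd args shown above
config = {
    "host": "127.0.0.1",
    "port": 9222,
    "headless": True,
    "disable_image": True,
    "start_url": "about:blank",
}
with open("ichrome_config.json", "w") as f:
    json.dump(config, f)

# equivalent to passing the args on the command line
subprocess.run(["python3", "-m", "ichrome", "-c", "ichrome_config.json"])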
What's More?
As we know, Chrome browsers (including various webdriver versions) tend to have the following problems in long-running scenarios:
- memory leak
- missing websocket connections
- infinitely growing cache
- other unpredictable problems...
So you may need a more stable process pool with ChromeEngine (HTTP usage & normal usage):
ChromeEngine HTTP usage
Server
pip install -U ichrome[web]
import os
import uvicorn
from fastapi import FastAPI
from ichrome import AsyncTab
from ichrome.routers.fastapi_routes import ChromeAPIRouter

app = FastAPI()
# reset max_msg_size and window size for a large size screenshot
AsyncTab._DEFAULT_WS_KWARGS['max_msg_size'] = 10 * 1024**2
app.include_router(ChromeAPIRouter(workers_amount=os.cpu_count(),
                                   headless=True,
                                   extra_config=['--window-size=1920,1080']),
                   prefix='/chrome')

uvicorn.run(app)

# view url with your browser
# http://127.0.0.1:8000/chrome/screenshot?url=http://bing.com
# http://127.0.0.1:8000/chrome/download?url=http://bing.com
Client
from torequests import tPool
from inspect import getsource

req = tPool()

async def tab_callback(self, tab, data, timeout):
    await tab.set_url(data['url'], timeout=timeout)
    return (await tab.querySelector('h1')).text

r = req.post('http://127.0.0.1:8000/chrome/do',
             json={
                 'data': {
                     'url': 'http://httpbin.org/html'
                 },
                 'tab_callback': getsource(tab_callback),
                 'timeout': 10
             })
print(r.text)
# "Herman Melville - Moby-Dick"
ChromeEngine normal usage
Connect tab and do something
import asyncio
from ichrome.pool import ChromeEngine

def test_chrome_engine_connect_tab():

    async def _test_chrome_engine_connect_tab():
        async with ChromeEngine(port=9234, headless=True,
                                disable_image=True) as ce:
            async with ce.connect_tab(port=9234) as tab:
                await tab.goto('http://pypi.org')
                print(await tab.title)

    asyncio.get_event_loop().run_until_complete(
        _test_chrome_engine_connect_tab())

if __name__ == "__main__":
    test_chrome_engine_connect_tab()

# INFO 2020-10-13 22:18:53 [ichrome] pool.py(464): [enqueue](0) ChromeTask(<9234>, PENDING, id=1, tab=None), timeout=None, data=<ichrome.pool._TabWorker object at 0x000002232841D9A0>
# INFO 2020-10-13 22:18:55 [ichrome] pool.py(172): [online] ChromeWorker(<9234>, 0/5, 0 todos) is online.
# INFO 2020-10-13 22:18:55 [ichrome] pool.py(200): ChromeWorker(<9234>, 0/5, 0 todos) get a new task ChromeTask(<9234>, PENDING, id=1, tab=None).
# PyPI · The Python Package Index
# INFO 2020-10-13 22:18:57 [ichrome] pool.py(182): [offline] ChromeWorker(<9234>, 0/5, 0 todos) is offline.
# INFO 2020-10-13 22:18:57 [ichrome] pool.py(241): [finished](0) ChromeTask(<9234>, PENDING, id=1, tab=None)
Batch Tasks
import asyncio
from inspect import getsource
from ichrome.pool import ChromeEngine

async def tab_callback(self, tab, url, timeout):
    await tab.set_url(url, timeout=5)
    return await tab.title

def test_pool():

    async def _test_pool():
        async with ChromeEngine(max_concurrent_tabs=5,
                                headless=True,
                                disable_image=True) as ce:
            tasks = [
                asyncio.ensure_future(
                    ce.do('http://bing.com', tab_callback, timeout=10))
                for _ in range(3)
            ] + [
                asyncio.ensure_future(
                    ce.do('http://bing.com', getsource(tab_callback), timeout=10))
                for _ in range(3)
            ]
            for task in tasks:
                result = await task
                print(result)
                assert result

    # asyncio.run will raise an aiohttp issue: https://github.com/aio-libs/aiohttp/issues/4324
    asyncio.get_event_loop().run_until_complete(_test_pool())

if __name__ == "__main__":
    test_pool()
TODO
- Concurrent support (gevent, threading, asyncio).
- Add auto_restart on crash.
- Auto remove zombie tabs with a life cycle.
- Add some useful examples.
- Coroutine support (for asyncio).
- Standard test cases.
- Stable Chrome process pool.
- HTTP API server console with FastAPI.
- Complete the documentation.