Scrapes a website with iframes (Selenium, SeleniumBase, undetected chromedriver ...) and returns a DataFrame
Tested against Windows / Python 3.11 / Anaconda
pip install multiiframes2df
Scrapes the page loaded in the provided driver and processes the result into a DataFrame that contains each element and its children.
Args:
    driver: The driver used to scrape the data.
    filter_function: A function to filter the scraped data (default is None).
    chunks: The number of chunks to divide the data into for processing (default is 1).
    processes: The number of processes to use for parallel processing (default is 4).
    print_stdout: Boolean indicating whether to print stdout (default is False).
    print_stderr: Boolean indicating whether to print stderr (default is True).
Returns:
    pandas DataFrame: The processed and filtered data.
Example:
    from PrettyColorPrinter import add_printer  # optional, only for nicer DataFrame output
    from seleniumbase import Driver
    from multiiframes2df import fast_scrape

    add_printer(1)
    driver = Driver(uc=True, undetected=True)
    driver.get(r"https://testpages.herokuapp.com/styled/iframes-test.html")

    # Scrape everything, without a filter
    df = fast_scrape(
        driver=driver,
        filter_function=None,
        chunks=1,
        processes=4,
        print_stdout=False,
        print_stderr=True,
    )
    for name, group in df.groupby("aa_groupnumber"):
        print(name, group)

    # Scrape again, keeping only data that passes the filter
    df2 = fast_scrape(
        driver=driver,
        filter_function=lambda x: "List" in x and "<html>" not in x and "<body>" not in x,
        chunks=1,
        processes=4,
        print_stdout=False,
        print_stderr=True,
    )
    for name, group in df2.groupby("aa_groupnumber"):
        print(name, group)
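The lambda passed as filter_function in the second call can also be written as a named function, which is easier to test on its own. The following is only a minimal sketch: the name keep_list_fragments and the sample strings are hypothetical, and it assumes (the description does not say so explicitly) that the filter receives a string of HTML and returns True for data that should be kept.

```python
def keep_list_fragments(html: str) -> bool:
    """Return True for strings that mention "List" but do not contain <html> or <body>."""
    # Assumption: the filter is called with a string of HTML, as the lambda
    # in the example suggests. This mirrors that lambda's logic exactly.
    if "List" not in html:
        return False
    return "<html>" not in html and "<body>" not in html


# Hypothetical sample inputs, used only to exercise the predicate.
samples = [
    "<ul><li>iFrame List Item 1</li></ul>",
    "<html><body><ul><li>List</li></ul></body></html>",
    "<p>No match here</p>",
]
kept = [s for s in samples if keep_list_fragments(s)]
print(kept)
```

Under the same assumption, the function could then be passed as filter_function=keep_list_fragments in place of the lambda.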
Download files
Source Distribution
multiiframes2df-0.10.tar.gz (5.6 kB)
Built Distribution
multiiframes2df-0.10-py3-none-any.whl (7.5 kB)
File details
Details for the file multiiframes2df-0.10.tar.gz.
File metadata
- Download URL: multiiframes2df-0.10.tar.gz
- Upload date:
- Size: 5.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | 9aa4811becaf973e2ddc665f46c6b897abf3c12828cd4915174116a4f0b29cb9
MD5 | 10db62ca570e0e6448f6ba5821414d28
BLAKE2b-256 | eb984f075cc55800d99a2e17b11dc3d81c105fba6e1ddd36b7297b484d5399a6
File details
Details for the file multiiframes2df-0.10-py3-none-any.whl.
File metadata
- Download URL: multiiframes2df-0.10-py3-none-any.whl
- Upload date:
- Size: 7.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | f6ce8bafa8595539ff42478cbfb61e05441a7edaae6ac8c5a5198011b6a51d0c
MD5 | f721efae92d6caef0d2afde4dafe0a06
BLAKE2b-256 | 6d90c2b2b2a80214864b04f3bacf5a58673807cde524c2514e9517ac2757de0c