A high-performance async web scraping and automation framework using Selenium.
selenium_swift
selenium_swift is a powerful Python package designed to accelerate and simplify web scraping tasks using Selenium. With a focus on speed, accuracy, and ease of use, selenium_swift offers advanced features that cater to both beginners and experienced developers.
Key Features
- Advanced Element Handling: Interact with web elements effortlessly using a high-level API. The Element class supports synchronous and asynchronous operations, making actions like clicking, sending keys, and capturing screenshots straightforward.
- Frame Management: The Frame class makes working with iframes easier by providing methods to switch and focus on specific frames, ensuring precise element interactions within complex page structures.
- Chrome Extension Integration: Use the ChromeExtension class to manage and interact with Chrome extensions directly within your scraping tasks.
- Flexible WebDriver Options: Configure WebDriver settings with the WebOption class, including headless mode, proxy settings, and custom profiles. Tailor your WebDriver to suit specific scraping needs.
- Automatic Driver Management: The WebService class handles WebDriver installations for Chrome, Firefox, and Edge browsers, leveraging webdriver-manager for seamless driver management.
- Asynchronous and Synchronous Support: Choose between async programming with asyncio or traditional synchronous methods to optimize performance and flexibility.
- User-Friendly API: Designed for simplicity and efficiency, selenium_swift abstracts complex Selenium operations, making web scraping accessible to beginners while offering powerful tools for advanced users.
Installation
Install selenium_swift from PyPI using pip:
pip install selenium-swift
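After installing, it can help to confirm that the package is importable before writing any scraping code. The following is a minimal sketch using only the Python standard library; the helper name `is_installed` is illustrative and not part of selenium_swift, and the import name `selenium_swift` is taken from the examples in this README.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return find_spec(module_name) is not None


if __name__ == "__main__":
    # "selenium_swift" is the import name used throughout this README.
    print("selenium_swift available:", is_installed("selenium_swift"))
```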
Usage Example
Example 1:
Explanation
This example demonstrates how to handle interactions with pages that open as a result of an event (such as a click or key press) using a custom browser class built on top of ChromeBrowser. The code showcases how to find elements on a page, trigger events to open new pages, and interact with the newly opened pages asynchronously.
Key Concepts:
- Browser Class: We define a class MyBrowser that extends ChromeBrowser to customize browser behavior.
- Async Tab Method: Methods that interact with browser tabs should be named with a tab prefix, which the framework recognizes as a tab interaction.
- Page Navigation: The example shows how to load a page, find specific elements (in this case, product thumbnails), and handle page transitions when an event (like a click) triggers the opening of a new page.
- Handling New Pages: After triggering an event that opens a new page, the script switches focus to the new page and interacts with its contents.
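To make the tab-prefix convention more concrete, here is a rough sketch of how a framework could discover such methods and run them concurrently. This is not selenium_swift's actual implementation: `TabRunner`, `discover_tabs`, `run_tabs`, and `DemoBrowser` are hypothetical names, and the tab bodies only sleep instead of driving a browser.

```python
import asyncio
import inspect


class TabRunner:
    """Hypothetical stand-in for a browser base class (not selenium_swift code)."""

    def discover_tabs(self):
        # Collect coroutine methods whose names start with "tab".
        # inspect.getmembers returns the members sorted by name.
        return [
            method
            for name, method in inspect.getmembers(self, inspect.iscoroutinefunction)
            if name.startswith("tab")
        ]

    async def run_tabs(self):
        # Run every discovered tab coroutine concurrently; results keep call order.
        return await asyncio.gather(*(method() for method in self.discover_tabs()))


class DemoBrowser(TabRunner):
    async def tab_1(self):
        await asyncio.sleep(0.01)  # placeholder for real page work
        return "tab_1 done"

    async def tab_2(self):
        await asyncio.sleep(0.01)
        return "tab_2 done"

    async def helper(self):
        # Ignored by discover_tabs because the name lacks the "tab" prefix.
        return "not a tab"


if __name__ == "__main__":
    print(asyncio.run(DemoBrowser().run_tabs()))
```

The point of the sketch is only the naming convention: anything whose name starts with tab is treated as a unit of concurrent work, while other methods remain ordinary helpers.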
Example Code
from selenium_swift.browser import *


class MyBrowser(ChromeBrowser):
    """
    MyBrowser extends ChromeBrowser to define custom interactions with web pages.
    This class demonstrates how to interact with elements on a page and handle events
    that open new browser tabs or windows.
    """

    def __init__(self) -> None:
        # Initialize the browser with Chrome-specific options and service
        super().__init__(ChromeOption(), ChromeService())

    async def tab_1(self):
        """
        This method opens a webpage and interacts with its elements. Specifically, it clicks on
        product thumbnails, which open new pages, and interacts with the newly opened page.
        """
        # Open the page at the specified URL
        page = await self.get('https://books.toscrape.com/')
        # Find all product elements on the page
        products = await page.find_elements('css_selector', '.thumbnail')
        print(f"Found {len(products)} products.")
        # Loop through each product, click it to open a new page, and interact with the new page
        for prd in products:
            # Click the product, which opens a new page
            prd.click()
            # Switch focus to the newly opened page
            infoPage = page.focus_to_new_page()
            # Find the rows in the table on the new page using a CSS selector
            table_rows = await infoPage.find_elements('css_selector', 'table[class*="table-stripe"] tr')
            print("********** Table Content **********")
            # Loop through the table rows and print their text content
            for row in table_rows:
                print(row.text)


if __name__ == "__main__":
    # Start the browser with an instance of MyBrowser
    Browser.startBrowsers([MyBrowser()])
Breakdown of the Code
- MyBrowser Class: This class inherits from ChromeBrowser. Inside, we define the tab_1 method to represent interactions in the first tab or window opened by the browser.
  - The super().__init__(ChromeOption(), ChromeService()) call ensures that the browser is initialized with default Chrome options and services.
- tab_1 Method:
  - This method loads the page https://books.toscrape.com/.
  - It finds all product elements on the page using the CSS selector .thumbnail.
  - For each product, the script clicks it, causing a new page to open.
  - Once the new page is opened, the script switches focus to that page using focus_to_new_page().
  - It then locates a table of data on the new page using a CSS selector, iterates through the rows, and prints the content of each row.
- Browser Startup: The if __name__ == "__main__": block ensures that the script runs the browser when executed. It calls Browser.startBrowsers([MyBrowser()]) to start the browser and execute the interactions defined in tab_1.
Key Considerations
- Async Interactions: The example utilizes async programming to handle potentially slow operations (like loading a page or finding elements) without blocking the main thread.
- Scalability: You can extend this by adding more tab methods (e.g., tab_2, tab_3) to handle different interactions or pages.
- Error Handling: In production environments, adding error handling (e.g., for timeouts or missing elements) is important for robustness.
This example is designed to demonstrate how to automate interaction with pages that open through events and how to interact with the newly opened page.
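As a rough illustration of the error-handling point, the sketch below wraps any awaitable call with a per-attempt timeout and a small retry loop, using only asyncio. The helper `retry_with_timeout` and the `flaky_lookup` stand-in are hypothetical; the exceptions that selenium_swift itself raises are not documented here, so the exception types to retry on are passed in by the caller.

```python
import asyncio


async def retry_with_timeout(make_call, attempts=3, timeout=5.0, delay=0.0,
                             retry_on=(asyncio.TimeoutError,)):
    """Await make_call() up to `attempts` times, bounding each try to `timeout` seconds."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            # wait_for cancels the call and raises a timeout error if it runs too long.
            return await asyncio.wait_for(make_call(), timeout=timeout)
        except retry_on as error:
            last_error = error
            if attempt < attempts - 1:
                await asyncio.sleep(delay)
    raise last_error


async def demo():
    calls = {"count": 0}

    async def flaky_lookup():
        # Stand-in for an element lookup: fails twice, then succeeds.
        calls["count"] += 1
        if calls["count"] < 3:
            raise RuntimeError("element not found yet")
        return ["row 1", "row 2"]

    rows = await retry_with_timeout(flaky_lookup, attempts=3, timeout=1.0,
                                    retry_on=(RuntimeError,))
    return rows, calls["count"]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a tab method, the callable passed in would typically be a small lambda or function returning the coroutine to retry, so that each attempt creates a fresh awaitable.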
Example 2:
Explanation
This example demonstrates how to interact with new pages opened by an event (e.g., a click) using an object-oriented approach. Instead of directly focusing on a new page using focus_to_new_page(), we create a PageInfo class that extends NextPage, a base class designed for handling pages that are opened from other pages.
Key Concepts:
- Page Class (PageInfo): This class inherits from NextPage and is used to represent and interact with pages that are opened by user interactions (like clicking on an element).
- Separation of Concerns: Each page interaction is encapsulated within its own class, making the code modular and easier to maintain.
- Async Page Interactions: The showData method in PageInfo asynchronously finds elements and displays their data, demonstrating how to interact with a newly opened page.
Example Code
from selenium_swift.browser import *  # Import base browser classes
from selenium_swift.web_option import ChromeOption  # Import Chrome options
from selenium_swift.web_service import ChromeService  # Import Chrome services


class PageInfo(NextPage):
    """
    PageInfo is a class that extends NextPage. It is used to handle the
    new page that opens after interacting with an element on the current page.
    This class encapsulates interactions with the new page.
    """

    def __init__(self) -> None:
        super().__init__()  # Initialize the NextPage base class

    async def showData(self):
        """
        This method finds table rows on the newly opened page and prints the content
        of each row. The data is located using a CSS selector.
        """
        # Locate the table rows using the CSS selector
        table_rows = await self.find_elements('css_selector', 'table[class*="table-stripe"] tr')
        # Print the content of each table row
        print("********** Table Content **********")
        for row in table_rows:
            print(row.text)


class MyBrowser(ChromeBrowser):
    """
    MyBrowser is a custom browser class that extends ChromeBrowser.
    It contains methods to interact with the main page and handle navigation
    to new pages.
    """

    def __init__(self) -> None:
        # Initialize ChromeBrowser with default Chrome options and services
        super().__init__(ChromeOption(), ChromeService())

    async def tab_1(self):
        """
        This method interacts with the first tab. It opens a webpage, locates product elements,
        and handles navigation to the new page when a product is clicked.
        """
        # Load the main page
        page = await self.get('https://books.toscrape.com/')
        # Find all product elements on the page
        products = await page.find_elements('css_selector', '.thumbnail')
        print(f"Found {len(products)} products.")
        # Loop through the products and handle interactions with the new page
        for prd in products:
            # Click the product, which opens a new page
            prd.click()
            # Create an instance of PageInfo to represent the new page
            # and interact with it using the showData method
            await PageInfo().showData()


if __name__ == "__main__":
    # Start the browser with an instance of MyBrowser and open the first tab
    Browser.startBrowsers([MyBrowser()])
Breakdown of the Code
- PageInfo Class:
  - This class extends NextPage, which is designed to represent a page that opens as a result of an interaction (like clicking on an element).
  - The method showData asynchronously finds table rows using a CSS selector and prints the content of each row.
- MyBrowser Class:
  - This class extends ChromeBrowser and defines the tab_1 method for interactions on the main page.
  - It opens the main page (https://books.toscrape.com/) and locates all product elements using the .thumbnail CSS selector.
  - When a product is clicked, a new page opens. Instead of focusing directly on the new page, an instance of PageInfo is created, and the showData method is called to interact with the new page.
- Browser Flow:
  - The browser starts by opening the main page, where it finds and clicks on product elements.
  - Each click opens a new page, which is handled by PageInfo. This class abstracts the interaction with the newly opened page, making the code cleaner and more modular.
- Object-Oriented Design:
  - By using a class (PageInfo) to represent the new page, you ensure that all interactions with that page are encapsulated in one place. This separation of concerns makes the code easier to maintain and extend.
  - The base class NextPage can be extended further if more features need to be added, and PageInfo can be customized for specific interactions with different pages.

This example shows how to manage page interactions using class inheritance, following best practices for code organization and readability.
Example 3:
This example shows how to use selenium_swift to scrape a web page. Follow these steps:
- Create your own Scrap class that extends the PageScrape class and defines an async def onResponse method that accepts your keyword arguments (**arg).
- Create a MyBrowser class that extends ChromeBrowser, FirefoxBrowser, or EdgeBrowser. Here, I use ChromeBrowser. You should create async methods whose names begin with "tab", e.g., tab_1, tab_2, etc. Each tab method will open a tab in your browser.
from selenium_swift.browser import *


class Scrap(PageScrape):
    async def onResponse(self, **arg):
        # Collect every quote on the page and print its text
        quote_elements = await self.find_elements('css_selector', '.text')
        for quote in quote_elements:
            print(quote.text)


class MyBrowser(ChromeBrowser):
    def __init__(self) -> None:
        super().__init__(ChromeOption(), ChromeService())

    # Each tab method crawls its own slice of pages 1 to 10
    async def tab_1(self):
        for i in range(1, 3):
            await Scrap(f'https://quotes.toscrape.com/page/{i}/').crawl(my_index=i)

    async def tab_2(self):
        for i in range(3, 6):
            await Scrap(f'https://quotes.toscrape.com/page/{i}/').crawl(my_index=i)

    async def tab_3(self):
        for i in range(6, 9):
            await Scrap(f'https://quotes.toscrape.com/page/{i}/').crawl(my_index=i)

    async def tab_4(self):
        for i in range(9, 11):
            await Scrap(f'https://quotes.toscrape.com/page/{i}/').crawl(my_index=i)


if __name__ == "__main__":
    Browser.startBrowsers([MyBrowser()])
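The page ranges in the four tab methods above are written by hand. A small helper can compute such a split instead. The sketch below is plain Python and independent of selenium_swift; `split_pages` is a hypothetical name, and its split (3, 3, 2, 2 pages) differs slightly from the hand-written one while covering the same pages 1 to 10.

```python
def split_pages(first: int, last: int, tabs: int) -> list[range]:
    """Split the inclusive page interval [first, last] into `tabs` contiguous ranges."""
    if tabs < 1:
        raise ValueError("tabs must be at least 1")
    total = last - first + 1
    base, extra = divmod(total, tabs)
    ranges, start = [], first
    for index in range(tabs):
        # The first `extra` tabs take one additional page each.
        size = base + (1 if index < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    for tab_number, pages in enumerate(split_pages(1, 10, 4), start=1):
        print(f"tab_{tab_number}: pages {list(pages)}")
```

Each resulting range could then drive the loop inside one tab method, in place of the literal range(...) calls.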
Example 4: Concurrent File Upload and Download
This example demonstrates how to concurrently upload and download files using the selenium_swift package with a custom browser class.
Step 1: Create the MyBrowser Class
In this step, we create a class that extends ChromeBrowser; it is named MyBrowser here, but any name works. The class contains two asynchronous methods: tab_download and tab_upload. Each method handles one task, downloading or uploading files, in its own browser tab.
from selenium_swift.browser import *


class MyBrowser(ChromeBrowser):
    def __init__(self) -> None:
        # Set the download directory
        self.path_download = r"c:\Users\progr\OneDrive\Bureau\test_download"
        option = ChromeOption('download.default_directory=' + self.path_download)
        super().__init__(option, ChromeService())
- Initialization: The __init__ method sets the download directory for downloaded files using the ChromeOption class. This ensures that all downloaded files will be saved to the specified path.
Step 2: Implement the tab_download Method
The tab_download method will navigate to a page that contains downloadable files. It will identify links to PDF files and initiate the download process.
    async def tab_download(self):
        # Navigate to the download page
        page = await self.get('https://the-internet.herokuapp.com/download')
        link_list = await page.find_elements('css_selector', 'a')
        # Iterate through the links and click on those that end with '.pdf'
        for link in link_list:
            if link.text.endswith('.pdf'):
                link.click()
        # Wait for the download to complete (put this statement at the end of the tab method)
        await page.wait_for_Download(self.path_download)
- File Download Logic: The method retrieves all links on the page and checks whether their text ends with the .pdf extension. If so, it clicks the link to start the download.
- Waiting for Downloads: The await page.wait_for_Download(self.path_download) statement ensures that the method waits until the download is completed before the browser closes all tabs.
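To give an intuition for what waiting on a download can involve, here is a minimal polling sketch. It is not how wait_for_Download is implemented (that is not documented here). It relies on the fact that Chrome keeps in-progress downloads under a `.crdownload` suffix; the function name `wait_until_downloads_finish` and the demo are hypothetical.

```python
import asyncio
import tempfile
from pathlib import Path


async def wait_until_downloads_finish(folder, poll_interval=0.5, timeout=60.0):
    """Return once `folder` holds no '*.crdownload' files, or raise TimeoutError."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while any(Path(folder).glob("*.crdownload")):
        if loop.time() >= deadline:
            raise TimeoutError(f"downloads still in progress in {folder}")
        # Yield to the event loop so other tabs keep running while we wait.
        await asyncio.sleep(poll_interval)


async def demo():
    with tempfile.TemporaryDirectory() as folder:
        partial = Path(folder) / "report.pdf.crdownload"
        partial.write_bytes(b"partial data")

        async def finish_download():
            # Simulate the browser completing the file after a short delay.
            await asyncio.sleep(0.05)
            partial.rename(Path(folder) / "report.pdf")

        finisher = asyncio.create_task(finish_download())
        await wait_until_downloads_finish(folder, poll_interval=0.01, timeout=2.0)
        await finisher
        return sorted(p.name for p in Path(folder).iterdir())


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

One limitation of this approach: if it is called before the browser has created the temporary file, it returns immediately, so a real implementation would need an additional check that the download has started.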
Step 3: Implement the tab_upload Method
The tab_upload method will navigate to a file upload page, locate the file input element, and upload a specified file.
    async def tab_upload(self):
        # Navigate to the upload page
        page = await self.get('https://the-internet.herokuapp.com/upload')
        # Locate the file input element and upload a file
        input_file = await page.find_element('id', "file-upload")
        input_file.send_file(r'c:\Users\progr\Downloads\DATA_Data_Analysis_2_AR.pdf')
        # Optional: wait for a brief period to ensure the file is uploaded
        await page.sleep(3)
- File Upload Logic: The method retrieves the file input element by its ID and uses the send_file method to upload a specified file from the local system.
- Sleep Function: The await page.sleep(3) statement pauses the execution for 3 seconds, allowing time for the file upload to complete. It’s important to use page.sleep() instead of time.sleep() in asynchronous code. Using time.sleep() will block the entire event loop, preventing other asynchronous tasks from running, which can lead to unresponsive behavior in your application. By using await page.sleep(), the event loop remains active, allowing other tasks to be executed concurrently while waiting.
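The difference between a blocking and a cooperative sleep can be measured with plain asyncio. The sketch below assumes that page.sleep() behaves like `asyncio.sleep()` (an assumption, since its implementation is not shown here) and compares two concurrent workers under each kind of sleep. The names `worker`, `blocking_worker`, and `timed` are illustrative.

```python
import asyncio
import time


async def worker(name, pause):
    # asyncio.sleep yields control, so other workers run while this one waits.
    await asyncio.sleep(pause)
    return name


async def blocking_worker(name, pause):
    # time.sleep holds the event loop; nothing else can run until it returns.
    time.sleep(pause)
    return name


async def timed(factory, pause=0.2):
    """Run two workers concurrently and return the elapsed wall-clock time."""
    start = time.perf_counter()
    await asyncio.gather(factory("a", pause), factory("b", pause))
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"asyncio.sleep: {asyncio.run(timed(worker)):.2f}s")
    print(f"time.sleep:    {asyncio.run(timed(blocking_worker)):.2f}s")
```

With the cooperative sleep the two waits overlap, so the total is close to one pause; with time.sleep they run back to back and the total is roughly two pauses.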
Step 4: Running the Browser
Finally, we will execute the MyBrowser class to start the browser and perform the file upload and download tasks concurrently.
if __name__ == "__main__":
Browser.startBrowsers([MyBrowser()])
Summary
This example showcases how to create a custom browser class using selenium_swift for handling file uploads and downloads. By organizing the functionality into methods, you can easily maintain and extend the capabilities of your web scraping tasks.
from selenium_swift.browser import *


class MyBrowser(ChromeBrowser):
    def __init__(self) -> None:
        # Set the download directory
        self.path_download = r"c:\Users\progr\OneDrive\Bureau\test_download"
        option = ChromeOption('download.default_directory=' + self.path_download)
        super().__init__(option, ChromeService())

    async def tab_download(self):
        # Navigate to the download page
        page = await self.get('https://the-internet.herokuapp.com/download')
        link_list = await page.find_elements('css_selector', 'a')
        # Iterate through the links and click on those that end with '.pdf'
        for link in link_list:
            if link.text.endswith('.pdf'):
                link.click()
        # Wait for the download to complete (put this statement at the end of the tab method)
        await page.wait_for_Download(self.path_download)

    async def tab_upload(self):
        # Navigate to the upload page
        page = await self.get('https://the-internet.herokuapp.com/upload')
        # Locate the file input element and upload a file
        input_file = await page.find_element('id', "file-upload")
        input_file.send_file(r'c:\Users\progr\Downloads\DATA_Data_Analysis_2_AR.pdf')
        # Optional: wait for a brief period to ensure the file is uploaded
        await page.sleep(3)
Example 5: Custom Page Handling in selenium_swift
This example demonstrates how to create custom page classes that extend the PageEvent class within the selenium_swift framework. This approach allows for modular and organized handling of web interactions, such as downloading and uploading files.
Overview
In this implementation, two separate pages are created:
- PageDownload: This class is designed for downloading files from a specific webpage.
- PageUpload: This class facilitates uploading files to a designated webpage.
You can create custom page classes to manage complex interactions, such as clicks, file uploads, mouse events, and other interactions.
Implementation
from selenium_swift.browser import *


# Define the PageDownload class to handle file downloads
class PageDownload(PageEvent):
    def __init__(self) -> None:
        super().__init__('https://the-internet.herokuapp.com/download')

    async def download_images(self):
        link_list = await self.find_elements('css_selector', 'a')
        for link in link_list:
            if link.text.endswith(('.png', '.jpg')):
                link.click()

    async def download_pdf(self):
        link_list = await self.find_elements('css_selector', 'a')
        for link in link_list:
            if link.text.endswith('.pdf'):
                link.click()

    async def download_text_files(self):
        link_list = await self.find_elements('css_selector', 'a')
        for link in link_list:
            if link.text.endswith('.txt'):
                link.click()


# Define the PageUpload class to handle file uploads
class PageUpload(PageEvent):
    def __init__(self) -> None:
        super().__init__('https://the-internet.herokuapp.com/upload')

    async def upload_image(self, image_path):
        input_file = await self.find_element('id', "file-upload")
        input_file.send_file(image_path)

    async def upload_pdf(self, pdf_path):
        input_file = await self.find_element('id', "file-upload")
        input_file.send_file(pdf_path)

    async def upload_text_file(self, text_file_path):
        input_file = await self.find_element('id', "file-upload")
        input_file.send_file(text_file_path)


# Define the MyBrowser1 class to manage download and upload actions
class MyBrowser1(ChromeBrowser):
    def __init__(self) -> None:
        self.path_download = r"c:\Users\progr\OneDrive\Bureau\test_download"
        option = ChromeOption('download.default_directory=' + self.path_download)
        super().__init__(option, ChromeService())

    async def tab_download(self):
        page_download = await PageDownload().open()
        await page_download.download_pdf()
        await page_download.download_images()
        await page_download.download_text_files()
        await page_download.wait_for_Download(self.path_download)

    async def tab_upload(self):
        page_upload = await PageUpload().open()
        await page_upload.upload_image(r"c:\Users\progr\Downloads\nature2.jpg")
        await page_upload.upload_pdf(r"c:\Users\progr\Downloads\DATA_Data_Analysis_2_AR.pdf")
        await page_upload.upload_text_file(r'd:\ascii.txt')
        await page_upload.sleep(3)


# Start the browser and run the download and upload tasks
if __name__ == "__main__":
    Browser.startBrowsers([MyBrowser1()])
Explanation
- Custom Page Classes:
  - PageDownload: This class encapsulates methods to download different file types. Each method fetches all links on the page and clicks on the ones that match the specified file extensions.
    - download_images(): Downloads image files with .png or .jpg extensions.
    - download_pdf(): Downloads files with a .pdf extension.
    - download_text_files(): Downloads files with a .txt extension.
  - PageUpload: This class provides methods to upload files. Each method allows for the upload of a specific file type.
    - upload_image(image_path): Uploads an image file.
    - upload_pdf(pdf_path): Uploads a PDF file.
    - upload_text_file(text_file_path): Uploads a text file.
- MyBrowser1 Class:
  - This class extends ChromeBrowser and manages two separate tabs for downloading and uploading files. The methods prefixed with tab_ signal to the browser that they will open a new tab.
  - tab_download(): Opens the download page and executes methods to download various file types, followed by waiting for the download to complete.
  - tab_upload(): Opens the upload page and executes methods to upload specified files. The sleep method is called to pause execution for a brief period, allowing the upload to complete.
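The three download methods differ only in the extensions they match, so the filtering step could be factored into one helper. The sketch below shows that idea in isolation. `FakeLink` is a stand-in that mimics only the `.text` attribute and `.click()` method used in the example, and `click_links_with_extensions` is a hypothetical helper, not part of selenium_swift.

```python
from dataclasses import dataclass


@dataclass
class FakeLink:
    """Stand-in for a page element exposing `.text` and `.click()`."""
    text: str
    clicked: bool = False

    def click(self):
        self.clicked = True


def click_links_with_extensions(links, extensions):
    """Click every link whose text ends with one of `extensions`; return the clicked names."""
    suffixes = tuple(extensions)  # str.endswith accepts a tuple of suffixes
    clicked = []
    for link in links:
        if link.text.endswith(suffixes):
            link.click()
            clicked.append(link.text)
    return clicked


if __name__ == "__main__":
    links = [FakeLink("a.pdf"), FakeLink("b.png"), FakeLink("c.txt"), FakeLink("d.jpg")]
    print(click_links_with_extensions(links, (".png", ".jpg")))
```

Inside a page class, each download method could then fetch the links once and delegate to such a helper with its own extension list. Note that the match is case-sensitive, as in the original methods.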
Conclusion
By extending the PageEvent class, you can create specialized page handling classes that streamline file download and upload processes, making your web scraping tasks more efficient and organized. This structure also enhances readability and maintainability of your code.
Download files
File details
Details for the file selenium_swift-0.1.1.tar.gz.
File metadata
- Download URL: selenium_swift-0.1.1.tar.gz
- Upload date:
- Size: 34.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | df2c4fc2993e63ff2d7edf28564666ef3babc244c6e49d9daaa90e61cc438050 |
| MD5 | d34b272ce7527d4064ad4ce114572ac7 |
| BLAKE2b-256 | a1b717e73621fcaa628177951f51387519df9cf25a70dbe6300c66b8b966b913 |
File details
Details for the file selenium_swift-0.1.1-py3-none-any.whl.
File metadata
- Download URL: selenium_swift-0.1.1-py3-none-any.whl
- Upload date:
- Size: 30.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2df95a45c782c52b1aa33ce3794ed62c3f7e34717490aa6ee28a6e31da7f9320 |
| MD5 | 0b54283341dd42876ca7485f82eeb18f |
| BLAKE2b-256 | 799d11df22213b780f67f40fded1ef3eccaa5223611830c7c6a3a2a78fb5df94 |