🤖 Botasaurus 🤖
The web has evolved. Finally, web scraping has too.
The All in 1 Web Scraping Framework
In a nutshell
Botasaurus is an all in 1 web scraping framework built for the modern web. We address the key pain points web scrapers face when scraping the web.
Our aim is to make web scraping extremely easy and save you hours of development time.
Features
Botasaurus comes fully baked, with batteries included. Here is a list of things it can do that no other web scraping framework can:
- Anti Detect: Make Anti Detect Requests and Selenium Visits.
- SSL Support for Authenticated Proxies: We are the first and only Python web scraping framework to offer SSL support for authenticated proxies. No other web scraping library, be it cloudscraper, seleniumwire, or playwright, provides this feature, enabling you to easily bypass Cloudflare detection when using authenticated proxies.
- Data Cleaners: Clean data scraped from the website with ease.
- Debuggability: When a crash occurs due to an incorrect selector, etc., Botasaurus pauses the browser instead of closing it, facilitating painless on-the-spot debugging.
- Caching: Botasaurus allows you to cache web scraping results, ensuring lightning-fast performance on subsequent scrapes.
- Easy Configuration: Easily save time with parallelization, profile, and proxy configuration.
- Time-Saving Selenium Shortcuts: Botasaurus comes with numerous Selenium shortcuts to make web scraping incredibly easy.
🚀 Getting Started with Botasaurus
Welcome to Botasaurus! Let’s dive right in with a straightforward example to understand how it works.
In this tutorial, we will go through the steps to scrape the heading text from https://www.omkar.cloud/.
Step 1: Install Botasaurus
First things first, you need to install Botasaurus. Run the following command in your terminal:
python -m pip install botasaurus
Step 2: Set Up Your Botasaurus Project
Next, let’s set up the project:
- Create a directory for your Botasaurus project and navigate into it:
mkdir my-botasaurus-project
cd my-botasaurus-project
code . # This will open the project in VSCode if you have it installed
Step 3: Write the Scraping Code
Now, create a Python script named main.py in your project directory and insert the following code:
from botasaurus import *

@browser
def scrape_heading_task(driver: AntiDetectDriver, data):
    # Navigate to the Omkar Cloud website
    driver.get("https://www.omkar.cloud/")

    # Retrieve the heading element's text
    heading = driver.text("h1")

    # Save the data as a JSON file in output/scrape_heading_task.json
    return {
        "heading": heading
    }

if __name__ == "__main__":
    # Initiate the web scraping task
    scrape_heading_task()
Let’s dissect this code:
- We define a custom scraping task, scrape_heading_task, decorated with @browser:
@browser
def scrape_heading_task(driver: AntiDetectDriver, data):
- Botasaurus automatically provides an Anti Detection Selenium driver to our function:
def scrape_heading_task(driver: AntiDetectDriver, data):
- Inside the function, we:
  - Navigate to Omkar Cloud
  - Extract the heading text
  - Return the data to be automatically saved as scrape_heading_task.json by Botasaurus:
driver.get("https://www.omkar.cloud/")
heading = driver.text("h1")
return {"heading": heading}
- Finally, we initiate the scraping task:
if __name__ == "__main__":
scrape_heading_task()
Step 4: Run the Scraping Task
Time to run your bot:
python main.py
After executing the script, it will:
- Launch Google Chrome
- Navigate to omkar.cloud
- Extract the heading text
- Save it automatically as output/scrape_heading_task.json.
Now, let's explore another way to scrape the heading using the request module. Replace the previous code in main.py with the following:
from botasaurus import *

@request
def scrape_heading_task(request: AntiDetectRequests, data):
    # Navigate to the Omkar Cloud website
    soup = request.bs4("https://www.omkar.cloud/")

    # Retrieve the heading element's text
    heading = soup.find('h1').get_text()

    # Save the data as a JSON file in output/scrape_heading_task.json
    return {
        "heading": heading
    }

if __name__ == "__main__":
    # Initiate the web scraping task
    scrape_heading_task()
In this code:
- We are using the BeautifulSoup (bs4) module to parse and scrape the heading.
- The request object provided is not a standard Python requests object but an Anti Detect request object, which also preserves cookies.
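If you prefer working with the raw response rather than a parsed soup, the same object can also be used like a requests-style session. The sketch below is an assumption-laden illustration: it assumes AntiDetectRequests exposes the usual requests-style get() method (it is built on the cloudscraper/requests stack, as noted in the Thanks section):

from botasaurus import *

@request
def check_status_task(request: AntiDetectRequests, data):
    # Assumption: a requests-style .get() is available on this object.
    # Cookies set by the site are preserved on the same object across calls.
    response = request.get("https://www.omkar.cloud/")
    return {"status": response.status_code}

if __name__ == "__main__":
    check_status_task()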
Step 5: Run the Scraping Task (Using Anti Detect Requests)
Finally, run the bot again:
python main.py
This time, you will observe the same result as before, but instead of using Anti Detect Selenium, we are utilizing the Anti Detect request module.
Note: If you don't have Python installed, you can run Botasaurus in Gitpod, a browser-based development environment, by following this section.
💡 Understanding Botasaurus
The power of bots is immense. A bot can:
- Apply on your behalf to LinkedIn jobs, 24 hours a day
- Scrape the phone numbers of thousands of businesses from Google Maps to sell your products to
- Mass message people on Twitter/LinkedIn/Reddit to promote your product
- Sign up hundreds of accounts on MailChimp to send 50,000 (500 emails * 100) emails per month, all for free
Let's learn the features of Botasaurus that help you unlock these superpowers.
Could you show me an example where you defeat Cloudflare?
Sure, run the following Python code to scrape G2.com, a website protected by Cloudflare:
from botasaurus import *

@browser()
def scrape_heading_task(driver: AntiDetectDriver, data):
    driver.google_get("https://www.g2.com/products/github/reviews")
    heading = driver.text('h1')
    print(heading)
    return heading

scrape_heading_task()
After running this script, you'll notice that the G2 page opens successfully, and the code prints the page's heading.
How to Scrape Multiple Data Points/Links?
To scrape multiple data points or links, define the data parameter and provide a list of items to be scraped:
@browser(data=["https://www.omkar.cloud/", "https://www.omkar.cloud/blog/", "https://stackoverflow.com/"])
def scrape_heading_task(driver: AntiDetectDriver, data):
    # ...
Botasaurus will launch a new browser instance for each item in the list and merge and store the results in scrape_heading_task.json at the end of the scraping.
Please note that the data parameter can also handle items such as dictionaries.
For instance, if you're automating the sign-up process for bot accounts on a website, you can pass dictionaries to it like so:
@browser(data=[{"name": "Mahendra Singh Dhoni", ...}, {"name": "Virender Sehwag", ...}])
def scrape_heading_task(driver: AntiDetectDriver, data: dict):
    # ...
How to Scrape in Parallel?
To scrape data in parallel, set the parallel option in the browser decorator:
@browser(parallel=3, data=["https://www.omkar.cloud/", ...])
def scrape_heading_task(driver: AntiDetectDriver, data):
    # ...
How to Know How Many Scrapers to Run in Parallel?
To determine the optimal number of parallel scrapers, pass the bt.calc_max_parallel_browsers function, which calculates the maximum number of browsers that can be run in parallel based on the available RAM:
@browser(parallel=bt.calc_max_parallel_browsers, data=["https://www.omkar.cloud/", ...])
def scrape_heading_task(driver: AntiDetectDriver, data):
    # ...
Example: If you have 5.8 GB of free RAM, bt.calc_max_parallel_browsers would return 10, indicating you can run up to 10 browsers in parallel.
How to Cache the Web Scraping Results?
To cache web scraping results and avoid re-scraping the same data, set cache=True in the decorator:
@browser(cache=True, data=["https://www.omkar.cloud/", ...])
def scrape_heading_task(driver: AntiDetectDriver, data):
    # ...
How Does Botasaurus Help Me in Debugging?
Botasaurus enhances the debugging experience by pausing the browser instead of closing it when an error occurs. This allows you to inspect the page and understand what went wrong, which can be especially helpful in debugging and removing the hassle of reproducing edge cases.
Botasaurus also plays a beep sound to alert you when an error occurs.
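For instance, in the minimal sketch below the task raises an exception on purpose (standing in for a broken selector). Instead of closing Chrome, Botasaurus pauses with the page still open so you can inspect it:

from botasaurus import *

@browser()
def scrape_heading_task(driver: AntiDetectDriver, data):
    driver.get("https://www.omkar.cloud/")
    # Simulate a scraper bug (e.g., a wrong selector). Any uncaught exception here
    # makes Botasaurus pause the browser and play a beep instead of closing it.
    raise Exception("Simulated crash: selector not found")

scrape_heading_task()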
How to Block Resources like CSS, Images, and Fonts to Save Bandwidth?
Blocking resources such as CSS, images, and fonts can significantly speed up your web scraping tasks, reduce bandwidth usage, and save money spent on proxies.
For example, a page that originally takes 4 seconds and 12 MB to load might take only one second and 100 KB to load after CSS, images, etc. have been blocked.
To block these resources, simply use the block_resources parameter. For example:
@browser(block_resources=True) # Blocks ['.css', '.jpg', '.jpeg', '.png', '.svg', '.gif', '.woff', '.pdf', '.zip']
def scrape_heading_task(driver: AntiDetectDriver, data):
    driver.get("https://www.omkar.cloud/")
    driver.prompt()
scrape_heading_task()
If you wish to block only images and fonts, while allowing CSS files, you can set block_images like this:
@browser(block_images=True) # Blocks ['.jpg', '.jpeg', '.png', '.svg', '.gif', '.woff', '.pdf', '.zip']
def scrape_heading_task(driver: AntiDetectDriver, data):
    driver.get("https://www.omkar.cloud/")
    driver.prompt()
scrape_heading_task()
To block a specific set of resources, such as only JavaScript, CSS, fonts, etc., specify them in the following manner:
@browser(block_resources=['.js', '.css', '.jpg', '.jpeg', '.png', '.svg', '.gif', '.woff', '.pdf', '.zip'])
def scrape_heading_task(driver: AntiDetectDriver, data):
    driver.get("https://www.omkar.cloud/")
    driver.prompt()
scrape_heading_task()
How to Configure UserAgent, Proxy, Chrome Profile, Headless, etc.?
To configure various settings such as UserAgent, Proxy, Chrome Profile, and headless mode, you can specify them in the decorator as shown below:
@browser(
    headless=True,
    profile='my-profile',
    proxy="http://your_proxy_address:your_proxy_port",
    user_agent=bt.UserAgents.user_agent_106
)
def scrape_heading_task(driver: AntiDetectDriver, data):
    # ...
You can also pass additional parameters when calling the scraping function, as demonstrated below:
@browser()
def scrape_heading_task(driver: AntiDetectDriver, data):
    # ...

data = "https://www.omkar.cloud/"
scrape_heading_task(
    data,
    headless=True,
    profile='my-profile',
    proxy="http://your_proxy_address:your_proxy_port",
    user_agent=bt.UserAgents.user_agent_106
)
Furthermore, it's possible to define functions that dynamically set these parameters based on the data item. For instance, to set the profile dynamically according to the data item, you can use the following approach:
@browser(profile=lambda data: data["profile"], headless=True, proxy="http://your_proxy_address:your_proxy_port", user_agent=bt.UserAgents.user_agent_106)
def scrape_heading_task(driver: AntiDetectDriver, data):
    # ...
data = {"link": "https://www.omkar.cloud/", "profile": "my-profile"}
scrape_heading_task(data)
Additionally, if you need to pass metadata that is common across all data items, such as an API key, you can do so by adding it as a metadata parameter. For example:
@browser()
def scrape_heading_task(driver: AntiDetectDriver, data, metadata):
print("metadata:", metadata)
print("data:", data)
data = {"link": "https://www.omkar.cloud/", "profile": "my-profile"}
scrape_heading_task(
    data,
    metadata={"api_key": "BDEC26..."}
)
Do you support SSL for Authenticated Proxies?
Yes, we are the first Python Library to support SSL for authenticated proxies. Proxy providers like BrightData, IPRoyal, and others typically provide authenticated proxies in the format "http://username:password@proxy-provider-domain:port". For example, "http://greyninja:awesomepassword@geo.iproyal.com:12321".
However, if you use an authenticated proxy with a library like seleniumwire to scrape a Cloudflare protected website like G2.com, you will surely be blocked because you are using a non-SSL connection.
To verify this, run the following code:
First, install the necessary packages:
python -m pip install selenium_wire chromedriver_autoinstaller
Then, execute this Python script:
from seleniumwire import webdriver
from chromedriver_autoinstaller import install
# Define the proxy
proxy_options = {
    'proxy': {
        'http': 'http://username:password@proxy-provider-domain:port',  # TODO: Replace with your own proxy
        'https': 'http://username:password@proxy-provider-domain:port', # TODO: Replace with your own proxy
    }
}
# Install and set up the driver
driver_path = install()
driver = webdriver.Chrome(driver_path, seleniumwire_options=proxy_options)
# Navigate to the desired URL
link = 'https://www.g2.com/products/github/reviews'
driver.get("https://www.google.com/")
driver.execute_script(f'window.location.href = "{link}"')
# Prompt for user input
input("Press Enter to exit...")
# Clean up
driver.quit()
You will definitely encounter a block by Cloudflare:
However, using proxies with Botasaurus prevents this issue. See the difference by running the following code:
from botasaurus import *
@browser(proxy="http://username:password@proxy-provider-domain:port") # TODO: Replace with your own proxy
def scrape_heading_task(driver: AntiDetectDriver, data):
    driver.google_get("https://www.g2.com/products/github/reviews")
    driver.prompt()
scrape_heading_task()
Result:
NOTE: To run the code above, you will need Node.js installed.
I want to scrape a large number of links, and a new Selenium driver is created for each link, which increases the time to scrape data. How can I reuse drivers?
Utilize the reuse_driver option to reuse drivers, reducing the time required for data scraping:
@browser(reuse_driver=True)
def scrape_heading_task(driver: AntiDetectDriver, data):
    # ...
Could you show me a practical example where all these Botasaurus Features Come Together to accomplish a typical web scraping project?
Below is a practical example of how Botasaurus features come together in a typical web scraping project to scrape a list of links from a blog, and then visit each link to retrieve the article's heading and date:
from botasaurus import *
@browser(block_resources=True,
         cache=True,
         parallel=bt.calc_max_parallel_browsers,
         reuse_driver=True)
def scrape_articles(driver: AntiDetectDriver, link):
    driver.get(link)

    heading = driver.text("h1")
    date = driver.text("time")

    return {
        "heading": heading,
        "date": date,
        "link": link,
    }

@browser(block_resources=True, cache=True)
def scrape_article_links(driver: AntiDetectDriver, data):
    # Visit the Omkar Cloud website
    driver.get("https://www.omkar.cloud/blog/")

    links = driver.links("h3 a")

    return links

if __name__ == "__main__":
    # Launch the web scraping task
    links = scrape_article_links()
    scrape_articles(links)
How to Clean Data?
Botasaurus provides a module named cl that includes commonly used cleaning functions to save development time. Here are some of the most important ones:
- cl.select

  What document.querySelector is to JavaScript, cl.select is to JSON. This is the most used function in Botasaurus, and it is incredibly useful for safely selecting data from nested JSON.
Instead of using flaky code like this:
from botasaurus import cl

data = {
    "person": {
        "data": {
            "name": "Omkar",
            "age": 21
        }
    },
    "data": {
        "friends": [
            {"name": "John", "age": 21},
            {"name": "Jane", "age": 21},
            {"name": "Bob", "age": 21}
        ]
    }
}

name = data.get('person', {}).get('data', {}).get('name', None)
if name:
    name = name.upper()
else:
    name = None

print(name)
You can write it as:
from botasaurus import cl

data = {
    "person": {
        "data": {
            "name": "Omkar",
            "age": 21
        }
    },
    "data": {
        "friends": [
            {"name": "John", "age": 21},
            {"name": "Jane", "age": 21},
            {"name": "Bob", "age": 21}
        ]
    }
}

print(cl.select(data, 'name', map_data=lambda x: x.upper()))
cl.select returns None if the key is not found, instead of throwing an error.

from botasaurus import cl

print(cl.select(data, 'name'))                            # Omkar
print(cl.select(data, 'friends', 0, 'name'))              # John
print(cl.select(data, 'friends', 0, 'non_existing_key'))  # None
You can also use map_data like this:

from botasaurus import cl

cl.select(data, 'name', map_data=lambda x: x.upper())  # OMKAR
And use default values like this:
from botasaurus import cl

cl.select(None, 'name', default="OMKAR")  # OMKAR
- cl.extract_numbers

from botasaurus import cl

print(cl.extract_numbers("I can extract numbers with decimal like 4.5, or with comma like 1,000."))  # [4.5, 1000]
- More Functions

from botasaurus import cl

print(cl.extract_links("I can extract links like https://www.omkar.cloud/ or https://www.omkar.cloud/blog/"))  # ['https://www.omkar.cloud/', 'https://www.omkar.cloud/blog/']
print(cl.rename_keys({"name": "Omkar", "age": 21}, {"name": "full_name"}))  # {"full_name": "Omkar", "age": 21}
print(cl.sort_object_by_keys({"age": 21, "name": "Omkar"}, "name"))  # {"name": "Omkar", "age": 21}
print(cl.extract_from_dict([{"name": "John", "age": 21}, {"name": "Jane", "age": 21}, {"name": "Bob", "age": 21}], "name"))  # ["John", "Jane", "Bob"]
# ... And many more
How to Read/Write JSON and CSV Files?
Botasaurus provides convenient methods for reading and writing data:
from botasaurus import *

# Data to write to the file
data = [
{"name": "John Doe", "age": 42},
{"name": "Jane Smith", "age": 27},
{"name": "Bob Johnson", "age": 35}
]
# Write the data to the file "data.json"
bt.write_json(data, "data.json")
# Read the contents of the file "data.json"
print(bt.read_json("data.json"))
# Write the data to the file "data.csv"
bt.write_csv(data, "data.csv")
# Read the contents of the file "data.csv"
print(bt.read_csv("data.csv"))
How Can I Pause the Scraper to Inspect Things While Developing?
To pause the scraper and wait for user input before proceeding, use bt.prompt():
bt.prompt()
How Does AntiDetectDriver Facilitate Easier Web Scraping?
AntiDetectDriver is a patched Version of Selenium that has been modified to avoid detection by bot protection services such as Cloudflare.
It also includes a variety of helper functions that make web scraping tasks easier.
You can learn about these methods here.
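As a quick illustration, here is a small sketch using a few of the helper methods that appear throughout this README (text, links, and prompt) together with plain navigation:

from botasaurus import *

@browser()
def explore_helpers(driver: AntiDetectDriver, data):
    driver.get("https://www.omkar.cloud/blog/")
    heading = driver.text("h1")   # text of the first element matching the CSS selector
    links = driver.links("h3 a")  # href attributes of the matching anchor elements
    driver.prompt()               # pause so you can inspect the page manually
    return {"heading": heading, "links": links}

explore_helpers()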
What Features Does @request Support, Similar to @browser?
Similar to @browser, @request supports features like
- asynchronous execution [Will Learn Later]
- parallel processing
- caching
- user-agent customization
- proxies, etc.
Below is an example that showcases these features:
@request(parallel=40, cache=True, proxy="http://your_proxy_address:your_proxy_port", data=["https://www.omkar.cloud/", ...])
def scrape_heading_task(request: AntiDetectRequests, link):
    soup = request.bs4(link)
    heading = soup.find('h1').get_text()
    return {"heading": heading}
I have an existing project into which I want to integrate the AntiDetectDriver/AntiDetectRequests.
You can create an instance of AntiDetectDriver as follows:
driver = bt.create_driver()
# ... Code for scraping
driver.quit()
You can create an instance of AntiDetectRequests as follows:
anti_detect_request = bt.create_request()
soup = anti_detect_request.bs4("https://www.omkar.cloud/")
# ... Additional code
Sign Up Bots
Sometimes, when scraping the web, data is hidden behind an authentication wall, requiring you to sign up via email or Google to access it.
Now, let's explore how to use Botasaurus utilities that empower us to create hundreds of accounts on a website. With hundreds of bots at your command:
- You can mass message thousands of people on platforms like Twitter, LinkedIn, or Reddit to promote your product.
- Platforms like MailChimp offer free plans with limited usage. With bots, you can maximize the benefits of these plans. For instance, if MailChimp allows you to send 500 emails per month for free, with 100 bots, you can send 50,000 emails monthly.
How to Generate Human-Like User Data?
To create human-like user data, use the generate_user function:
user = bt.generate_user(country=bt.Country.IN)
This generates a realistic user profile.
The data generated is very realistic, reducing the likelihood of being flagged as a bot.
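The generated profile can be used like a regular Python dictionary. Here is a minimal sketch; only the email key is shown elsewhere in this README, so treat the other keys of the profile as varying:

from botasaurus import *

user = bt.generate_user(country=bt.Country.IN)

# The "email" key is used later in this README for email verification,
# e.g. madhumitachavare@1974.icznn.com; other profile keys may differ.
print(user["email"])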
The Target Website Has Sent a Verification Email: How to Get the Link and Verify It?
To get the verification link from an email and then delete the mailbox, use bt.TempMail.get_email_link_and_delete_mailbox as shown below:
user = bt.generate_user(country=bt.Country.IN)
email = user["email"] # Example: madhumitachavare@1974.icznn.com
link = bt.TempMail.get_email_link_and_delete_mailbox(email) # Retrieves the Verification Link and Deletes the Mailbox
driver.get(link)
I have automated the creation of user accounts. Now I want to store the account credentials, like email and password. How do I store them?
To store user-related data, such as account credentials, use the ProfileManager module:
bt.Profile.set_profile(user)
In cases where you want to store metadata related to a user, such as API keys:
bt.Profile.set_item("api_key", "BDEC26...")
To retrieve a list of all users, use bt.Profile.get_all_profiles():
profiles = bt.Profile.get_all_profiles()
The Chrome profiles of users are getting very large (around 100 MB each). Is there a way to compress them?
You can use the tiny_profile feature of Botasaurus, which is a lightweight replacement for Chrome profiles.
Each Tiny Profile only stores cookies from visited websites, making them extremely lightweight—around 1KB. Here's how to use them:
@browser(
    tiny_profile=True,
    profile='my-profile',
)
def sign_up_task(driver: AntiDetectDriver, data):
    # Your sign-up code here
How to Dynamically Specify the Profile Based on a Data Item?
You can dynamically select a profile by passing a function to the profile option, which will receive the data item:
def get_profile(data):
    return data["username"]

@browser(
    data=[{"username": "mahendra-singh-dhoni", ...}, {"username": "virender-sehwag", ...}],
    profile=get_profile,
)
def sign_up_task(driver: AntiDetectDriver, data):
    # Your sign-up code here
user_agent, proxy, and other options can also be passed as functions.
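For example, here is a sketch where the proxy is also derived from each data item and the user agent is chosen per item; the "proxy" key used here is purely illustrative, not a required field:

from botasaurus import *

def get_proxy(data):
    return data["proxy"]  # illustrative key, supplied in each data item below

@browser(
    data=[
        {"username": "mahendra-singh-dhoni", "proxy": "http://your_proxy_address:your_proxy_port"},
        {"username": "virender-sehwag", "proxy": "http://your_proxy_address:your_proxy_port"},
    ],
    profile=lambda data: data["username"],
    proxy=get_proxy,
    user_agent=lambda data: bt.UserAgents.user_agent_106,
)
def sign_up_task(driver: AntiDetectDriver, data):
    driver.get("https://www.omkar.cloud/")
    # ... your sign-up steps for this profile go here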
Is there a tutorial that integrates tiny_profile, temp mail, user generator, and profile to sign up on a website and perform actions on it, so I can get a complete picture?
For a comprehensive guide on using Botasaurus features such as tiny_profile, temp_mail, user_generator, and profile to sign up on a website and perform actions, read the Sign-Up Tutorial Here.
This tutorial will walk you through signing up for 3 accounts on Omkar Cloud and give you a complete understanding of the process.
How to Run Botasaurus in Docker?
To run Botasaurus in Docker, use the Botasaurus Starter Template, which includes the necessary Dockerfile and Docker Compose configurations:
git clone https://github.com/omkarcloud/botasaurus-starter my-botasaurus-project
cd my-botasaurus-project
docker-compose build && docker-compose up
How to Run Botasaurus in Gitpod?
The Botasaurus Starter Template comes with the necessary .gitpod.yml to easily run it in Gitpod, a browser-based development environment. Set it up in just 5 minutes by following these steps:
- Open the Botasaurus Starter Template by visiting this link and sign up using your GitHub account.
- In the terminal, run the following command to start scraping:
python main.py
Advanced Features
How Do I Configure the Output of My Scraping Function in Botasaurus?
To configure the output of your scraping function in Botasaurus, you can customize the behavior in several ways:
- Change Output Filename: Use the output parameter in the decorator to specify a custom filename for the output.

@browser(output="my-output")
def scrape_heading_task(driver: AntiDetectDriver, data):
    # Your scraping logic here
- Disable Output: If you don't want any output to be saved, set output to None.

@browser(output=None)
def scrape_heading_task(driver: AntiDetectDriver, data):
    # Your scraping logic here
- Dynamically Write Output: To dynamically write output based on data and result, pass a function to the output parameter:

def write_output(data, result):
    bt.write_json(result, 'data')
    bt.write_csv(result, 'data')

@browser(output=write_output)
def scrape_heading_task(driver: AntiDetectDriver, data):
    # Your scraping logic here
- Save Outputs in Multiple Formats: Use the output_formats parameter to save outputs in different formats like CSV and JSON.

@browser(output_formats=[bt.Formats.CSV, bt.Formats.JSON])
def scrape_heading_task(driver: AntiDetectDriver, data):
    # Your scraping logic here
These options provide flexibility in how you handle the output of your scraping tasks with Botasaurus.
How to Run Drivers Asynchronously from the Main Process?
To execute drivers asynchronously, enable the run_async option and use .get() when you're ready to collect the results:
from time import sleep
@browser(
    run_async=True,  # Specify the async option here
)
def scrape_heading(driver: AntiDetectDriver, data):
    print("Sleeping for 5 seconds.")
    sleep(5)
    print("Slept for 5 seconds.")
    return {}

if __name__ == "__main__":
    # Launch web scraping tasks asynchronously
    result1 = scrape_heading()  # Launches asynchronously
    result2 = scrape_heading()  # Launches asynchronously

    result1.get()  # Wait for the first result
    result2.get()  # Wait for the second result
With this method, function calls run concurrently. The output will indicate that both function calls are executing in parallel.
How to Asynchronously Add Multiple Items and Get Results?
The async_queue feature allows you to perform web scraping tasks asynchronously in a queue, without waiting for each task to complete before starting the next one. To gather your results, simply use the .get() method once all tasks are in the queue.
Basic Example:
from time import sleep
from botasaurus import *

@browser(async_queue=True)
def scrape_data(driver: AntiDetectDriver, data):
    print("Starting a task.")
    sleep(1)  # Simulate a delay, e.g., waiting for a page to load
    print("Task completed.")
    return data

if __name__ == "__main__":
    # Start scraping tasks without waiting for each to finish
    async_queue = scrape_data()  # Initializes the queue

    # Add tasks to the queue
    async_queue.put([1])
    async_queue.put(2)
    async_queue.put([3, 4])

    # Retrieve results when ready
    results = async_queue.get()  # Expects to receive: [1, 2, 3, 4]
Practical Application for Web Scraping:
Here's how you could use async_queue to scrape webpage titles while scrolling through a list of links:
from botasaurus import *

@browser(async_queue=True)
def scrape_title(driver: AntiDetectDriver, link):
    driver.get(link)     # Navigate to the link
    return driver.title  # Scrape the title of the webpage

@browser()
def scrape_all_titles(driver: AntiDetectDriver):
    # ... Your code to visit the initial page ...

    title_queue = scrape_title()  # Initialize the asynchronous queue

    while not end_of_page_detected(driver):  # Replace with your end-of-list condition
        title_queue.put(driver.links('a'))   # Add each link to the queue
        driver.scroll(".scrollable-element")

    return title_queue.get()  # Get all the scraped titles at once

if __name__ == "__main__":
    all_titles = scrape_all_titles()  # Call the function to start the scraping process
Note: The async_queue will only invoke the scraping function for unique links, avoiding redundant operations and keeping the main function (scrape_all_titles) cleaner.
I want to repeatedly call the scraping function without creating new Selenium drivers each time. How can I achieve this?
Utilize the keep_drivers_alive option to maintain active driver sessions. Remember to call .close() when you're finished to release resources:
@browser(
    keep_drivers_alive=True,
    parallel=bt.calc_max_parallel_browsers,  # Typically used with `keep_drivers_alive`
    reuse_driver=True,  # Also commonly paired with `keep_drivers_alive`
)
def scrape_data(driver: AntiDetectDriver, data):
    # ... (Your scraping logic here)

if __name__ == "__main__":
    for i in range(3):
        scrape_data()
    # After completing all scraping tasks, call .close() to close the drivers.
    scrape_data.close()
How do I manage the Cache in Botasaurus?
You can use the Cache module in Botasaurus to easily manage cached data. Here's a simple example explaining its usage:
from botasaurus import *
from botasaurus.cache import Cache
# Example scraping function
@request
def scrape_data(data):
    # Your scraping logic here
    return {"processed": data}
# Sample data for scraping
input_data = {"key": "value"}
# Adding data to the cache
Cache.put(scrape_data, input_data, scrape_data(input_data))
# Checking if data is in the cache
if Cache.has(scrape_data, input_data):
    # Retrieving data from the cache
    cached_data = Cache.get(scrape_data, input_data)
# Removing specific data from the cache
Cache.remove(scrape_data, input_data)
# Clearing the complete cache for the scrape_data function
Cache.clear(scrape_data)
Any Tips for Scraping Cloudflare-Protected Websites?
- To scrape Cloudflare-protected sites, you need to use a browser; the request module won't work, as it gets detected and results in a 403 error.
- Use the google_get method to scrape the target website.
- For large-scale scraping, opt for datacenter proxies over residential ones, as residential proxies are really expensive. You will sometimes get blocked, so use retries as demonstrated in the code below:
from botasaurus import *
@browser(
proxy="http://username:password@datacenter-proxy-domain:12321",
max_retry=5, # A reliable default for most situations
block_resources=True, # Enhances efficiency and cost-effectiveness
)
def scrape_heading_task(driver: AntiDetectDriver, data):
driver.google_get("https://www.g2.com/products/github/reviews")
if driver.is_bot_detected():
raise Exception("Bot detected")
heading = driver.text('h1')
print(heading)
return heading
How Do I Close All Running Chrome Instances When Developing with Botasaurus?
While developing a scraper, you might need to interrupt the scraping process, often done by pressing Ctrl + C. However, this action does not automatically close the Chrome browsers, which can cause your computer to hang due to resource overuse.
To prevent your PC from hanging, you need to close all running Chrome instances.
You can run the following command to close all Chrome instances:
python -m botasaurus.close
Executing the above command will close all Chrome instances, thereby helping to prevent your PC from hanging.
What Features Are Coming in Botasaurus 4?
Botasaurus 4, which is currently in its beta phase, offers:
- An SMS API to receive OTPs
- The ability to run bots in the cloud via a web UI, control their schedules and start/stop times, and view bot outputs
- Kubernetes support to run thousands of bots in parallel
- Scheduling of scraping tasks at specific times or intervals
- WhatsApp/Email alerts
- An API to interface with Gmail and Outlook accounts
- MySQL/PostgreSQL integration
- Integrated captcha solving
- And many more :)
Developers are actively using Botasaurus 4 in production environments and saving hours of development time. To get access to Botasaurus 4, please reach out to us and let us know which feature you would like to access.
Conclusion
Botasaurus is a powerful, flexible tool for web scraping.
Its various settings allow you to tailor the scraping process to your specific needs, improving both efficiency and convenience. Whether you're dealing with multiple data points, requiring parallel processing, or need to cache results, Botasaurus provides the features to streamline your scraping tasks.
❓ Need More Help or Have Additional Questions?
If you need guidance on your web scraping project or have questions, message us on WhatsApp and we'll be happy to help you out.
Thanks
- Kudos to the Apify Team for creating the proxy-chain library. The implementation of SSL-based proxy authentication wouldn't be possible without their groundbreaking work on proxy-chain.
- A special thanks to the Selenium team for creating Selenium, an invaluable tool in our toolkit.
- Thanks to the creators of the cloudscraper library, which serves as the backbone behind our request Module.
- Finally, a big thank you to you for choosing Botasaurus.
Love It? Star It! ⭐
Become one of our amazing stargazers by giving us a star ⭐ on GitHub!
It's just one click, but it means the world to me.
Made with ❤️ in Bharat 🇮🇳 - Vande Mataram