bs4-web-scraper

A web scraper based on the BeautifulSoup4 library and translators package.

Project description

bs4_web_scraper

A web scraper based on the BeautifulSoup4 library with translation capabilities.

Dependencies

Python 3.11
beautifulsoup4
translators
requests
lxml
html5lib
pyyaml
toml

Setup

Make sure Python 3 is installed on your local machine.
Clone the repository into your project directory.
Change directory into the repository, "bs4_web_scraper", and run pip install -r requirements.txt. If the installation succeeds, you're ready to go.
For a quick demo, run example.py.

Features

Web scraping
Translation
Saving scraped data to a file
Downloading data from a web page or URL
Logging the scraping process

Usage

Before using the scraper, make sure you have an internet connection. The scraper uses the internet to scrape web pages and translate scraped data.

Importing the module

from bs4_web_scraper.scraper import BS4WebScraper

Creating a scraper object

The following example shows how to instantiate the scraper and customize its settings. If no parameters are passed, the default settings are used.

# Here, the scraper object is created with the default settings
bs4_scraper = BS4WebScraper()

# To customize the scraper's settings, pass a dictionary of the preferred instantiation parameters to the scraper object.
params = {
    "parser": "html.parser",
    "markup_filename": "base.html",
    "log_filepath": "./scrape_log/log.txt",
    "no_of_requests_before_pause": 30, # This should not exceed 50 to avoid high frequency requests. The upper limit is 100
    "scrape_session_pause_duration": 20, # pause duration in seconds. It is advisable to leave this at its default, "auto".
    "max_no_of_retries": 5,
    "base_storage_dir": "./scraped_data",
    "storage_path": ".",
    "translation_engine": "bing",
}

bs4_scraper = BS4WebScraper(**params)
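
To make the two pacing settings above more concrete, here is a minimal sketch of the general idea behind pausing after a fixed number of requests. It is not the package's implementation: RequestPacer, before_request and the injectable sleep argument are hypothetical names used only for illustration.

```python
import time
from typing import Callable


class RequestPacer:
    """Pause after every `requests_before_pause` requests (hypothetical helper)."""

    def __init__(
        self,
        requests_before_pause: int = 30,
        pause_seconds: float = 20.0,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if requests_before_pause < 1:
            raise ValueError("requests_before_pause must be at least 1")
        self.requests_before_pause = requests_before_pause
        self.pause_seconds = pause_seconds
        self._sleep = sleep  # injectable so the pacing can be tested without waiting
        self._count = 0

    def before_request(self) -> None:
        """Call once before each request; sleeps when a full batch has been sent."""
        if self._count and self._count % self.requests_before_pause == 0:
            self._sleep(self.pause_seconds)
        self._count += 1


# Usage: record the pauses instead of actually sleeping.
pauses = []
pacer = RequestPacer(requests_before_pause=3, pause_seconds=20, sleep=pauses.append)
for _ in range(7):
    pacer.before_request()  # a real scraper would send its request here
print(pauses)  # [20, 20]
```

With a batch size of 3 and 7 requests, the pacer pauses twice: before the 4th and before the 7th request.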

Instantiation parameters

To read more about the instantiation parameters and class attributes, run the following command:

>>> print(BS4WebScraper.__doc__)

Scraping a web page

Most web scraping tasks can be done using the scrape method. Below is an example of how to scrape a website.

# Scraping google.com
bs4_scraper.scrape(url="https://www.google.com", scrape_depth=0)

In the example above, scrape_depth is set to 0, so the scraper scrapes only the web page at the given URL. With scrape_depth set to 1, it also scrapes every page linked from that page. With scrape_depth set to 2, it additionally scrapes every page linked from those pages, and so on for higher values.
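
The effect of scrape_depth can be illustrated with a small, self-contained sketch that walks an in-memory link graph instead of the real web. This is only a model of the behaviour described above, assuming a breadth-first traversal that visits each page once; pages_within_depth and the site dictionary are hypothetical and not part of the package.

```python
from collections import deque
from typing import Dict, List


def pages_within_depth(links: Dict[str, List[str]], start: str, scrape_depth: int) -> List[str]:
    """Return pages reachable from `start` by following at most `scrape_depth` links."""
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == scrape_depth:
            continue  # do not follow links beyond the requested depth
        for linked in links.get(page, []):
            if linked not in seen:
                seen.add(linked)
                order.append(linked)
                queue.append((linked, depth + 1))
    return order


# Hypothetical site: "/" links to "/a" and "/b"; "/a" links to "/c".
site = {"/": ["/a", "/b"], "/a": ["/c"], "/b": [], "/c": []}
print(pages_within_depth(site, "/", 0))  # ['/']
print(pages_within_depth(site, "/", 1))  # ['/', '/a', '/b']
print(pages_within_depth(site, "/", 2))  # ['/', '/a', '/b', '/c']
```

The number of pages can grow quickly with each extra level, so it is usually worth starting with a small scrape_depth.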

Translating scraped data

To translate the scraped data, set the translate_to parameter of the scrape method to the target language. Translation is done by the translation engine specified in the instantiation parameters; the default is "google" (Google Translate). To change the engine, set the translation_engine parameter at instantiation or the translation_engine attribute afterwards. The following example shows how to translate the scraped data to French using the Bing translation engine.

# Select the Bing translation engine during instantiation
bs4_scraper = BS4WebScraper(..., translation_engine="bing")
# or after instantiation
bs4_scraper.translation_engine = "bing"

# Scrape the page and translate the scraped data to French
bs4_scraper.scrape(url="https://www.google.com", scrape_depth=0, translate_to="fr")

To get a list of available translation engines you can use, do the following:

from bs4_web_scraper import translate

# INTERNET CONNECTION REQUIRED
print(translate.translation_engines)

Translation of the web pages is done immediately after scraping. The translate_to parameter is set to "fr" in the above example. This means that the scraped data will be translated to French. The translate_to parameter can be set to any of the languages supported by the scraper's translation engine.

To get a list of the languages supported by the scraper's translation engine, do:

print(bs4_scraper.translator.supported_languages)

Scraping websites or pages that require authentication

To scrape websites or pages that require authentication, pass the credentials parameter to the scrape method. The following example shows how to scrape a web page that requires authentication.

# Scraping a web page that requires authentication
credentials = {
    'auth_url': 'https://www.websitewithauth.com/login_path/',
    'auth_username_field': 'usernamefieldname',
    'auth_password_field': 'passwordfieldname',
    'auth_username': 'yourusername',
    'auth_password': 'yourpassword',
}

bs4_scraper.scrape(url="https://www.websitewithauth.com", scrape_depth=0, credentials=credentials)

You can also authenticate the scraper before scraping by passing the credentials parameter to the authenticate method. The following example shows how to authenticate the scraper before scraping.

# Authenticating the scraper before scraping
bs4_scraper.authenticate(credentials=credentials)
bs4_scraper.scrape(url="https://www.websitewithauth.com", scrape_depth=0)

# or in the case of downloading data from a web page that requires authentication
bs4_scraper.authenticate(credentials=credentials)
bs4_scraper.download_url(url="https://www.websitewithauth.com/download/example.mp4", save_as="example.mp4", save_to="downloads")

# run help(bs4_scraper.download_url) for more information on the download_url method

NOTE: credentials should always take the form of a dictionary with the following keys: auth_url, auth_username_field, auth_password_field, auth_username, auth_password.

To get a quick template for the credentials dictionary, do:

import bs4_web_scraper

print(bs4_web_scraper.credentials_template)
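
Because the credentials dictionary must contain all five keys listed in the note above, it can help to check it before starting a scrape. The sketch below does that and also shows how such a dictionary could map onto a conventional form-based login request. build_login_request is a hypothetical helper, and the assumption that the field names become the form keys is an illustration only; the package's own authentication logic may differ.

```python
from typing import Dict, Tuple

# The five keys named in the note on the credentials dictionary.
REQUIRED_KEYS = (
    "auth_url",
    "auth_username_field",
    "auth_password_field",
    "auth_username",
    "auth_password",
)


def build_login_request(credentials: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Validate `credentials` and return (login URL, form data). Hypothetical helper."""
    missing = [key for key in REQUIRED_KEYS if key not in credentials]
    if missing:
        raise KeyError(f"credentials is missing: {', '.join(missing)}")
    # Assumption: the login form expects the two field names as its keys.
    form_data = {
        credentials["auth_username_field"]: credentials["auth_username"],
        credentials["auth_password_field"]: credentials["auth_password"],
    }
    return credentials["auth_url"], form_data


example_credentials = {
    "auth_url": "https://www.websitewithauth.com/login_path/",
    "auth_username_field": "usernamefieldname",
    "auth_password_field": "passwordfieldname",
    "auth_username": "yourusername",
    "auth_password": "yourpassword",
}
login_url, form_data = build_login_request(example_credentials)
print(login_url)
print(form_data)
```

A dictionary with a missing key raises a KeyError that names the missing entries, which is easier to debug than a failed login.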

Other useful methods

The following are some useful methods for scraping web data using the scraper class.

download_url
download_urls
find_all
find_all_tags
find_links
find_stylesheets
find_scripts
find_videos
find_images
find_audios
find_fonts
find_pattern
find_emails
find_phone_numbers

For information on how to use these methods, do:

>>> help(bs4_scraper.<method_name>)
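
As a rough idea of what a pattern-based finder such as find_emails might do, the sketch below extracts email-like strings from a piece of markup with a regular expression. It uses only the standard library; extract_emails and EMAIL_PATTERN are hypothetical names, and the pattern is deliberately simple. The package's own methods may take different arguments and behave differently, so check help() for the actual signatures.

```python
import re
from typing import List

# Deliberately simple pattern; it does not cover every valid address.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(markup: str) -> List[str]:
    """Return unique email-like strings in order of first appearance."""
    found: List[str] = []
    for match in EMAIL_PATTERN.findall(markup):
        if match not in found:
            found.append(match)
    return found


html = '<p>Write to <a href="mailto:info@example.com">info@example.com</a> or sales@example.org.</p>'
print(extract_emails(html))  # ['info@example.com', 'sales@example.org']
```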

Other utility classes included in the module

Translator
FileHandler
Logger
RequestLimitSetting

For information on how to use these classes, do:

from bs4_web_scraper.<module_name> import <class_name>

>>> help(<class_name>)

Credits

Contributions and feedback are welcome. For feedback, please open an issue or contact me at tioluwa.dev@gmail.com or on Twitter @ti_oluwa_.

To contribute, please fork the repo and submit a pull request.

If you find this module useful, please consider giving it a star. Thanks!

Project details

Release history

0.2.2

Nov 2, 2023

0.2.1

Nov 2, 2023

0.2.0

Sep 20, 2023

0.1.9

Sep 9, 2023

0.1.8

Sep 9, 2023

0.1.7

Sep 8, 2023

0.1.6

Sep 5, 2023

0.1.5

Sep 5, 2023

0.1.4

Jul 3, 2023

0.1.3

Jun 4, 2023

0.1.2

May 28, 2023

0.1.1

May 27, 2023

0.0.2a0 pre-release

May 27, 2023

This version

0.0.1a0 pre-release

May 26, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bs4_web_scraper-0.0.1a0.tar.gz (35.9 kB)

Uploaded May 26, 2023 Source

Built Distribution

bs4_web_scraper-0.0.1a0-py3-none-any.whl (37.8 kB)

Uploaded May 26, 2023 Python 3

Hashes for bs4_web_scraper-0.0.1a0.tar.gz
Algorithm	Hash digest
SHA256	`279ba4db852f11688b52fffa209e5c0084dc284aec57c35e6b3ea8368bf92c6f`
MD5	`997d7b585661ba05211c315de5ddec1e`
BLAKE2b-256	`81b007663f2d3692abfb5b26dd11b1ec847b7c100371be1c7494cc932e27da09`

Hashes for bs4_web_scraper-0.0.1a0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e149bad5735786f5ac668f181cf86d4b1c96d94e546f676f25a0a0219d4da3ca`
MD5	`b6e5c19b6bc9dc4fb40d46a215a1e91d`
BLAKE2b-256	`9d7d2a633bfce68df85582e093bbe81bd2220670941b643ee1af1d501ea19b8b`